LWL-prints: Word lists

Overview

Teaching: 10 min
Exercises: 15 min
Questions
  • What are word lists in AntConc and when would you use it?

  • How do word lists work in AntConc?

  • How might I interpret word lists generated from catalogue data?

Objectives
  • Explain what word lists are in AntConc

  • Use word lists to start identifying the lingustic character of a corpus

This is a development version of the training module based on the LWL_prints-data.txt dataset (a single corpus of around 37k words). For information on how we processed the .txt files in LWL_prints-data.txt for use in AntConc, see *CatalogueLegacies/transmission: Analysis of transmission from BMSat to LWL*.

Word lists

A word list counts how many times each word occurs in the selected text(s). Generally, in a word list we expect the most frequent words to be function words, e.g. for English-language texts, words like “the”, “of”, “and”, “but”, etc. However, for texts that are restricted by topic, genre and/or text type - which we would expect descriptive text in catalogue data to be - then the order of these words can vary and there will also be domain-specific vocabulary.

Word lists then are a useful starting point for getting an overview of the lingustic features of a corpus. For example, if you have a corpus that includes minor variations in data values - e.g. names of people, organisations, places, classification terms - creating a word list can be an effective way of spotting those variants. In the case of a large catalogue, it can also be useful for finding linguistic indicators of divergent practices within the corpus, which might indicate that entries were written by different people at different times, or those that some entries contain text quoted from third-party sources.

Making a word list

To use the ‘Word List’ function in AntConc, click on the Word List tab and press Start.

Antconc then returns the following information:

You will note that AntConc has treated all text as lowercase. Whilst this can be useful, it means that the word “cook” and the family name “Cook” are treated as the same word type in our count. Case sensitivity is also useful when examining curatorial voice, because knowing where words are used in relation to punctuation is a tell for features like sentence structure and length.

In AntConc you need to change case sensitivity settings for each tool individually. Change this now by going to Tool Preferences, choosing Word List, and unticking Treat all data as lowercase, and then pressing Apply. Back in the Word List tab hit Start again to see the difference.

Correct word list tool preferences

What is a word?

Note here that depending on your dataset, an important setting is Token Definition under Global Settings (found in the top navbar). As we’ve seen, this defines what AntConc sees as a word, e.g. you need to specify that numbers, punctuation characters and symbols can be part of words in order for AntConc to see things like urls or other special characters that provide meaning (e.g. hashtags in tweets).

Interacting with a word list

There are a number of ways in which you can interact with your word list output:

First, turning to the Freq column you can select and highlight frequency values. If you click on an individual word, AntConc will move to the Concordance tab and plot a ‘concordance’ for that word: that is, a list that shows sentences that contain the word you clicked on. To test this choose a lower frequency word (ranked below 1000), click on it, and AntConc will move to the Concordance tab. We will look at concordances in the next episode. For now, move back to the Word List tab and observe that your results haven’t been lost.

Second, you can re-sort the output in the Word List tab using the options in the Sort by area. By default sorts in the Word List tab are set by rank, meaning that the most common word type is shown at the top, and the least common word type is shown at the bottom. The sort can be inverted by ticking Invert Order and pressing Start. Note that you are now presented with a long tail of infrequently used word types.

The sort can also be changed so that rather than sorting by the frequencty of words, the sort is made by the words themselves, listed in alphabetical order. To do this, untick Invert Order, select Sort by Word in the dropdown and hit Start. Browsing through an alphabetical sort can be a useful way of finding errors in the data (e.g. stray punctuation), spelling mistakes, variations in capitalisation, or - thinking of curatorial voice - different lemma forms of words.

Is AntConc thinking or has it crashed?

AntConc can often become non-functional. In most cases the software is processing your request rather than fallen over. For example, when looking at a word list it is ill-advised to click on a very frequent word as AntConc may take a while to process the concordance for that word. Of course, if a very frequent word is one your interested in, you’ll just have to be patient!

Saving your output

To save the output from your Word List go to the File menu in the navbar and select Save Output.

Good practice when saving outputs

All tabs in AntConc provide the option to save the output. This is vital for keeping a record of your findings. Note that the default output is a file called antconc_results.txt. As this contains no information about your corpus, your query, your settings, or what version AntConc you are using, you are encourged to note that information in the output file name each time you save an output. The sad truth is that if you use the default name regularly, you may be more likely to overwrite previous results!

Browsing through the downloaded output can be instructive. In this case, we see many places and names. And as we get below the 500 most common words, some vocabulary choices that might benefit from further investigation: 132 occurrences of the word ‘poorish’ (used it turns out to describe the condition of photographs), 72 occurences of ‘Reconnaissances,’ (almost certainly the name of a secondary source regularly cited in the entry), and 64 occurences of ‘Hon’ble’ (likely text transcribed from a caption).

Tasks

Having learnt about the Word List tab in AntConc, work in pairs or small groups on the following challenges.

Task 1: What % of all word tokens are accounted for by the 30 most common word types?

  • Use the Word List tab to count all word types and then use the output to make an estimate. Note: you can highlight and copy values, and you may need to do some calculation outside of AntConc.

Solution

  1. Remove any text from the search box, select Sort by Freq and hit Start.
  2. Observe the figure of 36791 word tokens. Select the frequency values of the 30 most common word types, paste them into a spreadsheet programme, and sum them. You should get 14882. Use these two figures to calculate a percentage: (14882/36791) x 100 = 40.45%
    • It is common in English language corpora to find that roughly half the corpus is accounted for by a small number of frequent words. This observation goes a long way to explaining why corpus linguists often present and work with lists of ‘top’ words (not that word lists are the only tool in the corpus linguists armoury, as we shall see!). In this case, the slightly low percentage of all word tokens accounted for by the 30 most common word types may indicate a greater volume of frequently used words.

Task 2: What might the variant uses of the verb “stand” infer about language use in the dataset as a whole?

  • Use the Word List tab to count all word types, then use the sorting options to help you navigate to variants of the verb “stand” and their use in context.

Solution

  1. Remove any text from the search box, select Sort by Word and hit Start.
  2. Browse to the string “stand”. There are three word varients of “stand” with frequencies as follows:
    • stand: 33 (sum of “stand”, “stand,”, and “stand.”)
    • standing: 37 (sum of “Standing”, “standing”, and “standing,”)
    • stands: 107
  3. An initial observation can be made that present tense forms of “stand” dominates and the verb has several different meanings.
  4. To explore the uses of the verb, click on “stands”, “stand” or “standing” and look at the output in the Concordance tab.
  5. The verb is used to position people and things (“Another man, standing to their right”) and to represent actions (“An older man stands in a dancing pose”). It is also occasionally used a noun for representing objects (“a barrell on a stand and a chamber pot”).
  6. We can infer - pending further investigation - that the curatorial descriptions in the corpus describe what is in front of the curator rather than, for example, describe the life of the object or its principle characters.

Task 3: Find the variants of the word “behind” and estimate their use relative to each other. Without looking at the concordances for these, discuss what hypotheses might explain this result.

  • Use the Word List tab to count all word types and then use the output to make an estimate.

Solution

  1. Remove any text from the search box, select Sort by Word and hit Start.
  2. Browse to the string “behind”. There are two word variants in the corpus: “behind” and “Behind”. There are 37 instances of “Behind” compared to 96 instances of “behind”. From these numbers we can infer a series of possible hypotheses about the corpus:
    • There are fewer instances of “Behind” being used at the start of a sentence than “behind”, and yet roughly 30% of total uses is high. Common frequency of capitalised variants of prepositions, like “Behind”, may be indicate something about sentence length (because frequent capitalisation of prepositions may be an indicator of short sentence length).
    • Out of 96 instances of non-capitalised “behind”, only 6 appear before some sort of punctuation, such as the end of a clause or a sentence. “behind.” seems particularly low, and clicking on it gives us an insight into a form “behind” rarely used to close a sentence (that may be more common in other corpora).
    • Positional/spatial words might be infrequently used at the end of sentences.

Key Points

  • Word lists are a way of getting an overview of the lingustic features of a corpus

  • AntConc provides a number of options for presenting word lists

  • When using AntConc to count things, we need to be mindful that machine readable strings are not the same as human readable words

  • Outputs from AntConc queries can be saved locally as text files