Overview
Teaching: 10 min
Exercises: 10 min
Questions
What are collocates?
How do collocates work in AntConc?
What conclusions can be reached based on collocation data?
Objectives
Learn how collocates work in AntConc
This is an archived version of the training module based on the BM-MDG.zip dataset (a single corpus of around 1.2 million words separated into parts containing roughly 100,000 words each). For information on how we processed the .txt files in BM-MDG.zip for use in AntConc, see *Creation of the BMSatire Descriptions corpus*.
The collocates of a word are those words that tend to occur in proximity to that word more than they occur in proximity to all other words in the corpus. The idea of collocation is implemented using a variety of different statistics to determine the co-occurrence of words.
Click on the Collocates tab, enter the string “behind” in the search box, ensure that Words is ticked and Case is unticked, change Sort by to Sort by Freq, then press Start. After a little time an error will pop up.
AntConc tab interdependencies
- Outputs in the Collocates tab are based on data generated by the Word List tab, so both need to have the same Tool Preferences for the Collocates tab to work properly.
- In this case, go to Tool Preferences, untick Treat all data as lowercase, hit Apply, and then hit Start again.
- The moral here is that any search in AntConc needs to be undertaken with care. For Collocates in particular, you need to know exactly what you’ve searched for in order to read the statistical output.
AntConc then presents a slightly confusing screen. It contains the following information:
- A ranking of words by frequency: using the default Collocates settings, these are the words that occur within five words either side of the word “behind”. The proximity to the searched-for word can be changed in the bottom right of the screen, more on which shortly.
- The frequency of the word, broken down into two columns: one by frequency to the left of the word “behind”, and one by frequency to the right of the word “behind”.
- A Stat value (more on which shortly).
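The layout of this screen can be approximated in a few lines of Python. The sketch below is only an illustration of the idea, not AntConc's own implementation: the function name `left_right_counts` and the sample sentence are invented for the example, and tokens are simply split on whitespace.

```python
from collections import Counter

def left_right_counts(tokens, node, span=5):
    """Count words within `span` tokens to the left and right of each `node`."""
    left, right = Counter(), Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            left.update(tokens[max(0, i - span):i])     # words before the node
            right.update(tokens[i + 1:i + 1 + span])    # words after the node
    return left, right

# Invented sample text (lower-cased by hand), with a window of one word each side
tokens = "behind are flames and his hands are behind him".split()
left, right = left_right_counts(tokens, "behind", span=1)

# Print a small table: word, total frequency, frequency left, frequency right
totals = left + right
for word, total in totals.most_common():
    print(word, total, left[word], right[word])
```

Changing `span` here plays the same role as changing the window settings in the bottom right of the AntConc screen.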
Browsing this we can start to make some observations, building on similar themes from episodes five and six. We see many common words (“the”, “a”, “is”, “and”). We see that people (“him”, “his”, “her”) and actions (“stands”, “says”, “holds”) are related to this spatial term. And we see a long tail of vocabulary: 5239 of the 8560 unique words (or about 3 in 5) occur only once in proximity to the word “behind”.
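The "about 3 in 5" figure is a simple proportion, and the same calculation can be run on any table of collocate frequencies. In the sketch below, the helper `singleton_share` and the toy frequency table are invented for illustration; only the figures 5239 and 8560 come from the output described above.

```python
from collections import Counter

def singleton_share(counts):
    """Return (words seen once, unique words, share of words seen once)."""
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons, len(counts), singletons / len(counts)

# Toy frequency table (invented values) to show the helper in use
toy = Counter({"the": 40, "his": 12, "stands": 3, "ear": 1, "pen": 1, "flames": 1})
print(singleton_share(toy))

# The proportion reported in the text: 5239 of 8560 unique words
reported_share = 5239 / 8560
print(round(reported_share, 3))
```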
Now edit the From.. and To.. settings to 1L and 1R respectively, and hit Start again. A few things stand out:
- Some high frequency words have zero or very small frequencies on one or both sides of “behind”;
- Some words have jumped up the list (“them” from 19 to 5, “just” from 51 to 25, “immediately” from 44 to 26);
- There is one word in the top 15 words (“a”) that has a stat value below 1, and “with” (rank 68) has a negative stat value.
This output tells us something about both language use and the subject of the cataloguing.
- The absence of “his behind”, “your behind”, and “my behind” tells us that bums are not described as “behinds” (George tends to prefer “posterior”).
- The dominance of “behind are” over “are behind” suggests a preference for saying ‘where’ then ‘what’ in a sentence, rather than ‘what’ then ‘where’ then ‘what’ (that is, the relation between things doesn’t cross sentences). If we click on “are” we go, once again, to the Concordance tab to see examples of this: “Behind are a number of men”, “Behind are flames”, “close behind are eight other judges”, etc, are more common than “The cobbler and his wife are behind a stall” or “His hands are behind him”.
- We are also reminded of the value of retaining capitalization (look at how common it is to see punctuation before “Behind”).
- We can also start to make inferences about the high stat values for words that have jumped up the list (“them”, “just”, “immediately”), though to do this properly we need to know more about what the stat value means. As a rule, collocation statistics should be read with caution.
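The word-order and capitalization observations above can be checked outside AntConc by counting adjacent word pairs. The following is a minimal sketch, assuming whitespace tokenization with punctuation already separated; `bigram_counts` is a hypothetical helper, and the sample reuses three of the phrases quoted above rather than the real corpus.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count each pair of adjacent tokens."""
    return Counter(zip(tokens, tokens[1:]))

# Sample built from phrases quoted in the list above, punctuation split off by hand
sample = ("Behind are a number of men . Behind are flames . "
          "His hands are behind him .")
tokens = sample.split()

cased = bigram_counts(tokens)                          # capitalization retained
lowered = bigram_counts([t.lower() for t in tokens])   # capitalization discarded

print(cased[("Behind", "are")], cased[("are", "behind")])
print(cased[(".", "Behind")])          # punctuation directly before "Behind"
print(lowered[("behind", "are")])
```

With capitalization retained, sentence-initial “Behind” stays distinct from mid-sentence “behind”, which is the distinction the list above relies on.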
Reading stat values
Task 1: What might the stat value signify?
- Note: to solve this problem, start by going to the Collocates tab for the string “behind” (1L and 1R respectively) and observe the stat column. Note that the values around 0.5 and below (some of them even negative!) belong to words like “a”, “and”, “left”: words that we know are common in the corpus. Note also that the higher stat values are for those words that have jumped up the list since we moved from 5L/5R to 1L/1R (“them”, “just”, “immediately”).
- The stat value signifies the unusually high or low occurrence of words near the target word, compared to the occurrence of those words in the corpus as a whole. So, there are fewer occurrences of “a” 1L/1R of “behind” than we would expect given the frequency of “a” in the corpus, and a greater number of occurrences of “them” or “just” 1L/1R of “behind” than we would expect given the frequency of “them” and “just” in the corpus.
- Note: by default, the Stat column records a ‘Mutual Information’ score, which is a measure of the probability that the collocate and key word occur near to each other, relative to how many times they each occur in total.
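To make the idea of "more or less than expected" concrete, here is a minimal sketch of one common textbook formulation of a mutual information score: the base-2 logarithm of observed co-occurrences divided by the number expected if the two words were unrelated. This is an assumption for illustration; AntConc's exact calculation may differ (for example in how it accounts for window size), and all the frequencies below are invented rather than taken from the corpus.

```python
import math

def mutual_information(pair_freq, node_freq, coll_freq, corpus_size):
    """log2(observed / expected), with expected = node_freq * coll_freq / corpus_size."""
    if pair_freq == 0:
        return float("-inf")   # never seen together
    expected = node_freq * coll_freq / corpus_size
    return math.log2(pair_freq / expected)

# Invented figures: a corpus of 1,000,000 words in which the node occurs 5,000 times
corpus_size = 1_000_000
node_freq = 5_000

# A very common word seen next to the node less often than chance predicts
mi_common = mutual_information(100, node_freq, 40_000, corpus_size)

# A rarer word seen next to the node more often than chance predicts
mi_rare = mutual_information(16, node_freq, 400, corpus_size)

print(mi_common, mi_rare)
```

Under this formulation a score below zero means the pair occurs together less often than chance would predict, and a score above zero means more often, which matches the pattern described in the task.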
Now that we have a better sense of what Stat is doing, change Sort by to Sort by Stat and hit Start. All the top 250 or so ranked words are now those that occur only once or twice 1L/1R of “behind”, and that, as a result, have high stat scores. This isn’t very useful. To work more effectively with Sort by Stat, change the Min. Collocate Frequency field to “10” and hit Start. We now have sensible results: “immediately”, “them”, and “Just” pop to the top, and by browsing the list we can continue to make inferences about both the language used in cataloguing and the subject of that cataloguing:
- Verbs in present tense forms are prominent.
- Relative spatial arrangements (“behind her”, “stand behind”, “close behind”) are important features of the corpus.
- There are suggestions that the cataloguing used a relatively controlled vocabulary and phrasing: for example, if we click on the word “pen” almost all of the forty-two occurrences are for the phrase ‘a pen behind his ear’.
- Proper names (“Sheridan”, “Pitt”, “Napoleon”, “Wellington”) are frequent, tend to occur in proximity to spatial words like “behind”, and tend to appear before the word “behind”, suggesting they appear towards the front of the satirical prints described in the corpus.
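The effect of the Min. Collocate Frequency field can be pictured as a filter applied before sorting. The sketch below is an illustration under stated assumptions, not a description of AntConc's internals: `rank_by_stat` is a hypothetical helper, and the rows (word, frequency near the node, stat value) are invented.

```python
def rank_by_stat(rows, min_freq=1):
    """Keep rows whose frequency is at least `min_freq`, then sort by stat, highest first."""
    kept = [row for row in rows if row[1] >= min_freq]
    return sorted(kept, key=lambda row: row[2], reverse=True)

# Invented rows: (word, frequency near the node, stat value)
rows = [
    ("rareword", 1, 7.6),       # seen once, so its score is inflated
    ("immediately", 30, 4.2),
    ("them", 120, 3.1),
    ("a", 900, 0.4),
]

unfiltered = rank_by_stat(rows)               # dominated by the one-off word
filtered = rank_by_stat(rows, min_freq=10)    # one-off word removed

print([word for word, _, _ in unfiltered])
print([word for word, _, _ in filtered])
```

Words that occur only once or twice near the node can score very highly on this kind of statistic, which is why a frequency threshold makes the sorted list easier to read.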
From collocation to curatorial voice
The Collocates tab enables you to create a statistical overview of a corpus. But small changes to the variables in the Collocates tab can significantly change the statistics that are produced. The tool therefore needs to be used with caution.
Task 2: How regularised are the descriptions of clothing, accessories, and body adornments?
- Note: to solve this problem, start by searching in the Collocates tab (with Case unticked). You may need to adjust the other settings to capture the ways that words for clothing (e.g. “hat”) are used in proximity to the verb “to wear”.
- There is no one way of examining this problem. One approach is to edit your window settings so that the span extends to 4R, choose the Sort by Stat option, set the Min. Collocate Frequency field to “25”, and hit Start. Note that even at this frequency, some very specialised language is present: “biretta” (a type of hat worn by Roman Catholic clergy), “bicorne” (a military hat associated with Napoleon), and “rouges” (presumably with “bonnet” to refer to a type of French revolutionary hat). This indicates that precision was important to the cataloguer.
- In terms of controlled vocabulary there is one prominent example: 100 occurrences of “spectacles” in the output (48 “spectacles” and 52 “spectacles,”), compared with zero occurrences of “glasses”.
- Note: this suggests a use case for corpus linguistics in the review of catalogue data, because tools like those in AntConc’s Concordance tab can indicate historically specific cataloguing choices that may have implications for the contemporary user experience of catalogue data.
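The “spectacles” count above is split across two forms because trailing punctuation is kept as part of the word. A check of this kind can be sketched in Python by stripping punctuation before counting. This is a minimal illustration: `normalised_counts` is a hypothetical helper, the sample sentence is invented, and `string.punctuation` only covers ASCII punctuation, so curly quotes would need separate handling.

```python
import string
from collections import Counter

def normalised_counts(tokens):
    """Strip leading/trailing ASCII punctuation, lower-case, and count what remains."""
    stripped = (t.strip(string.punctuation).lower() for t in tokens)
    return Counter(t for t in stripped if t)   # drop tokens that were only punctuation

# Invented sample containing both "spectacles," and "spectacles"
tokens = "wearing spectacles, and a hat ; he wears spectacles and a bicorne .".split()
counts = normalised_counts(tokens)

print(counts["spectacles"], counts["glasses"])

# The two forms reported in the output above add up to the stated total
total_spectacles = 48 + 52
print(total_spectacles)
```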
Key Points
- The collocates of a word are those words that tend to occur in proximity to that word more than they occur in proximity to all other words in the corpus.