Collocates

Overview

Teaching: 15 min
Exercises: 10 min
Questions
  • What are collocates?

  • How do collocates work in AntConc?

  • What conclusions can be reached based on collocation data?

Objectives
  • Learn how collocates work in AntConc

Introducing Collocates

The collocates of a word are those words that tend to occur in proximity to that word more than they occur in proximity to all other words in the corpus. The idea of collocation is implemented using a variety of different statistics to determine the co-occurrence of words.
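To make the idea of "proximity" concrete, here is a minimal Python sketch of counting the words that fall within a window around a search word. It is an illustration only, not AntConc's implementation: the function name `collocate_counts`, the simple lowercase tokeniser, and the sample sentence are all assumptions made for this example.

```python
from collections import Counter
import re


def collocate_counts(text, node, left=5, right=5):
    """Count words found within `left`/`right` tokens of each `node` hit."""
    # Assumed tokeniser: lowercase runs of letters. AntConc's own token
    # definition is configurable and may differ.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        start = max(0, i - left)
        # Words to the left and right of the hit, excluding the hit itself.
        window = tokens[start:i] + tokens[i + 1:i + 1 + right]
        counts.update(window)
    return counts


# Invented sample text, not taken from the lesson's corpus.
sample = "A man standing behind a table. A woman seated behind the man."
counts = collocate_counts(sample, "behind", left=1, right=1)
print(counts.most_common())
```

Widening the window (for example `left=5, right=5`) pulls in more words per hit, which is why the window settings matter when reading the output.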

Click on the Collocates tab, enter the string “behind” in the search box, ensure that Words is ticked and Case is unticked, change Sort by to Sort by Freq, then press Start. After a little time an error will pop up.

AntConc tab interdependencies

  • Outputs in the Collocates tab are based on data generated by the Word List tab, so both need to have the same Tool Preferences for the Collocates tab to work properly.
  • In this case, go to Tool Preferences, untick Treat all data as lowercase, hit Apply, and then hit Start again.
  • The moral here then is that in AntConc any search needs to be undertaken with care. For Collocates in particular, you need to know exactly what you’ve searched for in order to read the statistical output.

AntConc then presents a slightly confusing screen. It contains the following information:

Reading Collocates

Browsing this we can start to make some observations, building on similar themes from episodes five and six. We see many common words (“the”, “of”, “and”, “a”). We see that men (“him”, “Sir”, “his”, “man”) and actions (“standing”, “seated”) are related to this spatial term. And we see a long tail of vocabulary: 855 of the 1136 unique words (or about 3 in 4) occur only once in proximity to the word “behind”.

Now edit the From.. and To.. settings to 1L and 1R respectively, and hit Start again. A few things stand out:

This output tells us something about both language use and the subject of the cataloguing.

Reading stat values

Task 1: Taking the word “towards” as an example, what might the stat value signify?

  • Note: to solve this problem, start by going to the Collocates tab for the string “towards” (Words ticked, Case unticked, From.. and To.. settings to 1L and 1R respectively) and observe the stat column. Note that the stat values around 2 and below (these can even go into the negative!) belong to words like “a”, “View”, “an”; words that we know are common in the corpus. Note also that the higher stat values are for those words we’ve not really seen before (in the Word List tab “river”, “valley”, and “mountains” are 98th, 336th, and 387th respectively).

Solution

  • The stat value signifies the unusually high or low occurrence of words near the target word, compared to the occurrence of those words in the corpus as a whole. So, there are fewer occurrences of “a” 1L/1R of “towards” than we would expect given the frequency of “a” in the corpus, and a greater number of occurrences of “river” or “valley” 1L/1R of “towards” than we would expect given the frequency of “river” and “valley” in the corpus.
    • Note: by default, the Stat column records a ‘Mutual Information’ score, which measures how much more (or less) often the collocate and key word occur near to each other than we would expect given how many times they each occur in total.
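One common way to write a Mutual Information score is as the base-2 logarithm of observed co-occurrences divided by the co-occurrences expected by chance. The sketch below assumes that formulation; AntConc's exact calculation may differ in detail. The function name `mutual_information` and all of the counts are invented for illustration and are not taken from the lesson's corpus.

```python
import math


def mutual_information(observed, node_freq, collocate_freq, corpus_size, span=2):
    """log2(observed / expected) for a collocate within `span` window slots.

    `span` is the total number of window positions (1L + 1R gives span=2).
    """
    # Co-occurrences expected if the two words were spread independently.
    expected = node_freq * collocate_freq * span / corpus_size
    return math.log2(observed / expected)


# Invented counts: a very common word seen less often than expected...
common = mutual_information(observed=4, node_freq=100,
                            collocate_freq=4000, corpus_size=100_000)
# ...and a rarer word seen far more often than expected.
rare = mutual_information(observed=16, node_freq=100,
                          collocate_freq=250, corpus_size=100_000)
print(common, rare)
```

Under these assumptions a score below zero means the pair co-occurs less often than chance would predict, and each additional point doubles the ratio of observed to expected co-occurrences.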

Now we have a better sense of what Stat is doing, change Sort by to Sort by Stat and hit Start. Most of the top 150 or so ranked words are now those that occur only once or twice 1L/1R of “towards”, and that - as a result - have high stat scores. This isn’t very useful. To work more effectively with Sort by Stat, change the Min. Collocate Frequency field to “10” and hit Start. We now have sensible results - “mountains” (in various forms, including errors) and “valley” pop towards the top of the list, with “river” a little further down, and by browsing the list we can continue to make inferences about both the language used in cataloguing and the subject of that cataloguing:

From collocation to curatorial voice

The Collocates tab enables you to create a statistical overview of a corpus. But small changes to the variables in the Collocates tab can significantly change the statistics that are produced. The tool then needs to be used with caution.
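As a rough illustration of that sensitivity, the sketch below ranks a handful of candidate collocates by a log2(observed / expected) score and then applies a minimum collocate frequency. The scoring formula is an assumed simplification rather than AntConc's exact method, and the words, counts, and helper names (`score`, `ranked`) are invented for this example.

```python
import math

# Invented figures: collocate -> (times seen near the node, corpus frequency).
CANDIDATES = {
    "valley": (12, 300),
    "a": (20, 8000),
    "precipice": (1, 1),  # occurs once in the corpus, next to the node
}
NODE_FREQ = 500
CORPUS_SIZE = 200_000
SPAN = 2  # 1L + 1R


def score(observed, collocate_freq):
    """Assumed score: log2 of observed over expected co-occurrences."""
    expected = NODE_FREQ * collocate_freq * SPAN / CORPUS_SIZE
    return math.log2(observed / expected)


def ranked(min_freq):
    """Collocates seen at least `min_freq` times, highest score first."""
    rows = [(word, round(score(obs, freq), 2))
            for word, (obs, freq) in CANDIDATES.items()
            if obs >= min_freq]
    return sorted(rows, key=lambda row: row[1], reverse=True)


print(ranked(min_freq=1))   # the one-off word tops the list
print(ranked(min_freq=10))  # the one-off word is filtered out
```

Changing a single setting reorders the list: with no minimum, the word seen only once ranks first, while a minimum of 10 leaves only the collocates with enough evidence behind them.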

Task 2: How regularised are the descriptions of clothing, accessories, and body adornments?

  • Note: to solve this problem, start by searching in the Collocates tab for wear|wears|wearing|wore (with Words ticked, Case unticked). You may need to adjust the other settings to capture the ways that words for clothing (e.g. “hat”) are used in proximity to the verb “to wear”.

Solution

  1. There is no one way of examining this problem. One approach is to edit your From.. and To.. settings to 4L and 4R respectively, choose the Sort by Stat option, and set the Min. Collocate Frequency field to “5”.
  2. Hit Start. Note that even at this low minimum collocate frequency, only generalised language is present: we see “hats”, “robes”, “turban” and “coat”, but no regular use of modifiers or specialised language. This indicates that general terminology rather than sartorial precision was important to the cataloguer.
  3. In terms of modifiers, whilst of low frequency, they are evaluative and positional: to whom is an item of clothing “ceremonial”, “traditional”, or “elaborate”?
    • Note: this suggests a use case for corpus linguistics in the review of catalogue data, because tools like those in AntConc’s Collocates tab (paired perhaps with the Concordance tab) can indicate historically specific cataloguing choices that may have implications for how contemporary users experience catalogue data.
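The search in Task 2 treats several forms of the verb as one search term. A rough Python analogue, assuming the `|` behaves like alternation in a regular expression and that only whole words count as hits, might look like the sketch below. The function name `collocates_of_pattern` and the three catalogue-style descriptions are invented for illustration.

```python
import re
from collections import Counter

# Any one of these forms counts as a hit on the search term.
NODE = re.compile(r"wear|wears|wearing|wore")


def collocates_of_pattern(descriptions, span=4):
    """Count words within `span` tokens either side of any matching form."""
    counts = Counter()
    for text in descriptions:
        # Assumed tokeniser: lowercase runs of letters.
        tokens = re.findall(r"[a-z]+", text.lower())
        for i, tok in enumerate(tokens):
            # fullmatch ensures words like "footwear" are not treated as hits.
            if NODE.fullmatch(tok):
                counts.update(tokens[max(0, i - span):i]
                              + tokens[i + 1:i + 1 + span])
    return counts


# Invented descriptions, not taken from the lesson's corpus.
records = [
    "Portrait of a man wearing a turban.",
    "Two women wearing ceremonial robes.",
    "A boy who wore a hat and coat.",
]
clothing = collocates_of_pattern(records, span=4)
print(clothing.most_common(5))
```

Pooling the verb forms in this way means that clothing words are counted together whichever form of “to wear” they appear beside.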

Finding archaic language

AntConc can support cataloguers looking to find archaic and problematic language in their catalogues without needing to first build a list of vocabulary to look for. This can be achieved by browsing word lists, though depending on the size of your catalogue data, that could prove an extremely time-consuming approach. An alternative would be to use the Collocates tab. We describe a potential iterative process below:

  1. Start by identifying a word to search around, perhaps a verb. So as to ensure that AntConc doesn’t hang for long periods, choose a verb form roughly 20 times less frequent than the most frequent word. In the case of the IAMS Photos catalogue data, the word “standing” (n=2362) is an ideal example.
  2. Next, edit your From.. and To.. settings to 5L and 5R, choose the Sort by Stat option, and set the Min. Collocate Frequency field to “10”. Type “standing” into the search box and hit Start.
    • Note that whilst this approach may save you time browsing an alphabetically sorted wordlist, it becomes a gradually more computationally expensive approach (though by no means as concerning as the environmental impact of large language models, AI or digital preservation). This means that AntConc may take a number of minutes to return results. We recommend that you run this query on a separate device or at a time when you don’t need your main computer for other computationally intensive jobs (like a video call).
  3. Browse the outputs for archaic vocabulary. At this level of minimum collocate frequency you are unlikely to find many examples, but may find clicking on words to read a Concordance useful (e.g. does the vocabulary around the word ‘Mrs’ indicate a tendency to describe women only in relation to their male relatives?).
  4. Gradually reduce the value in the Min. Collocate Frequency field (e.g. to “5” in your next iteration) and expand the From.. and To.. settings (e.g. to 10L and 10R), iterating through a number of outputs until either a) you start finding results to focus on, or b) you slow AntConc such that it is proving inefficient.
    • Note that as you iterate through, because we have set Sort by to Sort by Stat, new results should pop closer to the top of the output, especially as you reduce the value in the Min. Collocate Frequency field closer to 1. In the case of the IAMS Photos catalogue data, questions that may benefit from investigation include who the boy described as “grinning” is (and any racial connotations of that description), and the continued appropriateness of vocabulary such as “toddy” (for ‘toddy drawer’).

Key Points

  • The collocates of a word are those words that tend to occur in proximity to that word more than they occur in proximity to all other words in the corpus.