LWL-prints: Collocates
Overview
Teaching: 15 min
Exercises: 10 min
Questions
What are collocates?
How do collocates work in AntConc?
What conclusions can be reached based on collocation data?
Objectives
Learn how collocates work in AntConc
This is a development version of the training module based on the LWL_prints-data.txt dataset (a single corpus of around 37k words). For information on how we processed the .txt files in LWL_prints-data.txt for use in AntConc, see *CatalogueLegacies/transmission: Analysis of transmission from BMSat to LWL*.
Introducing Collocates
The collocates of a word are those words that tend to occur in proximity to that word more than they occur in proximity to all other words in the corpus. The idea of collocation is implemented using a variety of different statistics to determine the co-occurrence of words.
Click on the Collocates tab, enter the string “behind” in the search box, ensure that Words is ticked and Case is unticked, change Sort by to Sort by Freq, then press Start. After a little time an error will pop up.
AntConc tab interdependencies
- Outputs in the Collocates tab are based on data generated by the Word List tab, so both need to have the same Tool Preferences for the Collocates tab to work properly.
- In this case, go to Tool Preferences, untick Treat all data as lowercase, hit Apply, and then hit Start again.
- The moral here, then, is that in AntConc any search needs to be undertaken with care. For Collocates in particular, you need to know exactly what you’ve searched for in order to read the statistical output.
AntConc then presents a slightly confusing screen. It contains the following information:
- A ranking of words by frequency, specifically - using the default Collocates settings - those words that occur within five words either side of the word “behind”. The proximity to the searched-for word can be changed in the bottom right of the screen, more on which shortly.
- The frequency of the word, broken down into two columns: one by frequency to the left of the word “behind”, and one by frequency to the right of the word “behind”.
- A Stat value (more on which shortly).
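The left and right frequency columns described above can be approximated outside AntConc. The following is a minimal Python sketch, not AntConc's own implementation: the function name `collocate_counts`, the simple letters-and-apostrophes tokenisation, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter

def collocate_counts(text, node, left=5, right=5, ignore_case=True):
    """Count words within `left`/`right` tokens of each occurrence of `node`.

    Tokenisation here is a simple assumption (runs of letters and
    apostrophes); AntConc's own token definition may differ.
    """
    tokens = re.findall(r"[A-Za-z']+", text)
    if ignore_case:
        tokens = [t.lower() for t in tokens]
        node = node.lower()
    left_counts, right_counts = Counter(), Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        # Words in the window before the node count towards the left column.
        left_counts.update(tokens[max(0, i - left):i])
        # Words in the window after the node count towards the right column.
        right_counts.update(tokens[i + 1:i + 1 + right])
    return left_counts, right_counts

# Invented sample text, not taken from LWL_prints-data.txt.
sample = "A man stands behind her. Behind him a woman is standing behind a chair."
left_counts, right_counts = collocate_counts(sample, "behind", left=1, right=1)
for word in sorted(set(left_counts) | set(right_counts)):
    print(word, left_counts[word], right_counts[word])
```

Setting `left` and `right` to 1 mirrors a 1L/1R window, and the defaults of 5 mirror the five-words-either-side setting described above.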
Reading Collocates
Browsing this we can start to make some observations, building on similar themes from episodes thirteen and fourteen. We see many common words (“the”, “of”, “and”, “a”). We see that people (“his”, “her”, “him”, “them”, “man”, “woman”) and actions (“stands”, “standing”) are related to this spatial term. And we see a long tail of vocabulary: 386 of the 519 unique words (or about 3 in 4) occur only once in proximity to the word “behind”.
Now edit the From.. and To.. settings to 1L and 1R respectively, and hit Start again. A few things stand out:
- Some high frequency words have zero or very small frequencies on one or both sides of “behind”;
- Some words have jumped up the list (“him” from 11 to 2, “her” from 8 to 3, “swung” from 81 to 18);
- Common words (“the”, “a”, “on”, “in”) have Stat values of two or lower.
This output tells us something about both language use and the subject of the cataloguing.
- There are a lot of people and things (scroll down) described in relation to their relative depth.
- The high frequency of ‘standing behind’ compared with ‘Standing behind’ suggests that the relation between things doesn’t cross sentences. If we click on “standing” we go - once again - to the Concordance tab to see examples of this. And whilst the small number of examples (5) is not statistically significant, this gives us a way into the style of the cataloguer(s) and reminds us of the value of retaining capitalization.
- At a more trivial level, the absence of “his behind” and “her behind” tells us that posteriors are not described by the cataloguer.
- We can also start to make inferences about the high Stat values for words that have jumped up the list (“him”, “swung”, “phaeton”), as well as the use of proper names and locations, though to do this properly we need to know more about what the Stat value means. As a rule, collocation statistics should be read with caution.
Reading stat values
Task 1: Taking the word “man” as an example, what might the stat value signify?
- Note: to solve this problem, start by going to the Collocates tab for the string “towards” (Words ticked, Case unticked, From.. and To.. settings to 1L and 1R respectively) and observe the Stat column. Note that the values around 2 and below (this can even go into the negative!) are words like “a”, “on”, “the”; words that we know are common in the corpus. Note also that the higher Stat values are for those words we’ve not really seen before (in the Word List tab “obese”, “elderly”, and “short” are 16th, 17th, and 24th respectively).
Solution
- The Stat value signifies the unusually high or low occurrence of words near the target word, compared to the occurrence of those words in the corpus as a whole. So, there are fewer occurrences of “a” 1L/1R of “man” than we would expect given the frequency of “a” in the corpus, and a greater number of occurrences of “obese”, “thin” or “elderly” 1L/1R of “man” than we would expect given the frequency of “obese”, “thin” and “elderly” in the corpus.
- Note: by default, the Stat column records a ‘Mutual Information’ score, which is a measure of the probability that the collocate and key word occur near to each other, relative to how many times they each occur in total.
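To make the idea behind the note above more concrete, here is a hedged Python sketch of one common textbook formulation of a Mutual Information collocation score: the base-2 logarithm of the observed co-occurrence count divided by the count expected if the two words were unrelated. This formulation is an assumption for illustration; AntConc's exact calculation may differ in detail. The function name `mutual_information` and all the counts below are hypothetical, apart from the corpus size of roughly 37,000 words.

```python
import math

def mutual_information(joint, node_freq, collocate_freq, corpus_size, span=2):
    """Log2 ratio of observed to expected co-occurrences.

    joint          -- times the collocate occurs inside the window
    node_freq      -- total occurrences of the search word
    collocate_freq -- total occurrences of the collocate
    corpus_size    -- total number of tokens in the corpus
    span           -- window size in tokens (2 for a 1L/1R window)
    """
    # Co-occurrences expected by chance, given the two overall frequencies.
    expected = node_freq * collocate_freq * span / corpus_size
    return math.log2(joint / expected)

# Hypothetical counts: a very common word seen slightly less often than chance.
common = mutual_information(joint=30, node_freq=400, collocate_freq=1500, corpus_size=37000)
# Hypothetical counts: a rare word seen far more often than chance.
rare = mutual_information(joint=10, node_freq=400, collocate_freq=20, corpus_size=37000)
print(round(common, 2), round(rare, 2))
```

Under this formulation a score of 0 means the collocate appears exactly as often as chance would predict, negative scores mean less often, and positive scores mean more often, which matches the pattern of low or negative values for common words and high values for distinctive ones.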
Now we have a better sense of what Stat is doing, change Sort by to Sort by Stat and hit Start. Most of the top 50 or so (of 161) ranked words are now those that occur only once or twice 1L/1R of “towards”, and that, as a result, have high Stat scores. This isn’t very useful. To work more effectively with Sort by Stat, change the Min. Collocate Frequency field to “5” and hit Start. We now have sensible results - “obese”, “thin”, and “elderly” pop towards the top of the list, with action words a little further down - and by browsing the list we can continue to make inferences about both the language used in cataloguing and the subject of that cataloguing:
- Where verbs appear, they are in present tense form.
- There are a variety of modifiers that describe physical appearance.
- “man with a” and “man in a” are alternative forms of describing appearance or dress (click on “with” and “in” to see this).
- Men often appear at the start of sentences (browse “A”).
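The effect of the Min. Collocate Frequency setting on a Stat-sorted list, as described above, can be illustrated with a short Python sketch. It assumes a simple log-ratio Mutual Information score (AntConc's exact formula may differ), and the function name `rank_by_stat`, the placeholder words, and every count are invented for illustration rather than taken from the dataset.

```python
import math

def rank_by_stat(rows, node_freq, corpus_size, span=2, min_freq=1):
    """Rank (word, joint_count, corpus_count) rows by an MI-style score.

    Rows whose joint count is below `min_freq` are dropped first,
    mirroring a minimum collocate frequency threshold.
    """
    ranked = []
    for word, joint, word_freq in rows:
        if joint < min_freq:
            continue
        expected = node_freq * word_freq * span / corpus_size
        ranked.append((word, joint, math.log2(joint / expected)))
    # Highest score first, as when sorting by Stat.
    return sorted(ranked, key=lambda row: row[2], reverse=True)

# Hypothetical rows: two one-off words, one recurring modifier, one common word.
rows = [("rare_a", 1, 1), ("rare_b", 1, 2), ("modifier", 8, 20), ("the", 40, 2500)]

print([w for w, _, _ in rank_by_stat(rows, node_freq=400, corpus_size=37000)])
print([w for w, _, _ in rank_by_stat(rows, node_freq=400, corpus_size=37000, min_freq=5)])
```

With no threshold the one-off words outrank everything else, because a single co-occurrence of a very rare word is far more than chance would predict. Raising the threshold removes them and leaves the recurring modifier at the top.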
From collocation to curatorial voice
The Collocates tab enables you to create a statistical overview of a corpus. But small changes to the variables in the Collocates tab can significantly change the statistics that are produced. The tool then needs to be used with caution.
Task 2: How regularised are the descriptions of clothing, accessories, and body adornments?
- Note: to solve this problem, start by searching in the Collocates tab for wear|wears|wearing|wore (with Words ticked, Case unticked). You may need to adjust the other settings to capture the ways that words for clothing (e.g. “hat”) are used in proximity to the verb “to wear”.
Solution
- There is no one way of examining this problem. One approach is to edit your From.. and To.. settings to 4L and 4R respectively, choose the Sort by Freq option, and set the Min. Collocate Frequency field to “2”.
- Hit Start. Note that even at this low minimum collocate frequency, only generalised language is present: we see “hats”, “caps”, “wigs” and “coats”, but no regular use of modifiers or specialised language. This indicates that general terminology rather than sartorial precision was important to the cataloguer.
- In terms of modifiers, whilst of low frequency, they are about size (“miniature” (48), “enormous” (55)) and materials (“silk” (73), “plaid” (84)), or on occasion evaluative and positional: to whom is an item of clothing “elaborate” (108)?
- Note: cases like the latter suggest a use case for corpus linguistics in the review of catalogue data, because tools like those in AntConc’s Collocates tab (paired perhaps with the Concordance tab) can indicate historically specific cataloguing choices that may have implications for the contemporary user experience of catalogue data.
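The search across several forms of “to wear” can also be sketched in Python using a regular expression with alternation. This is an illustrative approximation rather than a reproduction of AntConc's behaviour: the function name `collocates_of_pattern`, the tokenisation, and the sample text are assumptions, and the sample sentences are invented, not drawn from LWL_prints-data.txt.

```python
import re
from collections import Counter

# The same alternation used in the search box above.
WEAR = re.compile(r"wear|wears|wearing|wore")

def collocates_of_pattern(text, pattern, span=4, min_freq=2):
    """Count words within `span` tokens of any token fully matching `pattern`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if pattern.fullmatch(tok):
            counts.update(tokens[max(0, i - span):i])       # left window
            counts.update(tokens[i + 1:i + 1 + span])       # right window
    # Keep only collocates at or above the minimum frequency, most frequent first.
    return [(word, n) for word, n in counts.most_common() if n >= min_freq]

# Invented sample text for illustration only.
sample = ("A man wearing a hat bows. A woman wears a large hat. "
          "He wore a plaid coat and she is wearing a silk cap.")

for word, n in collocates_of_pattern(sample, WEAR, span=4, min_freq=2):
    print(word, n)
```

Even in this toy example the general term “hat” clears the threshold while the one-off modifiers “plaid” and “silk” do not, which is the same shape of result that the solution above describes.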
Finding archaic language
AntConc can support cataloguers looking to find archaic and problematic language in their catalogues without needing to first build a list of vocabulary to look for. This can be achieved by browsing word lists, though depending on the size of your catalogue data, that could prove an extremely time-consuming approach. An alternative would be to use the Collocates tab. We describe a potential iterative process below:
- Start by identifying a word to search around, perhaps a verb. So as to ensure that AntConc doesn’t hang for long periods with large corpora, choose a verb form roughly 20 times less frequent than the most frequent word. In the case of the LWL_prints-data.txt catalogue data, size is less of an issue; the word “of” (n=1089) is an ideal example.
- Next, edit your From.. and To.. settings to 5L and 5R, choose the Sort by Stat option, and set the Min. Collocate Frequency field to “5”. Type “standing” into the search box and hit Start.
- Note that whilst this approach may save you time browsing an alphabetically sorted wordlist, it becomes a gradually more computationally expensive approach (though by no means as concerning as the environmental impact of large language models, AI or digital preservation). This means that with large datasets AntConc may take a number of minutes to return results. We recommend that you run this query on a separate device or at a time when you don’t need your main computer for other computationally intensive jobs (like a video call).
- Browse the outputs for archaic vocabulary. At this level of minimum collocate frequency you are unlikely to find many examples, but may find clicking on words to read a Concordance useful (e.g. does the vocabulary around the word ‘her’ indicate a tendency to describe women in relation to their attractiveness?)
- Gradually reduce the value in the Min. Collocate Frequency field (e.g. to “3” in your next iteration) and expand the From.. and To.. settings (e.g. to 10L and 10R), iterating through a number of outputs until you either a) start finding results to focus on, or b) slow AntConc to the point that it is proving inefficient.
- Note that as you iterate through, because we have set Sort by to Sort by Stat, new results should pop closer to the top of the output, especially as you reduce the value in the Min. Collocate Frequency field closer to 1. We may, for example, ask who are the people “grinning” and any racial connotations thereof.
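The iterative process above (widen the window, lower the minimum frequency, look at what newly surfaces) can be sketched as a small Python loop. This is a rough illustration under stated assumptions, not a substitute for AntConc: the function names `collocates_by_stat` and `iterate_settings`, the log-ratio score, the tokenisation, and the sample text are all hypothetical.

```python
import math
import re
from collections import Counter

def collocates_by_stat(tokens, node, span, min_freq):
    """Return {collocate: MI-style score} for words within `span` tokens of `node`."""
    totals = Counter(tokens)
    near = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            near.update(tokens[max(0, i - span):i])
            near.update(tokens[i + 1:i + 1 + span])
    scores = {}
    for word, joint in near.items():
        if joint >= min_freq:
            # Expected count over a window of `span` tokens on each side.
            expected = totals[node] * totals[word] * 2 * span / len(tokens)
            scores[word] = math.log2(joint / expected)
    return scores

def iterate_settings(text, node, settings):
    """For each (span, min_freq) pair, list collocates not seen in earlier rounds."""
    tokens = re.findall(r"[a-z']+", text.lower())
    seen, rounds = set(), []
    for span, min_freq in settings:
        scores = collocates_by_stat(tokens, node, span, min_freq)
        new_words = sorted((w for w in scores if w not in seen),
                           key=scores.get, reverse=True)
        rounds.append((span, min_freq, new_words))
        seen.update(new_words)
    return rounds

# Invented sample text; each round loosens the settings a little.
sample = "a man standing near a woman standing by a door and a boy standing by a cart"
for span, min_freq, new_words in iterate_settings(sample, "standing", [(1, 2), (2, 1)]):
    print(span, min_freq, new_words)
```

Reporting only the words that are new in each round keeps the reading load manageable as the settings loosen, which is the practical point of iterating rather than starting with the widest window and a minimum frequency of 1.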
Key Points
The collocates of a word are those words that tend to occur in proximity to that word more than they occur in proximity to all other words in the corpus