Measuring Patterns of Contextual Word Meaning over Time

Many researchers have a set of word uses (a corpus, as in Table 1) and want a reliable estimate of how these uses relate to each other semantically. Consider, for instance, a historical linguist trying to find out whether a word has changed its meaning over time. Similar problems occur in lexicography and digital humanities. Scanning each use manually is tedious and not guaranteed to be intersubjective: the protocols used to scan the corpus may vary between researchers (Kilgarriff, 1997, 2007), who are often biased towards a particular hypothesis they have in mind. The DURel Annotation Tool is designed to tackle these problems. It provides researchers with an online interface where they can upload word uses and annotate them following a well-established protocol for contextual word meaning annotation (Erk et al., 2013; Schlechtweg et al., 2020). The uploaded data can be assigned to registered annotators. The annotation can be stopped at any point; the annotated data can then be downloaded, and objective agreement measures, clustering statistics and change scores can be calculated.

Table 1: Sample of diachronic corpus, cf. Deane (1988, p. 347) and Blank (1997, pp. 412–417).


Annotators are asked to judge the degree of semantic relatedness of randomly chosen pairs of word uses, such as the two uses of arm in A and D from Table 1.

(A) and taking a knife from her pocket, she opened a vein in her little arm, and dipping a feather in the blood, wrote something on a piece of white cloth, which was spread before her.
(D) It stood behind a high brick wall, its back windows overlooking an arm of the sea which, at low tide, was a black and stinking mud-flat

Semantic relatedness is judged on the scale in Table 2. While there is a clear difference in meaning between the uses of arm in A and D, they bear a distant semantic similarity relation to each other and should hence be judged as distantly related (judgment 2).

4: Identical
3: Closely Related
2: Distantly Related
1: Unrelated

Table 2: DURel relatedness scale (Schlechtweg et al., 2018).
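Because several annotators judge the same use pairs on this ordinal scale, their agreement can be quantified. The text does not say which agreement measures the tool computes, so the following is only a minimal sketch under one assumption: agreement is summarized as the mean pairwise Spearman rank correlation between annotators. The annotator names and judgments are hypothetical.

```python
from itertools import combinations

def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks.

    Not defined (division by zero) if one annotator gives the same judgment everywhere.
    """
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical judgments (scale 1-4) of three annotators on the same five use pairs.
annotations = {
    "annotator1": [4, 3, 2, 1, 2],
    "annotator2": [4, 3, 2, 1, 1],
    "annotator3": [3, 4, 2, 1, 2],
}
pairwise = {
    (a, b): spearman(annotations[a], annotations[b])
    for a, b in combinations(annotations, 2)
}
mean_agreement = sum(pairwise.values()) / len(pairwise)
print(round(mean_agreement, 3))
```

A rank-based measure is used here because the scale is ordinal: only the ordering of judgments matters, not the numeric distance between scale points.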

Graph Representation and Clustering

The annotated data is then represented in a graph (McCarthy et al., 2016). The nodes represent word uses. The weight on an edge represents the median semantic relatedness judgment for a pair of uses, such as A and D from above. Consider the example in Figure 1 (left): the graph G represents the semantic relatedness structure annotated for the diachronic corpus of arm in Table 1. The nodes represent the uses from the corpus (move cursor over nodes to read), while the numbers on the connections between the nodes represent their semantic relatedness. Distances between nodes in the plot reflect the degree of relatedness of the corresponding uses (nearer uses have higher relatedness).
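The aggregation step can be illustrated with a small sketch. It assumes that raw annotations arrive as (use, use, annotator, judgment) tuples; this input format, the function name `build_graph` and the judgment values are hypothetical and are not taken from the tool itself.

```python
from collections import defaultdict
from statistics import median

# Hypothetical raw annotations: (use_1, use_2, annotator, judgment on the 1-4 scale).
# The values are illustrative, not the actual annotation of "arm".
judgments = [
    ("A", "D", "annotator1", 2),
    ("A", "D", "annotator2", 2),
    ("A", "D", "annotator3", 3),
    ("A", "C", "annotator1", 4),
    ("A", "C", "annotator2", 4),
    ("B", "D", "annotator1", 1),
]

def build_graph(judgments):
    """Return {frozenset({u, v}): median judgment} for every annotated pair."""
    per_pair = defaultdict(list)
    for use_1, use_2, _annotator, score in judgments:
        # A frozenset key makes the edge undirected: (A, D) and (D, A) are one pair.
        per_pair[frozenset((use_1, use_2))].append(score)
    return {pair: median(scores) for pair, scores in per_pair.items()}

graph = build_graph(judgments)
nodes = sorted(set().union(*graph))  # every use that occurs in at least one edge

for pair, weight in graph.items():
    print(sorted(pair), weight)
```

The median is robust to a single deviating annotator: in the hypothetical data, one judgment of 3 for the pair A and D does not move the edge weight away from 2.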

Using a variation of correlation clustering (Schlechtweg et al., 2020), the system builds clusters of uses that have high semantic relatedness among themselves (judgments 3, 4) and low semantic relatedness (judgments 1, 2) to uses from other clusters. On the graph G this results in three clusters: C1 = {A, C, F} (blue), C2 = {D, E} (orange), C3 = {B} (green). These clusters can be interpreted as word senses (Kilgarriff, 1997). C1 represents arm’s sense ‘human upper limb’, clearly expressed by uses A and C, which have the highest semantic relatedness (judgment 4). There is, however, some variation within this cluster, as F expresses a variant of the core sense expressed by A and C, referring to the non-human arm of a statue. Yet F is closely related (judgment 3) to A and C, as they have much in common. The uses in C1 are rather distinct from the uses D and E in C2, which represent the sense ‘an inlet of water’. However, as described above, they bear a distant relation to each other and are hence judged as distantly related (judgment 2). Note that within C2 we have high semantic relatedness between D and E, as these uses express the same sense. C3 represents the third sense ‘weapon’. B is semantically unrelated (judgment 1) to all other uses; there is, for example, no semantic relation between B and D.
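The clustering idea can be made concrete on the toy graph G. The sketch below is not the tool's algorithm: it uses plain correlation clustering solved by exhaustive search, which is only feasible for a handful of nodes. Two further assumptions are made: judgments are split into "low" and "high" at 2.5, and the edge D–E carries judgment 4 (the text only says its relatedness is high).

```python
from itertools import combinations

uses = ["A", "B", "C", "D", "E", "F"]

# Edge weights following the description in the text.
weights = {frozenset(p): 2 for p in combinations(uses, 2)}  # default: distantly related
weights[frozenset("AC")] = 4
weights[frozenset("AF")] = 3
weights[frozenset("CF")] = 3
weights[frozenset("DE")] = 4  # assumption: the text only says "high"
for other in "ACDEF":
    weights[frozenset(("B", other))] = 1  # B is unrelated to all other uses

THRESHOLD = 2.5  # assumed cut between low (1, 2) and high (3, 4) judgments

def partitions(items):
    """Yield every partition of a list into non-empty clusters."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in partitions(rest):
        # Put `first` into each existing cluster in turn ...
        for i in range(len(partition)):
            yield partition[:i] + [partition[i] + [first]] + partition[i + 1:]
        # ... or into a new cluster of its own.
        yield partition + [[first]]

def cost(partition):
    """Total disagreement: low edges inside clusters plus high edges across clusters."""
    cluster_of = {use: i for i, cluster in enumerate(partition) for use in cluster}
    total = 0.0
    for pair, weight in weights.items():
        u, v = tuple(pair)
        shifted = weight - THRESHOLD
        same = cluster_of[u] == cluster_of[v]
        if same and shifted < 0:
            total += -shifted
        elif not same and shifted > 0:
            total += shifted
    return total

best = min(partitions(uses), key=cost)
clusters = sorted(sorted(c) for c in best)
print(clusters)  # [['A', 'C', 'F'], ['B'], ['D', 'E']]
```

On this graph the three clusters from the text are the only partition with zero disagreement. With real, noisy annotations no such perfect partition usually exists, and a heuristic search would be needed instead of enumeration.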

Figure 1: Interactive semantic relatedness graph G for arm (left) and its clustered version (right). Black/gray lines represent high (3, 4) and low (1, 2) relatedness judgments respectively. Move cursor over nodes to see sentences, click and drag to move nodes.

Lexical Semantic Change

As marked in Table 1, the uses were sampled from two time periods, 1820–1860 (t1) and 1950–1990 (t2). The system builds the time-specific subgraphs G1 and G2 by removing from G all nodes that are not from the respective time period (t1, t2), together with their edges, as displayed in Figure 2. We are now able to compare the clusters between time periods: C3 only exists in the first time period, while C2 only exists in the second time period. That is, G1 and G2 display different senses of the word arm. This means that, within our diachronic corpus, arm has changed its meaning (Blank 1997, p. 113).
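The comparison between periods can be sketched as a set operation over the clusters present in each subgraph. The assignment of individual uses to t1 and t2 below is an assumption made for illustration; the text only fixes that C3 occurs solely in t1 and C2 solely in t2. The names `senses_in`, `lost` and `gained` are hypothetical.

```python
# Cluster assignment as described in the text.
cluster_of = {"A": "C1", "C": "C1", "F": "C1", "D": "C2", "E": "C2", "B": "C3"}

# Assumed period of each use (illustrative only).
period_of = {"A": "t1", "B": "t1", "C": "t1", "D": "t2", "E": "t2", "F": "t2"}

def senses_in(period):
    """Clusters with at least one use in the given period, i.e. the senses in G1 or G2."""
    return {cluster_of[use] for use, p in period_of.items() if p == period}

senses_t1, senses_t2 = senses_in("t1"), senses_in("t2")
lost = senses_t1 - senses_t2    # senses attested only in t1
gained = senses_t2 - senses_t1  # senses attested only in t2
changed = bool(lost or gained)

print(sorted(lost), sorted(gained), changed)  # ['C3'] ['C2'] True
```

This presence/absence comparison is the simplest possible change criterion. On larger samples, a cluster represented by one or two uses may be annotation or sampling noise, so a minimum cluster frequency would likely be required before counting a sense as gained or lost.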

Figure 2: Time-specific subgraphs G1 and G2.

A practical example: English graft

Consider an example from a data set relying on the DURel annotation procedure. Figures 3 and 4 show graphs for the English noun graft from the DWUG EN data set (Schlechtweg et al., 2021). The uses were sampled from an American English diachronic corpus (CCOHA, Davies, 2012; Alatrash et al., 2020) from the two time periods 1810–1860 and 1960–2010. In contrast to our small example from above, the graph for graft does not contain edges for all possible combinations of uses. The reason is that, with larger corpora, the number of combinations grows so large that annotating all of them is no longer feasible. Hence, only a sample of edges is annotated. Inspect the graphs to find out how graft changed over time.
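The growth in the number of use pairs, and the idea of annotating only a sample, can be illustrated as follows. The sketch assumes simple uniform random sampling without replacement; the sampling procedure actually used for DWUG EN may differ. The function name `sample_pairs`, the corpus size of 200 uses and the batch size of 500 are hypothetical.

```python
import random
from itertools import combinations
from math import comb

def sample_pairs(uses, n_pairs, seed=0):
    """Draw up to n_pairs distinct use pairs uniformly at random."""
    all_pairs = list(combinations(uses, 2))
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(all_pairs, min(n_pairs, len(all_pairs)))

uses = [f"use_{i}" for i in range(200)]  # hypothetical corpus of 200 uses
total = comb(len(uses), 2)               # n * (n - 1) / 2 possible pairs
batch = sample_pairs(uses, 500)

print(total, len(batch))  # 19900 500
```

The number of pairs grows quadratically with the number of uses: the six uses of arm yield 15 pairs, while 200 uses already yield 19,900.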

Figure 3: Interactive semantic relatedness graph G for graft (left) and its clustered version (right).

Figure 4: Time-specific subgraphs G1 and G2.


We summarize the tool’s main advantages:

  • intersubjectivity: avoids experimenter bias through a standard protocol and annotation by multiple humans; inter-annotator agreement gives a measure of reliability
  • simple: the judgment of use pair relatedness is an intuitive task for annotators generally yielding high agreement (Erk et al., 2013; Schlechtweg et al., 2018)
  • preparation-lean: researchers only need to sample word uses, avoiding costly extraction of word sense descriptions from dictionaries
  • grounded in theory: relatedness judgments have theoretical basis in cognitive semantics (Blank 1997; Schlechtweg et al., 2018)
  • flexible: clustering algorithm and parameters can be changed after annotation, avoiding re-annotation
  • visualization: the annotated data can be intuitively visualized as semantic relatedness graphs on 2D plots


Alatrash, Reem, Dominik Schlechtweg, Jonas Kuhn, and Sabine Schulte im Walde. 2020. CCOHA: Clean Corpus of Historical American English. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 6958–6966, Marseille, France. European Language Resources Association.

Blank, Andreas. 1997. Prinzipien des lexikalischen Bedeutungswandels am Beispiel der romanischen Sprachen. Niemeyer, Tübingen.

Davies, Mark. 2012. Expanding Horizons in Historical Linguistics with the 400-Million Word Corpus of Historical American English. Corpora, 7(2):121–157.

Deane, Paul D. 1988. Polysemy and cognition. Lingua, 75(4):325–361.

Erk, Katrin, Diana McCarthy, and Nicholas Gaylord. 2013. Measuring word meaning in context. Computational Linguistics, 39(3):511–554.

Kilgarriff, Adam. 1997. “I don’t believe in word senses”. Computers and the Humanities, 31(2).

Kilgarriff, Adam. 2007. Word Senses. Springer.

McCarthy, Diana, Maria Apidianaki, and Katrin Erk. 2016. Word sense clustering and clusterability. Computational Linguistics, 42(2):245–275.

Schlechtweg, Dominik, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky, and Nina Tahmasebi. 2020. SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection. In Proceedings of the 14th International Workshop on Semantic Evaluation, Barcelona, Spain. Association for Computational Linguistics.

Schlechtweg, Dominik, Sabine Schulte im Walde, and Stefanie Eckmann. 2018. Diachronic Usage Relatedness (DURel): A framework for the annotation of lexical semantic change. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 169–174, New Orleans, Louisiana.

Schlechtweg, Dominik, Nina Tahmasebi, Simon Hengchen, Haim Dubossarsky, and Barbara McGillivray. 2021. DWUG: A large Resource of Diachronic Word Usage Graphs in Four Languages.