Introduction
Previous intrinsic evaluation methods for word embeddings (e.g. WordSim-353, SimLex-999, the word analogy task) ignore context and treat words in isolation. This paper proposes a dataset, CoSimLex, for evaluating how well word embeddings reflect human similarity judgements in context. It aims to answer the following question:
How well do word embeddings model the effects that context has on word meaning?
CoSimLex is used as the gold standard for Task 3 at SemEval-2020: https://competitions.codalab.org/competitions/20905
Related Work
The Stanford Contextual Word Similarity (SCWS) dataset also takes context into account and contains graded similarity judgements of word pairs. However, it was built to evaluate a discrete multi-prototype model and focuses on selecting one of a word's senses, which means each word has its own distinct context.
The Words-in-Context (WiC) dataset also focuses on discrete word senses and cannot capture the continuous effects of context on similarity judgements between different words.
Dataset and Task Design
CoSimLex is based on pairs of words from SimLex-999. The English dataset consists of 333 pairs. It also has 111 pairs for each of three other languages. Each pair is rated within two different contexts, giving a total of 1554 scores of contextual similarity. The task is to find suitable, organically occurring contexts for each pair.
Each line of CoSimLex consists of a pair of words selected from SimLex-999; two different contexts extracted from Wikipedia in which both words appear; two similarity scores, one for each context; and two standard deviations, one for each score. Figure 1 shows an example.
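The row layout described above can be sketched as a small record type. This is only an illustration: the class name, the field names, the sign convention for the change (second context minus first), and the example values are all assumptions made here, not the dataset's actual column names or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoSimLexRow:
    """One dataset line; all field names are hypothetical."""

    word1: str
    word2: str
    context1: str  # first Wikipedia passage containing both words
    context2: str  # second Wikipedia passage containing both words
    sim1: float    # mean human similarity rating in context1
    sim2: float    # mean human similarity rating in context2
    sd1: float     # standard deviation of the ratings in context1
    sd2: float     # standard deviation of the ratings in context2

    @property
    def change(self) -> float:
        # Assumed convention: rating in context2 minus rating in context1.
        return self.sim2 - self.sim1


# Invented placeholder values, purely to show the shape of a row.
example = CoSimLexRow(
    word1="word_a",
    word2="word_b",
    context1="... word_a ... word_b ...",
    context2="... word_b ... word_a ...",
    sim1=4.0,
    sim2=6.5,
    sd1=1.0,
    sd2=1.5,
)
print(example.change)
```

Keeping the change as a derived property rather than a stored field means the two ratings stay the single source of truth for both subtasks.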

To evaluate how well context-dependent embeddings can predict the effect of context on human perception of similarity, two subtasks and two metrics are proposed:
- Predicting Changes: predicting the change in similarity ratings between the two contexts. The differences in the human annotators' scores are averaged. Finally, the uncentered Pearson correlation between these averaged differences and the predicted changes is calculated.
- Predicting Ratings: predicting the absolute similarity rating for each pair in each context. Spearman correlation with gold-standard judgements is used for evaluation.
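As a rough illustration of the two metrics, the sketch below implements an uncentered Pearson correlation (the usual Pearson formula without subtracting the means, which is equivalent to cosine similarity between the two score vectors) and a Spearman correlation computed as the Pearson correlation of average ranks. The function names and the toy numbers are assumptions made for this example; this is not the official task scorer.

```python
import math


def uncentered_pearson(x, y):
    """Pearson correlation without mean subtraction (cosine of x and y)."""
    if len(x) != len(y) or not x:
        raise ValueError("inputs must be non-empty and of equal length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for an all-zero vector")
    return dot / (norm_x * norm_y)


def pearson(x, y):
    """Standard (centered) Pearson correlation."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return uncentered_pearson(
        [a - mean_x for a in x], [b - mean_y for b in y]
    )


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy numbers, invented for illustration only.
gold_change = [1.2, -0.5, 0.3]   # averaged human change per pair
pred_change = [0.8, -0.1, 0.4]   # change predicted by a model
print(uncentered_pearson(gold_change, pred_change))

gold_rating = [7.1, 3.4, 5.0, 5.0]
pred_rating = [0.9, 0.2, 0.6, 0.5]
print(spearman(gold_rating, pred_rating))
```

Because the means are not subtracted, a model that predicts changes in the right direction for every pair scores well even when the gold changes all share the same sign, a case in which the centered Pearson correlation can be uninformative or undefined.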