Data sets

The data for the GermEval 2015: LexSub task is described in Cholakov et al. (2013). In total, it consists of 2040 sentences from the German Wikipedia, each containing a target word and a list of substitutions proposed by human annotators. There are 153 unique target words, equally distributed across three parts of speech (nouns, verbs, and adjectives) and three frequency groups. About half of this data (26 nouns, 26 verbs, and 26 adjectives in 1040 sentence contexts) forms the training set, which is made available to participants in advance. The remainder forms the test set, which will be used for the evaluation and published in full only after the shared task is completed.
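
The size of the test set is not stated explicitly, but it can be derived from the figures above. The following is a minimal sketch of that arithmetic, not an official statistic. It assumes that "equally distributed" means an exact three-way split of the 153 target words across parts of speech. The function name `derive_test_split` and the dictionary keys are hypothetical and chosen only for illustration.

```python
# Figures quoted in the data set description.
TOTAL_SENTENCES = 2040
TOTAL_TARGETS = 153
TRAIN_SENTENCES = 1040
TRAIN_TARGETS_PER_POS = {"noun": 26, "verb": 26, "adjective": 26}


def derive_test_split():
    """Derive test-set sizes as what remains after removing the training set."""
    # Assumption: the target words split exactly evenly across parts of speech.
    per_pos, leftover = divmod(TOTAL_TARGETS, len(TRAIN_TARGETS_PER_POS))
    if leftover:
        raise ValueError("target words do not split evenly across parts of speech")

    # Test targets per part of speech = all targets minus training targets.
    test_targets = {pos: per_pos - n for pos, n in TRAIN_TARGETS_PER_POS.items()}
    return {
        "targets_per_pos": test_targets,
        "targets": sum(test_targets.values()),
        "sentences": TOTAL_SENTENCES - TRAIN_SENTENCES,
    }


if __name__ == "__main__":
    print(derive_test_split())
```

Under these assumptions, the test set would contain 25 target words per part of speech (75 in total) in 1000 sentence contexts.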