Sense Identification Dataset - SID

Published: 31 July 2020 | Version 1 | DOI: 10.17632/r5fbdpvnkk.1


The Sense Identification Dataset (SID) was obtained by manually annotating the term pairs in the SemEval-2017 Task 2 English dataset (described in [1]) with sense identifiers. The original dataset contains, for each term pair, a score expressing the human similarity rating. For each such pair, SID adds a pair of annotated senses; the senses were chosen so as to be compatible with (that is, to explain) the existing similarity rating. The underlying rationale is that rating similarity involves a hidden sense identification step. The task of recovering the senses selected at (semantic similarity) rating time is called the sense individuation task; we hypothesize that it is a fundamental, though neglected, complement to conceptual similarity. This is the first dataset designed to address this challenging task. The sense identifiers are drawn from the BabelNet sense inventory, which was chosen because it is broadly adopted and because its identifiers can be easily mapped onto further resources, such as WordNet and Wikidata.

[1] J. Camacho-Collados, M. T. Pilehvar, N. Collier, R. Navigli, SemEval-2017 Task 2: Multilingual and cross-lingual semantic word similarity, in: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval 2017), Vancouver, Canada, 2017, pp. 6–17.
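
As an illustration of the record structure described above (two terms, a similarity rating, and one sense identifier per term), the following is a minimal Python sketch of how such annotations might be read. The description does not specify the file layout, so the tab-separated column order, the `AnnotatedPair` class, and the `parse_pairs` function are all assumptions made for this example. The identifier pattern (`bn:` followed by eight digits and a part-of-speech letter) reflects the usual form of BabelNet synset IDs and should be checked against the actual files. The sample row uses made-up placeholder values, not entries from SID.

```python
import csv
import io
import re
from dataclasses import dataclass

# Assumed shape of a BabelNet synset ID: "bn:" + 8 digits + POS letter (n, v, a, r).
BABELNET_ID = re.compile(r"^bn:\d{8}[nvar]$")


@dataclass(frozen=True)
class AnnotatedPair:
    """One term pair with its similarity rating and the two annotated senses."""
    term1: str
    term2: str
    similarity: float
    sense1: str
    sense2: str


def parse_pairs(handle):
    """Parse rows in a hypothetical layout: term1, term2, score, sense1, sense2."""
    pairs = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if len(row) != 5:
            raise ValueError(f"expected 5 columns, got {len(row)}: {row!r}")
        term1, term2, score, sense1, sense2 = row
        for sense in (sense1, sense2):
            if not BABELNET_ID.match(sense):
                raise ValueError(f"not a BabelNet-style identifier: {sense!r}")
        pairs.append(AnnotatedPair(term1, term2, float(score), sense1, sense2))
    return pairs


# Placeholder row for demonstration only; these values are not taken from SID.
sample = io.StringIO("term_a\tterm_b\t2.5\tbn:00000001n\tbn:00000002n\n")
pairs = parse_pairs(sample)
```

Validating identifiers at load time keeps malformed annotations from propagating silently into any later mapping onto WordNet or Wikidata.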



Università degli Studi di Torino


Computational Linguistics, Natural Language Processing, Sense, Lexical Processing, Semantic Processing, Lexical Semantics, Word Embedding