Computationally grounded semantic similarity datasets for Basque and Spanish

Published: 7 September 2023 | Version 1 | DOI: 10.17632/6xr2rp8gvh.1
Josu Goikoetxea


The present word similarity datasets address a gap in psycholinguistic research by providing a comprehensive set of noun pairs with quantified semantic similarity. The datasets are built from two well-known Natural Language Processing resources: text corpora and knowledge bases. They aim to facilitate research in lexical processing by incorporating variables that play a significant role in semantic analysis.

* Dataset Contents
The dataset included in this repository provides information on noun pairs in Basque and European Spanish. It offers a rich collection of noun pairs with associated psycholinguistic features and word similarity measurements. Researchers can leverage this dataset to explore semantic similarity and lexical processing across languages.

* Future Work
While the current dataset covers Basque and European Spanish, future work aims to extend its coverage to additional languages. This extension will broaden the scope of research possibilities and enhance cross-linguistic analysis of word similarity.
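As an illustration of how such a noun-pair table can be queried, the sketch below builds a toy excerpt and filters it by a psycholinguistic feature. The column names, feature scales, and example values here are hypothetical and chosen only for illustration; the released files may use a different layout.

```python
import csv
import io

# Hypothetical excerpt of the dataset layout: noun pairs with
# per-noun psycholinguistic features and three similarity scores.
# Column names and values are invented for this sketch.
sample = io.StringIO(
    "noun1,noun2,concreteness1,concreteness2,text_sim,wordnet_sim,hybrid_sim\n"
    "mendi,itsaso,4.8,4.9,0.41,0.35,0.38\n"
    "etxe,leiho,4.9,4.7,0.55,0.60,0.58\n"
    "amets,ideia,2.1,2.0,0.47,0.52,0.50\n"
)

rows = list(csv.DictReader(sample))

# Example analysis: select concrete noun pairs (both nouns with
# concreteness >= 4.0) and inspect their hybrid similarity scores.
concrete_pairs = [
    (r["noun1"], r["noun2"], float(r["hybrid_sim"]))
    for r in rows
    if float(r["concreteness1"]) >= 4.0 and float(r["concreteness2"]) >= 4.0
]
print(concrete_pairs)
```

The same filtering pattern extends naturally to the other features (frequency, semantic and phonological neighborhood density) and to the text- and WordNet-based similarity columns.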


Steps to reproduce

* Creation of the dataset
The creation of this dataset involved three key steps (Goikoetxea et al. 2023):
- Computation of psycholinguistic features: four psycholinguistic features were computed for each noun in the dataset: concreteness, frequency, semantic neighborhood density, and phonological neighborhood density.
- Pairing nouns: nouns were paired on the basis of the aforementioned psycholinguistic features. This ensured controlled variation of the variables, yielding diverse noun pairs for analysis.
- Word similarity measurements: each noun pair was assigned three types of word similarity measurements, computed with text-based methods, WordNet, and hybrid embeddings.

* Data collection
We computed three types of Basque and Spanish embeddings: text, wordnet, and hybrid. Text and wordnet corpora were fed to the fastText model (Bojanowski et al. 2017) to obtain their embeddings. For the hybrid ones, we combined the former two (García et al. 2020), using the vecmap model (Artetxe et al. 2018) in one of the steps. We augmented our datasets with four linguistic feature measurements for each noun within the wordnets, leveraging the Python NLTK and wordfreq libraries, followed by L2-normalisation of the resulting values. The final datasets comprise every possible noun pair in both languages that fulfils several criteria, together with all of the measurements mentioned above.

* List of software and libraries
NLTK and wordfreq libraries (Python), fastText model, vecmap model

* References
- Artetxe, M., Labaka, G., & Agirre, E. (2018). A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 789–798.
- Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135–146.
- García, I., Agerri, R., & Rigau, G. (2020). A common semantic space for monolingual and cross-lingual meta-embeddings. arXiv preprint arXiv:2001.06381.
- Goikoetxea, J., Arantzeta, M., & San Martin, I. (2023). Bridging Natural Language Processing and Psycholinguistics: computationally grounded semantic similarity datasets for Basque and Spanish. arXiv e-prints, arXiv-2304.

* Contact:
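Two of the computational steps above, L2-normalisation of feature values and similarity between embedding vectors, can be sketched as follows. This is a minimal illustration with toy 4-dimensional vectors, not the authors' pipeline: real fastText embeddings typically have 300 dimensions, and cosine similarity is assumed here as the standard way to score similarity between embeddings.

```python
import math

def l2_normalize(values):
    """Scale a vector of feature values to unit Euclidean length,
    analogous to the L2-normalisation applied to the four
    psycholinguistic feature measurements."""
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values] if norm else list(values)

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors (assumed here
    as the similarity measure; the paper details the exact methods)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 4-dimensional "embeddings" for two nouns.
vec_a = [0.2, 0.1, 0.4, 0.3]
vec_b = [0.1, 0.2, 0.4, 0.2]

print(l2_normalize([3.0, 4.0]))          # -> [0.6, 0.8]
print(cosine_similarity(vec_a, vec_b))   # a value in (-1, 1]
```

Note that after L2-normalisation the cosine similarity reduces to a plain dot product, which is why embedding vectors are often normalised before comparison.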


Universidad del Pais Vasco


Computational Linguistics, Psycholinguistics