Computationally grounded semantic similarity datasets for Basque and Spanish

Published: 1 July 2024| Version 2 | DOI: 10.17632/6xr2rp8gvh.2
Josu Goikoetxea


These word similarity datasets address a gap in psycholinguistic research by providing a comprehensive set of noun pairs with quantifications of semantic similarity. The datasets are built from two well-known Natural Language Processing resources, text corpora and knowledge bases, and aim to facilitate research in lexical processing by incorporating variables that play a significant role in semantic analysis.

The repository provides noun-pair information in Basque and European Spanish: a rich collection of noun pairs with associated psycholinguistic features and word similarity measurements. Researchers can leverage these datasets to explore semantic similarity and lexical processing across languages.

For each noun in the dataset, four linguistic features have been computed and included: concreteness (cnc), word frequency (frq), semantic neighborhood density (snd) and phonemic neighborhood density (pnd), along with their corresponding high- and low-valued clusters. The matching of noun pairs is fully computational and proceeds as follows: for each noun, find every other noun that matches its clusters in all four features, and for every matched pair compute three types of word similarity. The three word similarity measurements computed for every noun pair are text embeddings' similarity (sim_txt), wordnet-based embeddings' similarity (sim_wn) and hybrid embeddings' similarity (sim_hyb).

Each line in the word similarity datasets (in both languages) thus comprises the following 21 columns: noun1, noun2, sim_txt, sim_wn, sim_hyb, cluster cnc noun1, cnc noun1, cluster cnc noun2, cnc noun2, cluster frq noun1, frq noun1, cluster frq noun2, frq noun2, cluster pnd noun1, pnd noun1, cluster pnd noun2, pnd noun2, cluster snd noun1, snd noun1, cluster snd noun2, snd noun2. The feature dictionaries used to create the word similarity datasets are also included in the repository.
Each line of a dictionary comprises the noun, its feature cluster, the normalized value of the feature measurement and the raw value of the feature measurement.
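The cluster-based matching described above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the sample dictionary lines, the whitespace separator and the high/low cluster labels are assumptions for the example.

```python
from itertools import combinations

# Illustrative dictionary lines: noun, feature cluster, normalized
# value, raw value (separator and labels are assumptions).
sample_lines = [
    "mendi high 0.91 6.2",
    "ibai  high 0.88 5.9",
    "amets low  0.21 2.1",
]

def parse_dictionary(lines):
    """Map each noun to its (cluster, normalized, raw) triple."""
    entries = {}
    for line in lines:
        noun, cluster, norm, raw = line.split()
        entries[noun] = (cluster, float(norm), float(raw))
    return entries

# One dictionary per feature; here the same toy data is reused for all four.
features = {f: parse_dictionary(sample_lines) for f in ("cnc", "frq", "snd", "pnd")}

def matched_pairs(features):
    """Yield noun pairs whose clusters agree in all four features."""
    nouns = set.intersection(*(set(d) for d in features.values()))
    for n1, n2 in combinations(sorted(nouns), 2):
        if all(d[n1][0] == d[n2][0] for d in features.values()):
            yield n1, n2

print(list(matched_pairs(features)))  # only 'ibai'/'mendi' share all clusters
```

Each pair that survives this filter would then receive the three similarity scores (sim_txt, sim_wn, sim_hyb) from the corresponding embeddings.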


Steps to reproduce

* Creation of the dataset
The creation of this dataset involved three key steps (Goikoetxea et al. 2023):
- Computation of psycholinguistic features: four psycholinguistic features were computed for each noun in the dataset: concreteness, frequency, semantic neighborhood density and phonological neighborhood density.
- Pairing nouns: nouns were paired based on the aforementioned psycholinguistic features. This process ensured a controlled variation of variables, providing diverse noun pairs for analysis.
- Word similarity measurements: each noun pair was assigned three types of word similarity measurements, computed with text-based, WordNet-based and hybrid embeddings.

* Data collection
We computed three types of Basque and Spanish embeddings: text, wordnet and hybrid. Text and wordnet corpora were fed to the fastText model (Bojanowski et al. 2017) to obtain their embeddings. For the hybrid ones, we combined the former two (García et al. 2020), using the vecmap model (Artetxe et al. 2018) in one of the steps. We augmented our datasets with four linguistic feature measurements for each noun within the wordnets, leveraging the Python NLTK and wordfreq libraries, followed by L2-normalisation of the resulting values. The final datasets comprise every possible noun pair in both languages that fulfils several criteria, together with all the mentioned measurements.

* List of software and libraries: NLTK and wordfreq libraries for Python, fastText model, vecmap model

* References:
- Artetxe, M., Labaka, G., & Agirre, E. (2018). A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 789–798.
- Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135–146.
- García, I., Agerri, R., & Rigau, G. (2020). A common semantic space for monolingual and cross-lingual meta-embeddings. arXiv preprint arXiv:2001.06381.
- Goikoetxea, J., Arantzeta, M., & San Martin, I. (2023). Bridging Natural Language Processing and Psycholinguistics: computationally grounded semantic similarity datasets for Basque and Spanish. arXiv e-prints, arXiv-2304.

* Contact:
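The L2-normalisation of feature values and the embedding-based similarity scores can be sketched with plain NumPy. The vectors below are toy stand-ins, not actual fastText, wordnet or hybrid embeddings from the pipeline.

```python
import numpy as np

def l2_normalize(values):
    """L2-normalise a vector of feature measurements (unit Euclidean norm)."""
    v = np.asarray(values, dtype=float)
    return v / np.linalg.norm(v)

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy three-dimensional embeddings for a noun pair; real fastText
# vectors would be 300-dimensional.
emb_noun1 = [0.2, 0.8, 0.1]
emb_noun2 = [0.25, 0.7, 0.05]
print(cosine_similarity(emb_noun1, emb_noun2))
```

In the actual pipeline this similarity would be computed once per embedding space (text, wordnet, hybrid), yielding the sim_txt, sim_wn and sim_hyb columns.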


Universidad del País Vasco


Computational Linguistics, Psycholinguistics