Computationally grounded semantic similarity datasets for Basque and Spanish

Published: 14 October 2024 | Version 4 | DOI: 10.17632/6xr2rp8gvh.4
Contributor:
Josu Goikoetxea

Description

These word similarity datasets address a gap in psycholinguistic research by providing a comprehensive set of noun pairs with quantified semantic similarity. The datasets are built on two well-known Natural Language Processing resources: text corpora and knowledge bases. They aim to facilitate research in lexical processing by incorporating variables that play a significant role in semantic analysis. The datasets in this repository provide noun-pair information in Basque and European Spanish: a rich collection of noun pairs with associated psycholinguistic features and word similarity measurements. Researchers can leverage these datasets to explore semantic similarity and lexical processing across languages.

In the datasets, each noun is associated with four linguistic features: concreteness (CNC), word frequency (FRQ), semantic neighborhood density (SND), and phonemic neighborhood density (PND). Each feature also has corresponding high- and low-valued clusters. The matching of noun pairs is entirely computational and follows these steps: for each noun, we search for every other noun that matches its clusters across all four features. For each matching noun pair, we compute three types of word similarity: text embeddings similarity (SIM_TXT), WordNet-based embeddings similarity (SIM_WN), and hybrid embeddings similarity (SIM_HYB).

Each line of the dataset includes the following columns:
- noun1
- noun2
- SIM_TXT
- SIM_WN
- SIM_HYB
- CNC value of noun1
- Cluster identifier for the CNC of noun1
- CNC value of noun2
- Cluster identifier for the CNC of noun2
- FRQ of noun1
- Cluster identifier for the FRQ of noun1
- FRQ of noun2
- Cluster identifier for the FRQ of noun2
- PND of noun1
- Cluster identifier for the PND of noun1
- PND of noun2
- Cluster identifier for the PND of noun2
- SND of noun1
- Cluster identifier for the SND of noun1
- SND of noun2
- Cluster identifier for the SND of noun2
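The cluster-matching step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the dictionary layout and the toy Basque nouns and cluster labels are assumptions for demonstration only.

```python
from itertools import combinations

# The four features whose cluster labels must all agree for a pair to match
# (CNC = concreteness, FRQ = frequency, SND = semantic neighborhood density,
# PND = phonemic neighborhood density).
FEATURES = ("CNC", "FRQ", "SND", "PND")

def matching_pairs(clusters):
    """clusters: {noun: {feature: cluster_label}}.
    Return every (noun1, noun2) whose cluster labels agree on all four features."""
    pairs = []
    for n1, n2 in combinations(sorted(clusters), 2):
        if all(clusters[n1][f] == clusters[n2][f] for f in FEATURES):
            pairs.append((n1, n2))
    return pairs
```

Each pair returned by such a procedure would then receive the three similarity scores (SIM_TXT, SIM_WN, SIM_HYB).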
The feature dictionaries used to create the word similarity datasets are also included in the repository. Each dictionary line comprises the noun, its corresponding feature cluster, the normalized value of the feature measurement, and the raw value of the feature measurement.
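A dictionary file with that line layout can be loaded as sketched below. The whitespace delimiter, the example values, and the "high"/"low" cluster labels are assumptions for illustration; check the actual files for the exact format.

```python
def load_feature_dict(path):
    """Parse one feature dictionary whose lines contain:
    <noun> <cluster> <normalized_value> <raw_value>
    Return {noun: (cluster, normalized, raw)}."""
    table = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # skip blank lines
            noun, cluster, norm, raw = line.split()
            table[noun] = (cluster, float(norm), float(raw))
    return table
```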

Files

Steps to reproduce

* Creation of the dataset in three steps (Goikoetxea et al. 2023):
  - Compute psycholinguistic features: four psycholinguistic features were computed for each noun in the dataset, namely concreteness, frequency, semantic neighborhood density, and phonological neighborhood density.
  - Pair nouns: nouns were paired based on the aforementioned psycholinguistic features. This process ensured controlled variation of the variables, providing diverse noun pairs for analysis.
  - Assign word similarity measurements: for each noun pair, three word similarity measurements were computed, using text-based, WordNet-based, and hybrid embeddings.
* Replicate the Basque stemming process:
  - Download and install Foma from its official website.
  - Download the eu-stemmer.zip file from the repository. It contains the Perl script "eu-stemmer.pl" and the FST file "stemmer.fst".
  - Prepare the FST file: use a finite-state transducer (FST) specific to Basque (such as stemmer.fst in eu-stemmer.zip) with stemming rules.
  - Update the $fomapath variable to match the path where your Foma binary is located.
  - Run the script from the command line by piping input text into it, for example: echo "your text here" | ./eu-stemmer.pl
* Data collection: we computed three types of Basque and Spanish embeddings: text, WordNet, and hybrid. Text and WordNet corpora were fed to the fastText model (Bojanowski et al. 2017) to obtain their embeddings. For the hybrid ones, we combined the former two (García et al. 2020) using the VecMap model (Artetxe et al. 2018). We augmented our datasets with four linguistic feature measurements for each noun within the wordnets, leveraging the Python NLTK and wordfreq libraries. The final datasets comprise every noun pair in both languages that fulfills the matching criteria, together with all of the measurements mentioned above.
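The similarity scores attached to each pair are comparisons between embedding vectors; a minimal sketch follows. Cosine similarity is assumed here as the measure (the standard choice for fastText-style embeddings), and the vectors are toy values, not real dataset embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Under this assumption, SIM_TXT, SIM_WN, and SIM_HYB for a pair would be the cosine of the two nouns' text, WordNet, and hybrid embeddings respectively.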
* List of software and libraries: Python NLTK and wordfreq libraries, fastText and VecMap models, Foma.
* References:
  - Artetxe, M., Labaka, G., & Agirre, E. (2018). A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 789–798.
  - Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135–146.
  - García, I., Agerri, R., & Rigau, G. (2020). A common semantic space for monolingual and cross-lingual meta-embeddings. arXiv preprint arXiv:2001.06381.
  - Goikoetxea, J., Arantzeta, M., & San Martin, I. (2023). Bridging Natural Language Processing and Psycholinguistics: computationally grounded semantic similarity datasets for Basque and Spanish. arXiv e-prints, arXiv-2304.
* Contact: josu.goikoetxea@ehu.eus

Institutions

Universidad del País Vasco

Categories

Computational Linguistics, Psycholinguistics

Licence