PAP900
Description
The PAP900 dataset centers on the semantic relationship between affective words in Portuguese. It contains 900 word pairs, each annotated by at least 30 human raters for both semantic similarity and semantic relatedness. In addition to the semantic ratings, the dataset includes the word categorization used to build the word pairs and detailed sociodemographic information about the annotators, which makes it possible to analyze how personal factors influence the perception of semantic relationships. Furthermore, this article describes the dataset construction process in detail, from word selection to agreement metrics. Data were collected from Portuguese university psychology students, who completed two rounds of questionnaires. In the first round, annotators were asked to rate word pairs on either semantic similarity or semantic relatedness. The second round switched the relation type for most annotators, with a small percentage being asked to repeat the same relation. The instructions emphasized the differences between semantic relatedness and semantic similarity and provided examples of expected ratings for both. There are few semantic relation datasets in Portuguese, and none focusing on affective words. PAP900 is distributed in distinct formats so that it is easy to use both for researchers who only need the final averaged values and for researchers who want to take advantage of the individual ratings, the word categorization, and the annotator data.
Steps to reproduce
The PAP900 dataset explores semantic similarity and relatedness between affective words in Portuguese. Composed of 900 word pairs, each rated by at least 30 psychology students, it captures two core semantic aspects: similarity (taxonomic closeness) and relatedness (associative connections). It provides both average and individual ratings to help researchers understand human perception of affective words.

Data Collection
Participants rated word pairs on similarity or relatedness in two rounds, with most switching relation type between rounds so that both relations were covered. Ratings were screened for reliability based on response time and consistency with average scores.

Dataset Structure
PAP900 includes three formats:
average: Mean similarity and relatedness scores for each word pair.
raw: Individual ratings, word pair categories, and annotator demographics.
curated matrix: Organized scores with outliers removed for analysis.
These formats support both general and detailed studies.

Dataset Value
PAP900 addresses a gap in Portuguese-language semantic datasets, especially for emotions. By separating similarity and relatedness, it supports nuanced research into affective language, enabling applications in sentiment analysis and affective computing. Annotator demographics, including age, gender, and language background, offer additional layers for studying personal biases.

Methodology
Word pairs were drawn from "Atlas of the Heart" and translated to Portuguese, ensuring variety in emotional categories. Clear instructions helped annotators distinguish between similarity and relatedness. Outliers were identified through low correlation with average scores and removed to preserve data quality.

Annotator Agreement
Agreement was measured using the AMIAA and APIAA metrics. Similarity ratings showed higher agreement (0.757) than relatedness ratings (0.675); the lower agreement for relatedness likely reflects its greater subjectivity and the complexity of affective language. Intra-annotator reliability, tested on a subset, showed reasonable consistency (0.667).
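
The correlation-based outlier screening and the averaging step described above can be illustrated with a minimal Python sketch. It is not the authors' exact procedure: the function names, the 0.5 threshold, the 0 to 4 rating scale in the toy data, and the choice to compare each annotator against the mean of the other annotators (leave-one-out) are all assumptions made for illustration. The response-time screening mentioned above is not shown.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length rating lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant rater carries no correlation signal
    return cov / sqrt(var_x * var_y)


def screen_annotators(ratings, threshold=0.5):
    """Keep annotators whose ratings correlate with the mean of the others.

    ratings: dict mapping a (hypothetical) annotator id to a list of
    ratings, one per word pair, in the same pair order for everyone.
    threshold: assumed cut-off; the source does not state the value used.
    """
    kept = {}
    for annotator, own in ratings.items():
        others = [r for a, r in ratings.items() if a != annotator]
        # Leave-one-out mean, so a rater is not compared against themselves.
        loo_mean = [mean(column) for column in zip(*others)]
        if pearson(own, loo_mean) >= threshold:
            kept[annotator] = own
    return kept


def average_ratings(ratings):
    """Mean rating per word pair across the given annotators."""
    return [mean(column) for column in zip(*ratings.values())]


# Toy data (invented): four annotators rating four word pairs.
toy = {
    "a1": [0, 1, 3, 4],
    "a2": [0, 2, 3, 4],
    "a3": [1, 1, 4, 3],
    "a4": [4, 3, 1, 0],  # rates in the opposite direction to the others
}
kept = screen_annotators(toy)
averages = average_ratings(kept)
print(sorted(kept))
print([round(a, 3) for a in averages])
```

In this toy example the annotator whose ratings run opposite to the rest is dropped, and the per-pair averages are computed over the remaining three. With real data, the same idea would be applied separately to the similarity and the relatedness ratings.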
Funding
Fundação para a Ciência e a Tecnologia
LA/P/0063/2020
Fundação para a Ciência e a Tecnologia
FRH/BD/129225/2017