Annotated Corpus for the Detection of Arguments and Non-Arguments for Spanish texts
The corpus contains 2875 annotated texts (messages) in Spanish. It comprises the texts that all three annotators labeled as argumentative or non-argumentative: 1366 (48%) were marked as Argument and 1509 (52%) as Non-Argument.
Steps to reproduce
The phenomenon under study is the socio-political context of Peru in 2020 and 2021, that is, before, during, and after the 2021 general elections in that country. The Spanish-language text data for this domain was extracted using two libraries: Tweepy and Twarc2. To keep only the instances corresponding to Peru, both automatic and manual filtering were applied; the Twarc2 library facilitated the automatic filtering of Peruvian texts through the "AuthorLocation" attribute. The filtered data was then cleaned. Finally, Cohen's Kappa and Fleiss' Kappa were computed to evaluate the agreement between annotators when labeling argumentative and non-argumentative texts.
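The agreement step can be illustrated with a minimal, dependency-free Python sketch of the two statistics. The record does not say which implementation was used, so the function names (`cohen_kappa`, `fleiss_kappa`) and the toy labels below are assumptions for illustration only and are not taken from the corpus.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1:
        return 1.0  # only one category was ever used
    return (observed - expected) / (1 - expected)


def fleiss_kappa(ratings):
    """Fleiss' kappa; `ratings` holds one list of labels per item,
    with the same number of annotators (at least two) for every item."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("Every item needs the same number (>= 2) of labels")
    counts = [Counter(item) for item in ratings]
    categories = set().union(*(c.keys() for c in counts))
    # Per-item agreement among annotator pairs, averaged over items.
    per_item = [
        (sum(c[k] ** 2 for k in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(per_item) / n_items
    # Chance agreement from the overall share of each category.
    shares = [sum(c[k] for c in counts) / (n_items * n_raters) for k in categories]
    p_e = sum(s * s for s in shares)
    if p_e == 1:
        return 1.0  # only one category was ever used
    return (p_bar - p_e) / (1 - p_e)


# Toy labels (hypothetical): "A" = Argument, "N" = Non-Argument.
annotator_1 = ["A", "A", "N", "N"]
annotator_2 = ["A", "N", "N", "N"]
three_annotators = [
    ["A", "A", "A"],
    ["A", "A", "N"],
    ["N", "N", "N"],
    ["A", "N", "N"],
]
print("Cohen's kappa:", cohen_kappa(annotator_1, annotator_2))
print("Fleiss' kappa:", fleiss_kappa(three_annotators))
```

Cohen's Kappa compares one pair of annotators at a time, while Fleiss' Kappa summarizes agreement across all three at once, which is presumably why both are reported for a three-annotator corpus.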