Data for: Multi-Class, Multi-Label and Multi-Source Detection of Communicative Actions in Opinionated Texts.
Description
Compared to sentiment or emotion analysis, the research field of multi-class and multi-label extraction of communicative actions (i.e., suggestions, wishes, advice, experience sharing) remains fairly unexplored. The lack of datasets for training and benchmarking has made it difficult to draw research conclusions. Likely due to limited data access, most works focus on a single source of data in their studies (i.e., suggestions). So far, only the English suggestions dataset has been made public, following the SemEval competition in 2019, and a few small corpora have been described in specific papers. As pointed out in the literature, there is also a need to define a linguistic approach that precisely describes not only the explicit but also the implicit understanding of each class. This status quo leaves some questions unanswered:
1. How frequently do such classes occur, especially with respect to different Internet genres (e.g., how many suggestions can you find on Facebook and how many pieces of advice on forums)?
2. Is it possible to build a (semantic) domain- and source-independent model that detects such classes with high NLP scores? In other words, can machines distinguish between, e.g., suggestions and opinions as humans do, and to what extent?
3. Do classes with similar constructions or patterns overlap semantically and syntactically, and do the results overlap as a consequence?
4. Finally, should such classes be annotated at text, sentence, or phrase level, and which classes tend to coincide within texts?
In our work, we try to answer these questions by building a multi-source UGC dataset in Polish (10 thousand texts available in an open source repository), conducting a linguistic analysis of frequent communicative actions (based on the communicative actions theorem), and annotating the dataset with several classes at phrase level to build three versions of multi-class and multi-label models (SVM, BiLSTM, and BERT-based) for comparative research purposes.
We observe a high correlation of classes with sources and domains, which is a good indicator of where to search for data depending on the study. We succeed at building a multi-source and multi-class classifier with scores, despite the class imbalance. We also record which classes tend to co-occur, how frequently they do so, and where they may clash. The uploaded dataset consists of statements annotated with different classes (i.e., suggestions, wishes, advice, experience sharing) from all analysed domains (cosmetics, banking, and consumer electronics) and from two different stages of annotation, in .csv and .json format.
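As a rough illustration of how class frequencies per source and class co-occurrence could be counted from a multi-label .csv file, the sketch below may help. It is not the authors' code: the column names (`text`, `source`, `labels`), the `;` label separator, and the sample rows are all assumptions made for this example and may not match the actual files.

```python
import csv
import io
from collections import Counter
from itertools import combinations

# Toy rows for illustration only; they are NOT taken from the dataset.
# The column names and the ";" separator are assumptions.
SAMPLE_CSV = """text,source,labels
example statement 1,forum,advice;experience sharing
example statement 2,facebook,suggestion
example statement 3,forum,advice
example statement 4,facebook,suggestion;wish
"""


def parse_labels(cell):
    """Split a label cell into a sorted list of unique, non-empty labels."""
    return sorted({label.strip() for label in cell.split(";") if label.strip()})


def class_counts_by_source(rows):
    """Count how often each class appears in each source."""
    counts = Counter()
    for row in rows:
        for label in parse_labels(row["labels"]):
            counts[(row["source"], label)] += 1
    return counts


def cooccurrence_counts(rows):
    """Count how often each unordered pair of classes shares a statement."""
    pairs = Counter()
    for row in rows:
        for first, second in combinations(parse_labels(row["labels"]), 2):
            pairs[(first, second)] += 1
    return pairs


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
by_source = class_counts_by_source(rows)
pairs = cooccurrence_counts(rows)
```

To run this on the real files, the in-memory sample would be replaced with an opened .csv file, and the column names adjusted to whatever the uploaded files actually use.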
Funding
European Regional Development Fund
POIR.01.01.01-00-0806/16, POIR.01.01.01-00-0923/20