IV-NLP: A Methodology to understand the behavior of DL models and its Application from a Causal Approach
Description
These datasets are associated with an article titled "IV-NLP: A Methodology to understand the behavior of DL models and its Application from a Causal Approach", which will be submitted to MDPI Electronics this week. The final title of the article may change during the review and editing process. The dataset's DOI will remain unchanged, and this description will be updated once the article has been accepted.

This dataset (V4) extends and builds upon the previous version (V3), originally published in 2023 and available at https://data.mendeley.com/datasets/xh7vvty9zt/3.

The data sets hosted in this repository are the following:

The original dataset has 4,015 records with these fields:
- Id (automatic sequential, from 0 to 4014)
- Tweet_Checked (tweet text)
- Fecha (publication date)
- Clase_Argumento (class label: 0 or 1)

The synthetic dataset has 3,966 records; its texts were generated with the GPT-3.5-turbo-0125 model. Its fields are:
- Id (automatic sequential, from 0 to 3965)
- Tweet_Checked (tweet text)
- Original_Id (Id of the corresponding original record)
- Fecha (publication date)
- Clase_Argumento (class label: 0 or 1)
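Since the repository does not specify file names or formats, the following is a minimal, hypothetical Python sketch of how records conforming to the schema above could be represented and sanity-checked; the helper `check_schema` and the sample rows are invented placeholders, not actual dataset content.

```python
# Expected fields of the two datasets, as listed in the description above.
ORIGINAL_FIELDS = ("Id", "Tweet_Checked", "Fecha", "Clase_Argumento")
SYNTHETIC_FIELDS = ("Id", "Tweet_Checked", "Original_Id", "Fecha", "Clase_Argumento")


def check_schema(records, expected_fields):
    """Check field names, a sequential Id starting at 0, and binary class labels."""
    return all(
        tuple(r.keys()) == tuple(expected_fields) and r["Clase_Argumento"] in (0, 1)
        for r in records
    ) and [r["Id"] for r in records] == list(range(len(records)))


# Placeholder rows (invented, not real dataset records).
sample = [
    {"Id": 0, "Tweet_Checked": "texto de ejemplo 1", "Fecha": "2021-04-11", "Clase_Argumento": 1},
    {"Id": 1, "Tweet_Checked": "texto de ejemplo 2", "Fecha": "2021-04-12", "Clase_Argumento": 0},
]
print(check_schema(sample, ORIGINAL_FIELDS))  # True
```

The same check with `SYNTHETIC_FIELDS` would flag these rows, since they lack the `Original_Id` field linking back to the original dataset.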
Files
Steps to reproduce
Original Data Set: The dataset was generated from a corpus of 4,000 records used to propose and evaluate an annotation method for argumentative and non-argumentative texts in Spanish (Guzmán-Monteza, Y., 2023). The previously published corpus for the detection of argumentation in Spanish, corresponding to the first annotation task (Guzmán, 2023), is available in Mendeley Data [DOI: 10.17632/xh7vvty9zt.3] and contains 2,875 texts annotated as Argument or No_Argument (1 and 0, respectively). Subsequently, a review of the remaining 1,125 labelled records was carried out to guarantee the reliability of the data. The 4,000 records were extracted from the social network Twitter (2021), covering 2020 and 2021 during the general elections in Peru. Next, 15 records extracted from the Portal of the Constituent Process at the Service of the Peoples of Peru (Guzmán et al., 2021) were added, for a total of 4,015 records labeled in Spanish. Finally, the original data was split into training, test, and validation subsets. Likewise, within each subset, argumentative and non-argumentative texts were distributed so as to preserve the class balance during the data preparation stage.

Synthetic Data Set:
1. Input – original texts: six (06) data subsets were generated, containing the argumentative and non-argumentative texts for each of the training, test, and validation sets.
2. Intervention method: the adjustments included parameter tuning, the design of specific prompts with precise examples to give the model context, rules specifying the end of the generated sentence, and the configuration of the parameters temperature, max_tokens, and stop. In addition, a variation of the RAG (Retrieval-Augmented Generation) technique was applied (creation of a speech-marker file and of an authority-list file) to introduce new information and reduce the generation of false content.
3. Output – texts generated by the model: the results generated by the model were saved.
4.
Validation of the texts generated by the model: each time errors were identified while validating the results produced by the model for each of the six (06) data sets, the affected records were relocated and/or eliminated, as appropriate. This approach was chosen to make efficient use of the records in the synthetic data sets.
5. Finally, the argumentative and non-argumentative records correctly generated by the model for each of the six (06) data sets, respectively, were totaled.
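The exact prompts and parameter values used in step 2 are not published here, so the following Python sketch only illustrates how a chat-completion request for GPT-3.5-turbo-0125 combining the temperature, max_tokens, and stop parameters could be assembled; the prompt wording, the end-of-sentence marker, and all numeric values are invented placeholders, not the authors' actual configuration.

```python
# Hypothetical sketch of one generation request (step 2 of the method).
END_MARK = "###"  # illustrative stop rule marking the end of the generated sentence


def build_request(original_text: str, context_examples: list[str]) -> dict:
    """Assemble the payload for one synthetic-text generation call."""
    messages = [
        {
            "role": "system",
            "content": (
                "Parafrasea el texto conservando su caracter argumentativo. "
                f"Termina la oracion con {END_MARK}."
            ),
        },
    ]
    # Precise examples give the model context, as described in step 2.
    for example in context_examples:
        messages.append({"role": "user", "content": example})
    messages.append({"role": "user", "content": original_text})
    return {
        "model": "gpt-3.5-turbo-0125",
        "messages": messages,
        "temperature": 0.7,  # placeholder value
        "max_tokens": 120,   # placeholder value
        "stop": [END_MARK],  # rule specifying the end of the generated sentence
    }


payload = build_request("texto original de ejemplo", ["ejemplo de contexto"])
print(payload["model"])  # gpt-3.5-turbo-0125
```

The payload keys match the keyword arguments of the openai Python client's `chat.completions.create`; the actual API call, and the RAG-style retrieval from the speech-marker and authority-list files, are omitted here.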
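Steps 4 and 5 (discarding or relocating erroneous generated records, then totaling the correct ones per subset) can be sketched as below; the validation rule shown (non-empty text ending in terminal punctuation) is an invented placeholder, since the exact error criteria are not specified here.

```python
# Hypothetical sketch of steps 4-5: filter each of the six subsets, keep only
# well-formed generated texts, and total the records that remain.


def validate(text: str) -> bool:
    """Placeholder check for a well-formed generated sentence (assumed rule)."""
    text = text.strip()
    return bool(text) and text[-1] in ".!?"


def clean_subsets(subsets: dict) -> tuple:
    """Keep only valid records in each subset and total them (step 5)."""
    cleaned = {name: [t for t in texts if validate(t)] for name, texts in subsets.items()}
    totals = {name: len(texts) for name, texts in cleaned.items()}
    return cleaned, totals


# Invented example subsets (the real method uses six: arg/no-arg x train/test/val).
subsets = {
    "train_arg": ["Oracion valida.", "truncada sin punto"],
    "train_no_arg": ["Otra oracion valida."],
}
_, totals = clean_subsets(subsets)
print(totals)  # {'train_arg': 1, 'train_no_arg': 1}
```

In the method described above, records are also relocated between subsets where appropriate rather than always eliminated; that relocation logic is omitted from this sketch.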