Cognitive Distortion Dataset for Text Classification in Bahasa Indonesia

Name: Cognitive Distortion Dataset for Text Classification in Bahasa Indonesia
Creator: Hendra Suputra
Published: 2025-06-16T23:10:01.737Z
Keywords: Psychology, Mental Health, Classification System, Text Mining, Binary Classification

Suputra, Hendra; Linawati, Linawati; Sastra, Nyoman Putra; Sukadarmika, Gede; Ariwilani, Ni Made; Desira Swandi, Ni Luh Indah

doi:10.17632/k84bkv8dkt.4

Published: 16 June 2025 | Version 4 | DOI: 10.17632/k84bkv8dkt.4

Description

This dataset contains text data on cognitive distortion sentences, which are closely related to thought disorder. It is the first dataset of cognitive distortion sentences in Indonesian. The distortion and non-distortion sentences were collected from answers to an online questionnaire whose questions were compiled by an expert, in this case a psychologist. Experts also annotated the sentences to assign distortion classes. The set of cognitive distortion classes follows the theory of Burns, D.D. (1999) in the book "The Feeling Good Handbook".

A total of 4,662 sentences were generated. The data includes complete sentences and the distorted parts of sentences, flanked by "$" signs, along with the labels from two annotators in separate columns.

Several distortion classes with a limited number of samples were augmented using the back-translation method. The four augmented classes are "Mental Filter," "All-or-Nothing Thinking," "Magnification or Minimization," and "Emotional Reasoning." Each of these classes was expanded to a total of 200 samples. The back-translation process used five languages: Chinese (ZH), English (EN), Javanese (JV), Malay (MS), and Tagalog (TG).

In the accompanying CSV file, the "DATA STATUS" column indicates the origin of each sentence. Entries labeled "ORI-RAW" are raw data collected directly from questionnaire responses. Entries labeled "DIS-[...]" are distortion sentences generated through back-translation using the five language codes (ZH, EN, JV, MS, and TG). In addition to the Indonesian version, an English version is available.
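The conventions described above ("$"-flanked distortion parts and the "DATA STATUS" values) can be handled with a few small helpers. This is a minimal sketch, not the authors' tooling. It assumes the "$" signs appear in pairs inside the sentence text, and it checks only the "DIS-" prefix because the remainder of that label is not spelled out in the description. The function names and the example sentence are hypothetical and are not taken from the dataset.

```python
import re

# Assumption: a distorted part is wrapped in a pair of "$" signs inside the sentence.
DISTORTION_SPAN = re.compile(r"\$(.+?)\$")


def extract_distortion_spans(sentence: str) -> list[str]:
    """Return every "$"-flanked part of a sentence, without the markers."""
    return [span.strip() for span in DISTORTION_SPAN.findall(sentence)]


def strip_markers(sentence: str) -> str:
    """Return the sentence with all "$" markers removed."""
    return sentence.replace("$", "")


def data_origin(status: str) -> str:
    """Map a "DATA STATUS" value to a coarse origin label.

    "ORI-RAW" marks raw questionnaire data; values starting with "DIS-"
    mark back-translated sentences. Anything else is reported as "unknown".
    """
    status = status.strip()
    if status == "ORI-RAW":
        return "original"
    if status.startswith("DIS-"):
        return "augmented"
    return "unknown"


# Made-up illustrative sentence, not a row from the dataset.
example = "Saya $selalu gagal$ dalam segala hal."
print(extract_distortion_spans(example))
print(strip_markers(example))
print(data_origin("ORI-RAW"))
```

Filtering on the "original" label is one way to keep augmented sentences out of a held-out test split, so that back-translated variants of a sentence do not leak across splits.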

Steps to reproduce

The dataset was collected using a questionnaire. The questionnaire covers personal data, visits to psychologists, and questions about life, and is intended for Indonesians aged 18 and over. The questions were compiled in consultation with an expert in psychology, in this case a psychologist. The questionnaire was distributed online through Google Forms and received responses from 593 respondents. The experts then analyzed and annotated each answer, which produced a dataset of 4,662 sentences. Additional data was generated using the back-translation augmentation method, resulting in a total of 4,992 sentences.
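The back-translation step can be sketched as follows. This is an illustration under stated assumptions, not the procedure the authors ran: the translation tool they used is not named, so `translate` is a hypothetical caller-supplied function, and `back_translate`, `augment_class`, and the "ID" source code are names introduced here. The pivot codes are the five listed in the description.

```python
from typing import Callable

# Pivot language codes as listed in the dataset description.
PIVOT_LANGUAGES = ["ZH", "EN", "JV", "MS", "TG"]

# Hypothetical translator signature: (text, source_code, target_code) -> text.
Translator = Callable[[str, str, str], str]


def back_translate(sentence: str, pivot: str, translate: Translator,
                   source: str = "ID") -> str:
    """Translate a sentence into a pivot language and back into the source."""
    intermediate = translate(sentence, source, pivot)
    return translate(intermediate, pivot, source)


def augment_class(sentences: list[str], target_size: int, translate: Translator,
                  pivots: list[str] = PIVOT_LANGUAGES) -> list[str]:
    """Add back-translated variants until the class holds target_size sentences.

    Variants identical to a sentence already in the class are skipped. The
    result may stay below target_size if too few distinct variants exist.
    """
    result = list(sentences)
    seen = set(result)
    for pivot in pivots:
        for sentence in sentences:
            if len(result) >= target_size:
                return result
            candidate = back_translate(sentence, pivot, translate)
            if candidate not in seen:
                seen.add(candidate)
                result.append(candidate)
    return result
```

Under this sketch, a call such as `augment_class(class_sentences, 200, translate)` would mirror the stated target of 200 samples per augmented class, with `translate` wrapping whatever translation service is available.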

Institutions

Universitas Udayana
