Code-mixed Chaos : Multi-labeled Banglish & Bangla Corpus for Toxicity analysis
Description
This dataset addresses a gap in toxicity detection for Banglish, a code-mixed form of Bengali and English written in Roman script that is often overlooked in NLP research. We present a manually collected, multi-labeled dataset of 10,234 Banglish social media comments annotated across 10 classes: nine toxic categories and one non-toxic category. The toxic categories are (1) Vulgar-based, (2) Religious-Hostility, (3) Troll-based, (4) Insult-based, (5) Loathe-based, (6) Threat-based, (7) Race-based, (8) Sexual-based, and (9) Political-Chaos. The single Non-toxic category covers comments that show no form of toxicity. The dataset is evenly split between toxic (5,117) and non-toxic (5,117) entries. Samples were sourced from platforms such as Facebook, YouTube, Instagram, and X (formerly Twitter). To balance the dataset, non-toxic texts were selectively added from a publicly available corpus, "Bengali & Banglish: A Monolingual Dataset for Emotion Detection in Linguistically Diverse Contexts". We also provide a Bangla-translated version of the dataset to support script-based comparative analysis in toxicity detection.
Steps to reproduce
All entries were preprocessed (cleaning, normalization, and standardization) to reduce noise and make the data more suitable for training machine learning models. Because the dataset is multi-labeled, meaning one comment can belong to more than one category, we applied a majority voting technique for annotation. After annotation and preprocessing, the dataset was transliterated to Bangla.
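The majority voting step for multi-label annotation can be illustrated with a minimal sketch. It is not the authors' actual procedure: the number of annotators, the strict "more than half" threshold, the fallback to Non-toxic when no toxic label reaches a majority, and the function name `majority_vote` are all assumptions made for illustration. Only the category names come from the description.

```python
from collections import Counter

# The nine toxic categories named in the dataset description.
TOXIC_LABELS = [
    "Vulgar-based", "Religious-Hostility", "Troll-based", "Insult-based",
    "Loathe-based", "Threat-based", "Race-based", "Sexual-based",
    "Political-Chaos",
]


def majority_vote(annotations):
    """Combine per-annotator label sets into one multi-label decision.

    annotations: list of sets, one set of toxic labels per annotator
    (hypothetical input format). A label is kept when strictly more than
    half of the annotators assigned it (assumed threshold).
    """
    n_annotators = len(annotations)
    # Count each label at most once per annotator.
    counts = Counter(label for labels in annotations for label in set(labels))
    kept = [label for label in TOXIC_LABELS if counts[label] * 2 > n_annotators]
    # Assumption: a comment with no majority toxic label is treated as Non-toxic.
    return kept if kept else ["Non-toxic"]


# Three hypothetical annotators labeling one comment.
votes = [
    {"Insult-based", "Troll-based"},
    {"Insult-based"},
    {"Troll-based", "Threat-based"},
]
print(majority_vote(votes))
```

Voting per label, rather than on the whole label set, lets a comment keep several categories at once, which matches the multi-label design described above.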
Institutions
- International Islamic University Chittagong