Code-mixed Chaos : Multi-labeled Banglish & Bangla Corpus for Toxicity analysis

Name: Code-mixed Chaos : Multi-labeled Banglish & Bangla Corpus for Toxicity analysis
Creator: Maksura Binte Rabbani Nuha
Published: 2025-07-11T09:14:45.075Z
Keywords: Natural Language Processing, Machine Learning, Toxicity, Text Extraction, Bengali Language, Emotion

Rabbani Nuha, Maksura Binte; Anjum, Sadia; Aman Ullah, Mohammad

doi:10.17632/23dp3t88vk.1

Published: 11 July 2025 | Version 1 | DOI: 10.17632/23dp3t88vk.1

Contributors:

Maksura Binte Rabbani Nuha, Sadia Anjum, Mohammad Aman Ullah

Description

The dataset addresses a crucial gap in toxicity detection for Banglish, a code-mixed form of Bengali and English written in Roman script that is often underrepresented in NLP research. To help close this gap, we present a manually collected, multi-labeled dataset of 10,234 Banglish social media comments annotated across 10 classes: nine toxic categories and one non-toxic category. The toxic categories are (1) Vulgar-based, (2) Religious-Hostility, (3) Troll-based, (4) Insult-based, (5) Loathe-based, (6) Threat-based, (7) Race-based, (8) Sexual-based, and (9) Political-Chaos. The single Non-toxic category covers comments that show no form of toxicity. The dataset is evenly divided between toxic (5,117) and non-toxic (5,117) entries. Samples were sourced from platforms such as Facebook, YouTube, Instagram, and X (formerly Twitter). To balance the dataset, non-toxic texts were selectively added from a publicly available corpus, "Bengali & Banglish: A Monolingual Dataset for Emotion Detection in Linguistically Diverse Contexts". Additionally, we provide a Bangla-translated version of the dataset to support script-based comparative analysis in toxicity detection.
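As a rough illustration of the label scheme, each comment's annotation can be represented as a multi-hot vector over the ten classes. The sketch below is only one possible encoding: the names `LABELS` and `encode_labels` are hypothetical, the class ordering is arbitrary, and the actual file layout and column names of the dataset are not specified in this description.

```python
# Class names come from the description above; their order here is an arbitrary choice.
LABELS = [
    "Vulgar-based", "Religious-Hostility", "Troll-based", "Insult-based",
    "Loathe-based", "Threat-based", "Race-based", "Sexual-based",
    "Political-Chaos", "Non-toxic",
]


def encode_labels(assigned):
    """Return a 10-element multi-hot list for a collection of class names."""
    assigned = set(assigned)
    unknown = assigned - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    # Assumption: Non-toxic is mutually exclusive with the nine toxic classes,
    # since it is described as covering comments with no form of toxicity.
    if "Non-toxic" in assigned and len(assigned) > 1:
        raise ValueError("Non-toxic cannot be combined with a toxic class")
    return [1 if name in assigned else 0 for name in LABELS]


# A comment judged both insulting and threatening sets two positions to 1.
vector = encode_labels({"Insult-based", "Threat-based"})
```

A multi-hot layout like this is a common input format for multi-label classifiers, where each position is predicted independently.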

Steps to reproduce

All entries underwent preprocessing, including cleaning, normalization, and standardization, to reduce noise and make the data more suitable for training machine learning models. The dataset is multi-labeled, meaning that one comment can belong to more than one category, so we applied a majority voting technique during annotation. After annotation and preprocessing, the dataset was transliterated to Bangla.
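The majority voting step for multi-label annotation can be sketched as follows. This is a minimal illustration, not the authors' actual procedure: the number of annotators, the exact threshold, and the handling of comments where no label reaches a majority are not stated here, so the strict "more than half" rule and the function name `majority_vote` are assumptions.

```python
from collections import Counter


def majority_vote(annotations):
    """Keep each label chosen by more than half of the annotators.

    `annotations` is a list with one collection of labels per annotator.
    The strict-majority threshold is an assumption for illustration.
    """
    n = len(annotations)
    if n == 0:
        raise ValueError("at least one annotation is required")
    threshold = n // 2 + 1
    # Count each label at most once per annotator.
    counts = Counter(label for labels in annotations for label in set(labels))
    return {label for label, votes in counts.items() if votes >= threshold}


# Hypothetical example with three annotators for one comment.
votes = [
    {"Insult-based", "Vulgar-based"},
    {"Insult-based"},
    {"Insult-based", "Vulgar-based"},
]
final_labels = majority_vote(votes)  # both labels reach 2 of 3 votes
```

Voting per label, not per full label set, lets a comment keep several categories when annotators agree on each of them separately. When no label reaches the threshold, this sketch returns an empty set, which a real pipeline would need to resolve, for example through adjudication.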

Institutions

International Islamic University Chittagong
