MixTox-SA: A Mixed-Coded Multilingual Toxicity Detection Dataset for South Asian Social Media

Published: 10 October 2025 | Version 1 | DOI: 10.17632/bsdcd7zkyz.1
Contributors: …, Susmoy Biswas

Description

MixTox-SA is a dataset for studying multilingual and mixed-coded toxicity detection in South Asian social media. It consists of user-generated comments collected from public Facebook discussions and reflects authentic usage of Bangla, Hindi, Banglish (Romanized Bangla), and English. It captures the linguistic and cultural diversity of online communication in the region, where code-switching and transliteration often occur within a single comment or sentence.

Each comment is manually annotated with one of three labels: Toxic, Non-Toxic, or Neutral. The labels are intended to cover the range of sentiment and behavior found in social discourse. The class distribution is nearly balanced: 4,370 Toxic, 4,335 Non-Toxic, and 4,330 Neutral samples, for a total of 13,035 entries. This balance helps reduce bias toward a majority class during model training and evaluation.

The dataset is distributed as a UTF-8 encoded CSV file. Its fields include the full text of each comment, its dominant language type (Bangla, Hindi, Banglish, English, or Mixed), and the assigned label. This structure supports downstream NLP tasks such as multilingual sentiment analysis, hate speech detection, cross-lingual model evaluation, and code-mixed language modeling.

MixTox-SA is intended as a benchmark resource for transformer-based and hybrid deep learning approaches to multilingual toxicity detection. Because it includes both native-script and Romanized text, it reproduces the noise typical of social media discourse, which makes it a useful testbed for low-resource multilingual NLP and cross-lingual generalization research.

All data were collected exclusively from publicly available sources in compliance with platform policies. Personally identifiable information has been removed or anonymized to protect user privacy. The dataset is intended strictly for academic research and educational purposes, and the inclusion of a comment does not imply endorsement of the opinions it expresses.
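As a rough illustration of working with the CSV layout described above, the following Python sketch counts labels and language types. The column names (text, language, label), the file name in the trailing comment, and the sample rows are all assumptions made for this example; the dataset page does not list the actual CSV header, so adjust the constants to match the real file.

```python
import csv
import io
from collections import Counter

# Assumed column names; the dataset page does not state the real header.
TEXT_COL = "text"
LANG_COL = "language"
LABEL_COL = "label"


def summarize(csv_file):
    """Count labels and language types from an open CSV file object."""
    reader = csv.DictReader(csv_file)
    label_counts = Counter()
    language_counts = Counter()
    for row in reader:
        label_counts[row[LABEL_COL].strip()] += 1
        language_counts[row[LANG_COL].strip()] += 1
    return label_counts, language_counts


# Tiny made-up sample standing in for the real file (not actual dataset rows).
SAMPLE = (
    "text,language,label\n"
    "example comment one,English,Non-Toxic\n"
    "example comment two,Banglish,Toxic\n"
    "example comment three,Mixed,Neutral\n"
    "example comment four,English,Toxic\n"
)

if __name__ == "__main__":
    labels, languages = summarize(io.StringIO(SAMPLE))
    print(labels)
    print(languages)
    # For the real file (hypothetical file name), open it as UTF-8:
    # with open("mixtox_sa.csv", encoding="utf-8", newline="") as f:
    #     labels, languages = summarize(f)
```

Run against the real file, the label counts should match the class sizes reported in the description if the assumed column names are correct.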

Institutions

  • Daffodil International University

Categories

Natural Language Processing

Licence