BOISHOMMO: A Standardized Multi-Label Bangla Hate Speech Dataset for Imbalance Analysis

Published: 18 August 2025 | Version 3 | DOI: 10.17632/4tsb6tg9b2.3
Contributors:
Showrov Azam,

Description

BOISHOMMO is a multi-label annotated dataset for hate speech analysis in Bangla, a morphologically rich, low-resource language. It addresses a gap in Natural Language Processing by providing a resource designed for multi-label classification in a non-Latin-script language. The dataset also includes an English translation of each Bangla comment, which supports cross-lingual research and makes the data accessible to international researchers working in multilingual NLP and comparative linguistics.

The dataset consists of 2,499 Bangla social media comments collected from public Facebook news pages such as Prothom Alo, Jugantor, and Kaler Kantho. Each comment was manually annotated by three native Bangla speakers following strict guidelines to ensure consistency and accuracy. Labels were assigned across 10 overlapping hate categories: Race, Behavior, Physical, Class, Religion, Disability, Ethnicity, Gender, Sexual Orientation, and Political. The final annotation for each comment was determined by majority vote, and inter-annotator agreement was measured using Cohen’s Kappa to validate annotation quality.

Beyond its multi-aspect annotation structure, BOISHOMMO emphasizes imbalance analysis. The dataset shows natural label imbalance across hate categories, reflecting real-world distributions and the challenges of hate speech detection. This makes it a useful benchmark for testing model robustness, building multi-label classifiers, and exploring techniques such as data augmentation and resampling. BOISHOMMO supports the development of machine learning models and linguistic tools for Bangla and other under-resourced languages, helping promote inclusive and fair NLP research.
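A first step in imbalance analysis is to count how often each category occurs and compare every category against the most frequent one. The sketch below does this in plain Python. It assumes each row is available as a mapping from category name to a 0/1 indicator; the actual column names in the released files may differ, and the rows shown are made-up placeholders, not dataset statistics.

```python
# The 10 categories named in the description; the real column
# headers in the released files are an assumption here.
LABELS = [
    "Race", "Behavior", "Physical", "Class", "Religion",
    "Disability", "Ethnicity", "Gender", "Sexual Orientation", "Political",
]


def label_counts(rows, labels=LABELS):
    """Count how many rows carry each label (indicator value 1)."""
    return {
        label: sum(int(row.get(label, 0)) for row in rows)
        for label in labels
    }


def imbalance_ratios(counts):
    """Divide the most frequent label's count by each label's count.

    A ratio of 1.0 marks the most frequent label; larger values mean
    rarer labels. Labels that never occur get float("inf").
    """
    largest = max(counts.values())
    return {
        label: (largest / n if n else float("inf"))
        for label, n in counts.items()
    }


# Made-up toy rows for illustration only; missing keys count as 0.
toy_rows = [
    {"Religion": 1, "Political": 1},
    {"Political": 1},
    {"Gender": 1},
    {"Political": 1, "Gender": 1},
    {},
]

counts = label_counts(toy_rows)
ratios = imbalance_ratios(counts)
for label in LABELS:
    print(f"{label:20s} count={counts[label]} ratio={ratios[label]}")
```

Labels with a high ratio are the natural candidates for the resampling or augmentation techniques mentioned above.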

Steps to reproduce

Each row in the dataset represents a Bangla social media comment with its English translation and a binary indicator (0 or 1) for each of the 10 hate speech categories: Race, Behavior, Physical, Class, Religion, Disability, Ethnicity, Gender, Sexual Orientation, and Political.

To reproduce experiments:
1. Preprocess the comments (tokenization, stopword removal, Bangla stemming).
2. Extract features from the preprocessed text.
3. Apply any multi-label classification method (e.g., MultiOutputClassifier from scikit-learn).
4. Evaluate using the macro-averaged F1 score.

The dataset is directly usable in any multi-label learning pipeline.
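The steps above can be sketched as a small scikit-learn baseline. This is a minimal illustration under stated assumptions, not the authors' experimental setup: TF-IDF and logistic regression are stand-ins chosen here for the unspecified feature extractor and classifier, preprocessing is reduced to whitespace tokenization (stopword removal and Bangla stemming are omitted), and the texts are made-up placeholder tokens covering only three of the ten categories.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import make_pipeline

# Three of the ten categories, to keep the toy example short.
LABELS = ["Religion", "Gender", "Political"]


def build_model():
    """TF-IDF features feeding one binary classifier per label."""
    # Whitespace tokenization is an assumption made here: the default
    # token pattern can split Bangla words at combining vowel signs.
    features = TfidfVectorizer(tokenizer=str.split, token_pattern=None)
    classifier = MultiOutputClassifier(LogisticRegression(max_iter=1000))
    return make_pipeline(features, classifier)


def macro_f1_baseline(train_texts, train_y, test_texts, test_y):
    """Fit on the training split, return predictions and macro F1."""
    model = build_model()
    model.fit(train_texts, train_y)
    pred = model.predict(test_texts)
    score = f1_score(test_y, pred, average="macro", zero_division=0)
    return pred, score


# Made-up placeholder data, not real comments. Every label column
# contains both 0 and 1 so each per-label classifier can be fitted.
train_texts = ["rel a", "rel b", "gen a", "gen b",
               "pol a", "pol b", "rel gen", "a b"]
train_y = np.array([
    [1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0],
    [0, 0, 1], [0, 0, 1], [1, 1, 0], [0, 0, 0],
])
test_texts = ["rel b", "pol gen"]
test_y = np.array([[1, 0, 0], [0, 1, 1]])

pred, score = macro_f1_baseline(train_texts, train_y, test_texts, test_y)
print("predictions:\n", pred)
print("macro F1:", score)
```

Macro averaging weights every category equally, so rare categories affect the score as much as frequent ones, which is why it suits an imbalanced label set. With the real data, each label column must contain both classes in the training split, otherwise the per-label classifier cannot be fitted.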

Institutions

Daffodil International University

Categories

Computational Linguistics, Natural Language Processing, Multi-Classifiers, Corpus Linguistics, Bengali Language, Speech Identification, Text Mining, Low-Resource LLM

Licence