BOISHOMMO: Multi-label Hate Speech Annotations for Bangla — A Low-Resource Language Perspective

Published: 7 July 2025| Version 2 | DOI: 10.17632/4tsb6tg9b2.2
Contributors:
Showrov Azam

Description

BOISHOMMO is a multi-label annotated dataset for hate speech analysis in Bangla, a morphologically rich and low-resource language. It fills a notable gap in Natural Language Processing by offering a rare and nuanced resource tailored for multi-label classification in a non-Latin-script language.

The dataset comprises 2,499 Bangla-language social media comments collected from public Facebook news pages such as Prothom Alo, Jugantor, and Kaler Kantho. Each comment was manually annotated by three native Bangla-speaking annotators following strict guidelines to ensure consistency and accuracy. Labels were assigned across 10 overlapping hate categories: Race, Behaviour, Physical, Class, Religion, Disability, Ethnicity, Gender, Sexual Orientation, and Political. The final annotation for each comment was determined by majority voting, and inter-annotator agreement was measured with Cohen’s Kappa to validate annotation quality.

Because of its multi-aspect annotation structure and its focus on a low-resource language, BOISHOMMO is a valuable benchmark for researchers working in hate speech detection, multilingual NLP, social media analysis, and multi-label text classification. It also supports the future development of machine learning models and linguistic tools for Bangla and other similarly under-resourced languages.
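The label-aggregation procedure described above can be sketched in a few lines. This is an illustrative example, not the authors' annotation code: the vote arrays below are hypothetical stand-ins for one label column from the three annotators, and Cohen's Kappa is computed pairwise with scikit-learn.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-label votes from three annotators over five comments
# (1 = the label applies, 0 = it does not).
a1 = np.array([1, 0, 1, 1, 0])
a2 = np.array([1, 0, 0, 1, 0])
a3 = np.array([0, 0, 1, 1, 1])

# Majority voting: a label is kept when at least two of the three
# annotators assigned it.
final = ((a1 + a2 + a3) >= 2).astype(int)
print("final labels:", final)  # → [1 0 1 1 0]

# Pairwise Cohen's Kappa as an inter-annotator agreement check.
for name, (x, y) in {"a1-a2": (a1, a2),
                     "a1-a3": (a1, a3),
                     "a2-a3": (a2, a3)}.items():
    print(name, round(cohen_kappa_score(x, y), 3))
```

The same aggregation is applied independently to each of the 10 hate categories, yielding the final binary label vector per comment.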

Files

Steps to reproduce

Each row in the dataset contains a Bangla social media comment and a binary indicator (0 or 1) for each of the 10 hate speech categories: Race, Behaviour, Physical, Class, Religion, Disability, Ethnicity, Gender, Sexual Orientation, and Political.

To reproduce the experiments:
1. Preprocess the comments (tokenization, stopword removal, Bangla stemming)
2. Extract features from the preprocessed text
3. Apply any multi-label classification method (e.g., MultiOutputClassifier from scikit-learn)
4. Evaluate using the macro-averaged F1 score

Random Forest performed best in our baseline experiments (macro F1 ≈ 86%), outperforming SVM and Logistic Regression. The dataset is directly usable in any multi-label learning pipeline.
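The pipeline above can be sketched with scikit-learn as follows. The texts and label matrix here are toy placeholders, not dataset rows: in practice, load the BOISHOMMO file so that `texts` holds the Bangla comments and `y` the 10 binary label columns. Character n-gram TF-IDF is used as an assumed stand-in for the feature-extraction step, since it needs no Bangla-specific tokenizer.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.multioutput import MultiOutputClassifier

LABELS = ["Race", "Behaviour", "Physical", "Class", "Religion",
          "Disability", "Ethnicity", "Gender", "Sexual Orientation", "Political"]

# Toy placeholder rows; replace with the actual comments and label columns.
texts = [
    "offensive political comment", "neutral remark about sports",
    "religious slur example", "another political attack",
    "harmless greeting", "gendered insult example",
]
y = np.array([
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 1, 0, 0],
])

# Step 2: character n-gram TF-IDF features (an assumed choice,
# reasonable for a morphologically rich language).
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X = vectorizer.fit_transform(texts)

# Step 3: one Random Forest per label via MultiOutputClassifier.
clf = MultiOutputClassifier(
    RandomForestClassifier(n_estimators=100, random_state=0))
clf.fit(X, y)

# Step 4: macro-averaged F1 over the 10 categories.
pred = clf.predict(X)
print("macro F1:", f1_score(y, pred, average="macro", zero_division=0))
```

A real run should of course evaluate on a held-out split (e.g. `train_test_split`) rather than on the training texts as this sketch does.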

Institutions

Daffodil International University

Categories

Computational Linguistics, Natural Language Processing, Multi-Classifiers, Corpus Linguistics, Bengali Language, Speech Identification, Text Mining, Low-Resource LLM

Licence