BIDWESH: A Bangla Regional Based Hate Speech Detection dataset
Published: 21 July 2025 | Version 1 | DOI: 10.17632/bpkrvf882k.1
Contributors:
Bidyarthi Paul
Description
The BIDWESH dataset is the first benchmark corpus for hate speech detection in Bangla regional dialects, covering Noakhali, Chittagong, and Barishal. It consists of 9,183 manually translated and annotated instances derived from the BD-SHS dataset, ensuring balanced representation across dialects. Each sentence is labeled for hate or non-hate speech, with hate speech further annotated across 13 type categories and 7 target classes. The dataset preserves regional linguistic authenticity through native speaker translation and a rigorous five-stage validation process. BIDWESH supports multi-level classification, enabling nuanced analysis of hate expression in low-resource, dialectal Bangla contexts.
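As a rough illustration of the multi-level labeling described above, the sketch below models one instance in Python. The field names (`text`, `dialect`, `is_hate`, `hate_type`, `target`) and the placeholder category values are assumptions made for this example; they are not the column names or label names of the released files, which should be checked directly. The sketch also assumes that type and target labels appear only on hate instances, as the description implies.

```python
from dataclasses import dataclass
from typing import Optional

# The three dialects named in the dataset description.
DIALECTS = {"Noakhali", "Chittagong", "Barishal"}


@dataclass(frozen=True)
class Instance:
    """One annotated sentence. All field names here are hypothetical."""
    text: str
    dialect: str
    is_hate: bool                     # level 1: hate vs. non-hate
    hate_type: Optional[str] = None   # level 2: one of 13 type categories
    target: Optional[str] = None      # level 3: one of 7 target classes


def is_consistent(inst: Instance) -> bool:
    """Check the assumed hierarchy: only hate instances carry type/target labels."""
    if inst.dialect not in DIALECTS:
        return False
    has_fine_labels = inst.hate_type is not None or inst.target is not None
    return inst.is_hate or not has_fine_labels


# Placeholder values only; real sentences and category names come from the files.
hate_example = Instance("placeholder sentence", "Noakhali", True, "type_A", "target_X")
clean_example = Instance("placeholder sentence", "Barishal", False)
bad_example = Instance("placeholder sentence", "Chittagong", False, "type_A", None)

print(is_consistent(hate_example), is_consistent(clean_example), is_consistent(bad_example))
```

A check like this can be run over every row after loading the files to confirm that the binary label and the finer-grained labels agree before training a multi-level classifier.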
Institutions
Southeast University
Categories
Natural Language Processing, Dialect, Bangladesh