BIDWESH: A Bangla Regional Based Hate Speech Detection dataset

Published: 21 July 2025 | Version 1 | DOI: 10.17632/bpkrvf882k.1
Contributors:
Bidyarthi Paul

Description

The BIDWESH dataset is the first benchmark corpus for hate speech detection in Bangla regional dialects, covering Noakhali, Chittagong, and Barishal. It consists of 9,183 instances derived from the BD-SHS dataset, manually translated and annotated, with balanced representation across the three dialects. Each sentence is labeled as hate or non-hate speech; hate speech is further annotated across 13 type categories and 7 target classes. Native speakers produced the translations, and a rigorous five-stage validation process helps preserve regional linguistic authenticity. BIDWESH supports multi-level classification, enabling nuanced analysis of hate expression in low-resource, dialectal Bangla contexts.
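The multi-level labeling described above can be sketched as a small record type. This is only an illustration under stated assumptions: the field names (`text`, `dialect`, `is_hate`, `hate_types`, `targets`) are hypothetical and may not match the released files, the type and target categories are represented by integer indices because their names are not listed here, and the rule that non-hate instances carry no type or target labels is inferred from the description rather than confirmed.

```python
from dataclasses import dataclass
from typing import Tuple

DIALECTS = ("Noakhali", "Chittagong", "Barishal")  # dialects named in the description
NUM_TYPES = 13    # hate type categories (names not given in the description)
NUM_TARGETS = 7   # hate target classes (names not given in the description)


@dataclass(frozen=True)
class Instance:
    # All field names are hypothetical; the released files may differ.
    text: str
    dialect: str
    is_hate: bool
    hate_types: Tuple[int, ...] = ()   # indices in range(NUM_TYPES)
    targets: Tuple[int, ...] = ()      # indices in range(NUM_TARGETS)


def is_consistent(inst: Instance) -> bool:
    """Check one instance against the assumed label hierarchy."""
    if inst.dialect not in DIALECTS:
        return False
    if not all(0 <= t < NUM_TYPES for t in inst.hate_types):
        return False
    if not all(0 <= t < NUM_TARGETS for t in inst.targets):
        return False
    if not inst.is_hate:
        # Assumption: non-hate rows carry no type or target labels.
        return not inst.hate_types and not inst.targets
    return True
```

A check like this could be run over every row after loading to catch label combinations that contradict the hierarchy, for example a non-hate sentence that still has a target class assigned.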

Institutions

Southeast University

Categories

Natural Language Processing, Dialect, Bangladesh

Licence