BanglaDial: A Merged and Imbalanced Text Dataset for Bengali Regional Dialect Analysis

Published: 29 September 2025 | Version 2 | DOI: 10.17632/sx6ybcps2n.2
Contributors:

Description

This dataset is gathered from online repositories and contains sentences in 12 distinct regional dialects of Bangladesh. Its primary goal is to support research in dialect classification, language modeling, and sociolinguistic analysis of Bangladeshi dialects. The distribution of dialects is imbalanced, which reflects the natural variation in speaker population and data availability across regions.

Before the corpus was finalized, several preprocessing steps were performed to ensure quality and consistency. The process began with identifying dataset sources and merging the different resources, followed by duplicate removal to avoid redundancy. Social-media-specific elements such as mentions and hashtags were cleaned, and emojis that did not contribute to textual meaning were eliminated. Next, punctuation and special characters were removed to maintain a cleaner text structure, and finally, whitespace normalization was applied to ensure uniform formatting. After these steps, the final dataset was generated in a ready-to-use format.

The dataset is structured in two columns: (i) Sentence, a text string written in Bengali, and (ii) Class, the name of the dialect region (e.g., Chittagong, Rajshahi). The dataset is provided in CSV and XLSX formats.

Dialect-Wise Sentence Distribution

Chittagong: 8,661
Kishoreganj: 8,694
Narail: 7,746
Tangail: 5,410
Rangpur: 5,881
Narsingdi: 5,735
Standard Bangla: 4,403
Barisal: 4,046
Sylhet: 3,710
Mymensingh: 3,096
Noakhali: 2,462
Rajshahi: 885
Total: 60,729 sentences
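The preprocessing steps described above can be approximated with a short Python sketch. This is not the authors' actual pipeline, which is not published in this record. The function names `clean_sentence` and `preprocess`, the regular expression, and the Unicode-category rules are all assumptions chosen for illustration, as is the decision to drop rows that become empty after cleaning.

```python
import re
import unicodedata

# Assumption: a mention or hashtag is "@" or "#" followed by non-space characters.
# \S+ is used instead of \w+ because Python's \w does not match Bengali vowel signs.
MENTION_OR_HASHTAG = re.compile(r"[@#]\S+")
WHITESPACE = re.compile(r"\s+")


def clean_sentence(text: str) -> str:
    """Strip mentions, hashtags, emojis, punctuation and extra whitespace."""
    text = MENTION_OR_HASHTAG.sub(" ", text)
    out = []
    prev_in_word = False  # True while the last kept character was a letter or mark
    for ch in text:
        category = unicodedata.category(ch)
        major = category[0]
        if major == "L":                      # letters (Bengali, Latin, ...)
            out.append(ch)
            prev_in_word = True
        elif major == "N":                    # digits
            out.append(ch)
            prev_in_word = False
        elif major == "M" and prev_in_word:   # vowel signs, virama, etc. attached to a letter
            out.append(ch)
        elif category == "Cf":                # zero-width joiners and similar: dropped (assumption)
            continue
        else:                                 # punctuation, symbols (incl. emoji), spaces, stray marks
            out.append(" ")
            prev_in_word = False
    return WHITESPACE.sub(" ", "".join(out)).strip()


def preprocess(rows):
    """Remove exact duplicate (sentence, dialect) pairs, then clean each sentence."""
    seen = set()
    result = []
    for sentence, dialect in rows:
        key = (sentence, dialect)
        if key in seen:
            continue
        seen.add(key)
        cleaned = clean_sentence(sentence)
        if cleaned:  # assumption: rows with no remaining text are discarded
            result.append((cleaned, dialect))
    return result
```

Here `rows` is any iterable of (Sentence, Class) pairs, for example the rows of the merged source files. Deduplication runs before cleaning to mirror the order given in the description. Deduplicating again after cleaning would also catch sentences that differ only in punctuation, but the description does not say whether that was done.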

Institutions

Daffodil International University

Categories

Natural Language Processing, Machine Translation, Language Identification, Bengali Language

Licence