BD-Dialect: A Multiregional Bangla Language Dataset

Name: BD-Dialect: A Multiregional Bangla Language Dataset
Creator: Anika Rahman
Published: 2026-01-05T11:28:19.369Z
Keywords: Linguistics, Computer Science, Natural Language Processing, Dialectology, Low-Resource LLM

Rahman, Anika; Hasan Muna, Nafesha; Prity, Masuma Saba

doi:10.17632/k769s4vk5z.2

Published: 5 January 2026 | Version 2 | DOI: 10.17632/k769s4vk5z.2

Contributors:

Anika Rahman, Nafesha Hasan Muna, Masuma Saba Prity

Description

The BD-Dialect dataset provides parallel linguistic data for Standard Bangla and five of its major regional dialects: Noakhali, Sylheti, Chittagong, Rajshahi, and Mymensingh. It includes aligned translations at both the word and clause levels, along with English translations for cross-linguistic reference.

The dataset is organized around two primary CSV files, each containing 950 entries, accompanied by three supporting files:

BD-Dialect_Words.csv – Word-level aligned translations across all six language variants.
BD-Dialect_Clauses.csv – Clause/sentence-level aligned translations across all six language variants.
BD-Dialect_Metadata.csv – Detailed metadata describing each column/variable, including validation information.
BD-Dialect_Audio_Samples.zip – A small set of audio recordings (mp4 format) from native speakers for phonetic reference and verification.
BD-Dialect_Preprocessing_Scripts.ipynb – Python Jupyter notebook containing scripts for data cleaning, normalization, and basic analysis.

File Format: All CSV files are UTF-8 encoded with header rows and can be imported into Python (pandas), R, Excel, or similar tools. The Jupyter notebook requires a Python environment and was tested in Google Colab.

Usage Notes: Use BD-Dialect_Words.csv and BD-Dialect_Clauses.csv for linguistic analysis or model training. Refer to BD-Dialect_Metadata.csv to understand the structure, source, and validation status of each linguistic column. The audio samples are a limited pilot set for phonetic verification, not a comprehensive audio corpus. The preprocessing scripts demonstrate the data cleaning pipeline and can be adapted for further analysis.

Applications: This dataset is designed to support research and development activities, including:

Dialect Identification & NLP: Training and evaluating models for dialect classification, speech recognition, and text normalization.
Machine Translation: Developing systems for translation between Standard Bangla and its dialects, or between dialects and English.
Linguistic Research: Enabling comparative studies in dialectology, phonology, and lexical variation.
Resource for Low-Resource Languages: Providing a foundational, validated corpus for Bangla, an underrepresented language in NLP.
Educational Tools: Serving as a resource for language learning and sociolinguistic studies.

Citation: If you use this dataset, please cite: Rahman, Anika; Hasan Muna, Nafesha; Prity, Masuma Saba (2026), “BD-Dialect: A Multiregional Bangla Language Dataset”, Mendeley Data, V2, doi: 10.17632/k769s4vk5z.2

License: CC BY 4.0, which allows reuse with proper attribution.
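
The file format notes in the description (UTF-8 CSV files with header rows) suggest a simple loading step. The following is a minimal sketch using only the Python standard library, not the authors' preprocessing pipeline. The helper names load_rows and summarize are hypothetical, the files are assumed to sit in the working directory, and no column names are assumed: they are read from each file's header row. The "utf-8-sig" codec is an assumption chosen so that a leading byte-order mark, if one is present, does not end up in the first column name; it reads plain UTF-8 identically.

```python
import csv
from pathlib import Path


def load_rows(path, encoding="utf-8-sig"):
    """Read a UTF-8 CSV with a header row into a list of dicts, one per entry."""
    with Path(path).open(newline="", encoding=encoding) as handle:
        return list(csv.DictReader(handle))


def summarize(rows):
    """Return (entry count, column names) without assuming any column name."""
    columns = list(rows[0].keys()) if rows else []
    return len(rows), columns


if __name__ == "__main__":
    # File names come from the dataset description; the location is an assumption.
    for name in ("BD-Dialect_Words.csv", "BD-Dialect_Clauses.csv"):
        if Path(name).exists():
            count, columns = summarize(load_rows(name))
            print(name, count, columns)
        else:
            print(name, "not found in the working directory")
```

If the description is accurate, each of the two primary files should report 950 entries, which gives a quick integrity check after download. With pandas installed, pandas.read_csv(path, encoding="utf-8") would be the equivalent one-line load.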

Institutions

Stamford University Bangladesh
