BanglaRegionalTextCorpus: A Curated Dataset for Four Regional Bangla Dialects
Published: 27 January 2026 | Version 4 | DOI: 10.17632/92r62h4k5k.4
Contributors:
Umme Ayman ayman
Description
The BanglaRegionalTextCorpus is a manually curated dataset of 4,653 Bangla sentences representing four regional dialects (Rangpur, Barisal, Narail, and Khulna), each paired with Standard Bangla and English translations. The data were collected through community interactions, field recordings, and online sources, and then validated linguistically by native speakers. The corpus highlights regional lexical, phonetic, and syntactic variation, providing a resource for dialect identification, translation, sociolinguistic analysis, and inclusive NLP model development.
Institutions
- Daffodil International University, Dhaka District, Dhaka
- Comilla University, Comilla
Categories
Data Science, Natural Language Processing, Text Processing, Sentence Processing