BanglaRegionalTextCorpus: A Curated Dataset for Four Regional Bangla Dialects

Published: 27 January 2026 | Version 4 | DOI: 10.17632/92r62h4k5k.4
Contributors:
Umme Ayman ayman

Description

The BanglaRegionalTextCorpus is a manually curated dataset of 4,653 Bangla sentences representing four regional dialects (Rangpur, Barisal, Narail, and Khulna), along with their Standard Bangla and English translations. The data were collected through community interactions, field recordings, and online sources, and then linguistically validated by native speakers. The corpus highlights regional lexical, phonetic, and syntactic variation, making it a resource for dialect identification, translation, sociolinguistic analysis, and inclusive NLP model development.

Categories

Data Science, Natural Language Processing, Text Processing, Sentence Processing

Licence