A Dataset for Translating Local Bangla (Sylheti) Dialects into Standard Bangla

Published: 10 March 2025 | Version 3 | DOI: 10.17632/5rmskrvh6g.3
Contributors:
Tabia Tanzin Prama,

Description

This dataset comprises 5002 parallel sentences for translating Sylheti dialects into Standard Bangla, addressing the lack of digital and linguistic resources for Sylheti. The data is structured into two columns:

Sylheti Sentences: Original sentences in Sylheti.
Standard Bangla Sentences: Corresponding translations into Standard Bangla.

The corpus was curated from diverse sources such as Bangladeshi newspapers, social media, and native speakers. Rigorous preprocessing ensured sentence alignment, linguistic accuracy, and consistency.

Applications: Machine Translation, Text Classification, Named Entity Recognition (NER), Sentiment Analysis, Language Modeling
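The two-column layout described above can be read into sentence pairs along the following lines. This is a minimal sketch, not the author's tooling: it assumes the data has been exported to CSV with header names matching the column labels in the description, and `SentencePair` and `read_pairs` are hypothetical names introduced for illustration. The sample rows are placeholders, not real dataset content.

```python
import csv
import io
from typing import NamedTuple


class SentencePair(NamedTuple):
    """One aligned row: a Sylheti sentence and its Standard Bangla translation."""
    sylheti: str
    standard_bangla: str


def read_pairs(handle,
               sylheti_col="Sylheti Sentences",
               bangla_col="Standard Bangla Sentences"):
    """Read aligned pairs from a CSV handle.

    The header names are assumptions taken from the column labels in the
    description; adjust them to match the actual file.
    """
    pairs = []
    for row in csv.DictReader(handle):
        src = (row.get(sylheti_col) or "").strip()
        tgt = (row.get(bangla_col) or "").strip()
        # Skip rows where either side is missing, since they are not usable pairs.
        if src and tgt:
            pairs.append(SentencePair(src, tgt))
    return pairs


# Placeholder rows standing in for real sentences.
sample = io.StringIO(
    "Sylheti Sentences,Standard Bangla Sentences\n"
    "<sylheti sentence 1>,<standard bangla sentence 1>\n"
    "<sylheti sentence 2>,<standard bangla sentence 2>\n"
)
pairs = read_pairs(sample)
```

For a real file, the same function could be called with `open(path, encoding="utf-8", newline="")` in place of the in-memory sample, where `path` points at the exported CSV.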

Steps to reproduce

1. Collect Data: Gather bilingual Sylheti-Standard Bangla sentence pairs from newspapers, social media, and native speakers.
2. Curate Data: Align and review sentences manually to ensure accuracy.
3. Organize Data: Save aligned sentences in two Excel files, one for Sylheti and one for Standard Bangla.
4. Preprocess: Remove duplicates, correct errors, and ensure consistency for high-quality data.
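The alignment check and duplicate removal in the steps above might look roughly like this. It is a sketch under stated assumptions, not the procedure actually used for the dataset: the two files are assumed to have already been read into lists of strings (one sentence per entry, in matching order), and the whitespace and Unicode NFC normalization is an illustrative choice, since the source does not say which consistency rules were applied. `normalize` and `align_and_clean` are hypothetical helper names.

```python
import unicodedata


def normalize(sentence):
    """Apply Unicode NFC normalization and collapse runs of whitespace.

    Assumed cleaning rules; the dataset description does not specify them.
    """
    sentence = unicodedata.normalize("NFC", sentence)
    return " ".join(sentence.split())


def align_and_clean(sylheti_lines, bangla_lines):
    """Pair sentences by position, then drop empty and duplicate pairs."""
    if len(sylheti_lines) != len(bangla_lines):
        # Positional alignment only makes sense when both sides have equal length.
        raise ValueError("the two sentence lists must have the same length")

    seen = set()
    cleaned = []
    for src, tgt in zip(sylheti_lines, bangla_lines):
        pair = (normalize(src), normalize(tgt))
        if not pair[0] or not pair[1]:
            continue  # one side is empty after cleaning
        if pair in seen:
            continue  # exact duplicate of an earlier pair
        seen.add(pair)
        cleaned.append(pair)
    return cleaned
```

Duplicates are detected on the normalized pair rather than on one side only, so a Sylheti sentence with two different Standard Bangla renderings is kept twice. Manual review and error correction, as described in the steps, are outside what a script like this can do.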

Institutions

Jahangirnagar University

Categories

Natural Language Processing, Machine Translation, Dialect
