Kothon: A Large-Scale Dataset for Machine Translation of the Chittagonian and Sylheti Dialects into Standard Bangla
Description
Chittagonian and Sylheti are two major and complex Bengali dialects spoken by over 24 million people. However, their well-written forms are becoming increasingly rare, putting them at risk of extinction. Because these dialects differ significantly from Standard Bangla, they often create communication barriers for non-dialectal speakers. Despite this, very few research efforts have addressed the issue, and existing resources are limited to small datasets that are insufficient for effective preservation of the dialects. To bridge this gap, this study focuses on the creation and evaluation of large-scale parallel corpora for Chittagonian-Bangla and Sylheti-Bangla translation. A total of 8,000 Chittagonian and 9,300 Sylheti sentences were collected and annotated by five native dialect-speaking annotators. Standard Bangla sentences were gathered from open-source resources, novels, and existing datasets, complemented by text scanned from printed books. A custom web-based annotation tool was developed to aid the annotation process. The quality and reliability of the datasets were ensured through a rigorous validation process in which independent native speakers reviewed the translations. This dataset serves as a valuable resource for advancing research in Bengali language processing and for supporting the creation of intelligent systems that help preserve dialects and promote digital communication.
Steps to reproduce
The zipped file contains two separate Excel files:
1. Chittagonian.xlsx
2. Sylheti.xlsx

Each Excel file contains three columns:
1. Standard Bangla: Source sentence in standard written Bangla.
2. Translated English: Machine-translated English version of the Standard Bangla sentence.
3. Dialect: The corresponding sentence in either the Chittagonian or Sylheti dialect.
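As a minimal sketch of how the files described above might be read, the following Python snippet loads one workbook and builds dialect-to-Standard-Bangla sentence pairs. It assumes that the column headers in the files are spelled exactly as listed above and that pandas and openpyxl are installed; the function names `load_rows` and `to_pairs` are hypothetical helpers introduced here for illustration and are not part of the dataset.

```python
from typing import Dict, Iterable, List, Tuple

# Column names as listed in the description; the exact header spelling
# inside the Excel files is an assumption.
COLUMNS = ("Standard Bangla", "Translated English", "Dialect")


def load_rows(path: str) -> List[Dict[str, object]]:
    """Read one workbook into a list of row dicts (needs pandas and openpyxl)."""
    import pandas as pd  # imported lazily so to_pairs works without pandas

    frame = pd.read_excel(path, dtype=str)
    missing = [name for name in COLUMNS if name not in frame.columns]
    if missing:
        raise ValueError(f"{path} is missing expected columns: {missing}")
    return frame[list(COLUMNS)].to_dict(orient="records")


def to_pairs(rows: Iterable[Dict[str, object]]) -> List[Tuple[str, str]]:
    """Build (dialect, standard_bangla) pairs, skipping rows with an empty cell."""
    pairs = []
    for row in rows:
        dialect = row.get("Dialect")
        standard = row.get("Standard Bangla")
        # Empty Excel cells arrive as NaN (a float), so the str check drops them.
        if isinstance(dialect, str) and isinstance(standard, str):
            dialect, standard = dialect.strip(), standard.strip()
            if dialect and standard:
                pairs.append((dialect, standard))
    return pairs
```

With the archive unzipped into the working directory (the location is an assumption), `to_pairs(load_rows("Chittagonian.xlsx"))` would return the Chittagonian pairs, and the same call with `"Sylheti.xlsx"` the Sylheti pairs.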
Institutions
- Khulna University of Engineering and Technology