ChakmaBridge: A Five-Way Parallel Corpus for Navigating the Script Divide in an Endangered Language
Description
The advancement of NLP technologies for low-resource and endangered languages is critically hindered by the scarcity of high-quality parallel corpora. This is particularly true for Chakma, which also faces the challenge of prevalent non-standard, romanized script usage in digital communication. To address this, we introduce ChakmaBridge, the first five-way parallel corpus for Chakma, containing 807 sentences aligned across English, Standard Bangla, Bengali-script Chakma, Romanized Bangla, and Romanized Chakma. The dataset was created by augmenting the MELD corpus with LLM-generated romanizations that were rigorously validated by native speakers. We release ChakmaBridge to facilitate research in low-resource machine translation (MT) and to aid the digital preservation of this endangered language.
Citation: Rahman, Md Abdur, Md Tofael Ahmed Bhuiyan, and Abdul Kadar Muhammad Masum. "ChakmaBridge: A Five-Way Parallel Corpus for Navigating the Script Divide in an Endangered Language." In Proceedings of the Second Workshop on Bangla Language Processing (BLP-2025), pp. 259-265. 2025.
Institutions
- Southeast University, Dhaka Division, Dhaka