MELD: A Multilingual Ethnic Dataset of Chakma, Garo, Marma, and Standard Bengali in Bengali Script

Published: 6 January 2025 | Version 1 | DOI: 10.17632/dy5dyfygbp.1

Description

The MELD dataset (Multilingual Ethnic Language Dataset) addresses the severe underrepresentation of ethnic languages in computational linguistics and natural language processing (NLP). It contains text samples from Chakma, Garo, and Marma transliterated into the Bengali script, alongside Standard Bengali, collected to reflect real-world use. The data was gathered through a rigorous process of interviews with native speakers, written contributions, and manual transliteration into the Bengali script. With 2,230 annotated sentences, the dataset captures the linguistic patterns of ethnic communities who write their native languages in the Bengali script, especially on social media, and offers linguistic insight into low-resource and endangered languages.

The dataset is suitable for tasks such as language identification, machine translation, and sentiment analysis, as well as for the study of linguistic diversity. By enabling NLP researchers and linguists to build language processing tools, it aims to foster inclusive technology development while promoting cultural preservation. Researchers are encouraged to use MELD to advance computational research in low-resource and ethnic languages.

Chakma:
---------
Words: 4529
Sentences: 808

Garo:
--------
Words: 1680
Sentences: 314

Marma:
-------------
Words: 1244
Sentences: 292

Standard Bangla:
-------------------
Words: 4380
Sentences: 816
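For language identification work it can help to check how balanced the four classes are. The following is a minimal sketch that tallies the per-language counts given above. The counts are copied from this description; the `STATS` dictionary and the `summarize` helper are illustrative names, not part of the dataset, and the script does not read the dataset files themselves.

```python
# Per-language counts copied from the dataset description.
# STATS and summarize() are illustrative names, not part of MELD.
STATS = {
    "Chakma": {"words": 4529, "sentences": 808},
    "Garo": {"words": 1680, "sentences": 314},
    "Marma": {"words": 1244, "sentences": 292},
    "Standard Bangla": {"words": 4380, "sentences": 816},
}


def summarize(stats):
    """Return total sentences, total words, and per-language ratios."""
    total_sentences = sum(s["sentences"] for s in stats.values())
    total_words = sum(s["words"] for s in stats.values())
    rows = {}
    for lang, s in stats.items():
        rows[lang] = {
            # Fraction of all sentences that belong to this language.
            "sentence_share": s["sentences"] / total_sentences,
            # Mean sentence length in words.
            "words_per_sentence": s["words"] / s["sentences"],
        }
    return total_sentences, total_words, rows


total_sentences, total_words, rows = summarize(STATS)
print(f"Total sentences: {total_sentences}")
print(f"Total words: {total_words}")
for lang, r in rows.items():
    print(
        f"{lang}: {r['sentence_share']:.1%} of sentences, "
        f"{r['words_per_sentence']:.2f} words per sentence"
    )
```

The sentence total matches the 2,230 stated in the description. Chakma and Standard Bangla together account for roughly three quarters of the sentences, so a classifier trained on this data may benefit from stratified splits or class weighting for Garo and Marma.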

Institutions

Daffodil International University

Categories

Natural Language Processing, Machine Translation, Language Identification, Ethnicity, Bengali Language
