ONUBAD: An Extensive Dataset for Automated Translation of Bangla Regional Dialects into Standard Bangla Language

Published: 9 December 2024 | Version 2 | DOI: 10.17632/6ft99kf89b.2
Contributors:

Description

1. Although extensive research has been conducted on the Bangla language in natural language processing (NLP), a substantial resource gap exists for its regional dialects, including those spoken in Chittagong, Sylhet, and Barisal.

2. Linguists even classify these as separate languages. To address this gap, we introduce ONUBAD, an extensive, open-access dataset for the automated translation of the Chittagong, Sylhet, and Barisal dialects into Standard Bangla.

3. Translating regional dialects into Standard Bangla can enhance communication between local farmers and agricultural extension services, help preserve cultural identity and heritage, and provide a valuable resource for NLP research.

4. The data was collected from various Facebook pages, websites, and speakers of the regional dialects in Bangladesh. It was selected to ensure balanced representation across the data labels, and it was annotated by native experts in the Bangla regional dialects.

5. The dataset captures the most frequently used regional words, clauses, and sentences. It comprises a total of 6160 words, 520 clauses, and 3920 sentences across Chittagong, Barisal, Sylhet, and Standard Bangla.

The dataset details are as follows:

Barisal:
---------
Words: 1540
Clauses: 130
Sentences: 980

Sylhet:
--------
Words: 1540
Clauses: 130
Sentences: 980

Chittagong:
-------------
Words: 1540
Clauses: 130
Sentences: 980

Standard Bangla:
-------------------
Words: 1540
Clauses: 130
Sentences: 980

English Translation:
-------------------
Words: 1540
Clauses: 130
Sentences: 980
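
The stated totals can be checked against the per-variety counts. The short sketch below assumes that the totals of 6160 words, 520 clauses, and 3920 sentences cover only the four Bangla varieties (Barisal, Sylhet, Chittagong, and Standard Bangla), with the English translation counted separately. That reading is inferred from the arithmetic and is not stated explicitly in the description. The names PER_VARIETY, BANGLA_VARIETIES, and totals are illustrative only and are not part of the dataset.

```python
# Per-variety counts as listed in the description.
PER_VARIETY = {"words": 1540, "clauses": 130, "sentences": 980}

# Assumption: the stated totals cover these four varieties and
# exclude the English translation.
BANGLA_VARIETIES = ["Barisal", "Sylhet", "Chittagong", "Standard Bangla"]


def totals(per_variety, varieties):
    """Multiply each per-variety count by the number of varieties."""
    return {unit: count * len(varieties) for unit, count in per_variety.items()}


TOTALS = totals(PER_VARIETY, BANGLA_VARIETIES)
print(TOTALS)  # {'words': 6160, 'clauses': 520, 'sentences': 3920}
```

Under this assumption the computed values match the totals given in the description.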

Institutions

Bangladesh University of Engineering and Technology, Jahangirnagar University

Categories

Natural Language Processing, Machine Translation, Dialect, Bengali Language, Large Language Model

Licence