Vashantor: A Large-scale Multilingual Benchmark Dataset for Automated Translation of Bangla Regional Dialects to Bangla Language

Published: 15 January 2024 | Version 2 | DOI: 10.17632/bj5jgk878b.2

Description

The Vashantor dataset consists of 32,500 sentences from different regions, including Chittagong, Noakhali, Sylhet, Barishal, and Mymensingh. The regional data is provided in two language formats, "Bangla" and "Banglish"; the core data is additionally provided in English. Each region and language combination has a fixed number of training, testing, and validation samples. The dataset details are as follows:

Specifics of the Core Data:
---------------------------
Bangla: Train 1875, Test 375, Validation 250 (Total 2500)
Banglish: Train 1875, Test 375, Validation 250 (Total 2500)
English: Train 1875, Test 375, Validation 250 (Total 2500)

Specifics of the Regional Data:
-------------------------------

Chittagong:
-----------
Bangla: Train 1875, Test 375, Validation 250 (Total 2500)
Banglish: Train 1875, Test 375, Validation 250 (Total 2500)

Noakhali:
---------
Bangla: Train 1875, Test 375, Validation 250 (Total 2500)
Banglish: Train 1875, Test 375, Validation 250 (Total 2500)

Sylhet:
-------
Bangla: Train 1875, Test 375, Validation 250 (Total 2500)
Banglish: Train 1875, Test 375, Validation 250 (Total 2500)

Barishal:
---------
Bangla: Train 1875, Test 375, Validation 250 (Total 2500)
Banglish: Train 1875, Test 375, Validation 250 (Total 2500)

Mymensingh:
-----------
Bangla: Train 1875, Test 375, Validation 250 (Total 2500)
Banglish: Train 1875, Test 375, Validation 250 (Total 2500)
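The counts in the description can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not part of the dataset distribution: it rebuilds the subset list from the numbers stated above and confirms that they add up to 32,500 sentences, with each subset split 75% / 15% / 10%. The names `SPLIT`, `build_manifest`, and the `"Core"` key are hypothetical and chosen only for illustration; they do not reflect the file or folder layout of the published archive.

```python
# Split sizes as stated in the description; identical for every subset.
SPLIT = {"train": 1875, "test": 375, "validation": 250}

# Core data has three language formats; each region has two.
CORE_FORMATS = ["Bangla", "Banglish", "English"]
REGIONS = ["Chittagong", "Noakhali", "Sylhet", "Barishal", "Mymensingh"]
REGIONAL_FORMATS = ["Bangla", "Banglish"]


def build_manifest():
    """Return {(group, language_format): split_sizes} for every subset.

    The ("Core", ...) key is an illustrative label, not a real path.
    """
    manifest = {("Core", fmt): dict(SPLIT) for fmt in CORE_FORMATS}
    for region in REGIONS:
        for fmt in REGIONAL_FORMATS:
            manifest[(region, fmt)] = dict(SPLIT)
    return manifest


manifest = build_manifest()

# Sentences per subset and across the whole dataset.
per_subset = sum(SPLIT.values())
total = sum(sum(sizes.values()) for sizes in manifest.values())

# Share of each split within a subset.
fractions = {name: size / per_subset for name, size in SPLIT.items()}

print(f"{len(manifest)} subsets x {per_subset} sentences = {total}")
print(fractions)
```

Under these assumptions there are 13 subsets (3 core plus 5 regions with 2 formats each), which matches the 32,500 total given in the description.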

Institutions

Ahsanullah University of Science and Technology

Categories

Natural Language Processing, Machine Translation, Dialect, Bangladesh, Deep Learning

Licence