ChatgaiyyaAlap: A Dataset for Conversion from Chittagonian Dialect to Standard Bangla
Description
Much recent research has addressed conversion from Standard Bangla to other languages, but only a limited amount of effective work exists on Bangla dialect conversion. We developed the “ChatgaiyyaAlap” dataset to support conversion from the Chittagonian dialect to Standard Bangla. The dataset comprises two comma-separated values (CSV) files. The first file holds parallel sentences in two columns: one for Standard Bangla sentences and one for Chittagonian sentences, so each row pairs a Standard Bangla sentence with its translation in the Chittagonian dialect. The second file is our dictionary file, which maps words between the Chittagonian dialect and Standard Bangla.

The Chittagonian sentences in the first CSV file were collected from diverse sources in the Chittagonian dialect, including YouTube and Facebook posts, comments, videos, short films, and dramas. After data collection and preprocessing, the collected data were evaluated by five professional human evaluators who are native speakers of the Chittagonian dialect and also know Standard Bangla.

Assembling sentences in the Chittagonian dialect was a slow process, and limited resources were our main obstacle. To speed up data collection, we also gathered Bangla sentences from different social media sites and translated them into the Chittagonian dialect with the assistance of five native speakers. Because the data were verified and translated by five different speakers, a single Bangla word could be rendered with more than one synonym. To avoid ambiguity, we tried to use the more prominent term consistently rather than alternating between synonyms for the same phrase. To keep the system simple and improve the translation process, we maintained a dictionary file that helps select the proper Chittagonian word for a given Standard Bangla word.
In total, the dataset consists of two files: one containing Chittagonian and Standard Bangla sentence pairs, and one containing the dictionary.
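As a rough illustration of how the two files might be consumed, the Python sketch below reads sentence pairs and the dictionary with the standard `csv` module and applies a naive word-by-word lookup. The column names (`chittagonian`, `bangla`), the Chittagonian-to-Bangla lookup direction, and the in-memory placeholder rows are all assumptions made for illustration; the actual headers and contents should be taken from the published CSV files. Word-by-word substitution is only a simple baseline and is not presented as the authors' conversion method.

```python
import csv
import io

# Column names are assumptions; adjust them to the headers in the real CSV files.
CHITTAGONIAN_COL = "chittagonian"
BANGLA_COL = "bangla"


def load_sentence_pairs(fileobj):
    """Return (Chittagonian, Standard Bangla) sentence pairs from a CSV file object."""
    reader = csv.DictReader(fileobj)
    return [(row[CHITTAGONIAN_COL].strip(), row[BANGLA_COL].strip()) for row in reader]


def load_dictionary(fileobj):
    """Return a Chittagonian -> Standard Bangla word mapping (first entry wins on duplicates)."""
    mapping = {}
    for row in csv.DictReader(fileobj):
        mapping.setdefault(row[CHITTAGONIAN_COL].strip(), row[BANGLA_COL].strip())
    return mapping


def convert_word_by_word(sentence, mapping):
    """Replace each whitespace-separated token found in the mapping; keep unknown tokens."""
    return " ".join(mapping.get(token, token) for token in sentence.split())


# Placeholder rows standing in for the real files (not actual dataset content).
pairs_csv = io.StringIO("bangla,chittagonian\nb1 b2,c1 c2\n")
dict_csv = io.StringIO("chittagonian,bangla\nc1,b1\nc2,b2\n")

pairs = load_sentence_pairs(pairs_csv)
word_map = load_dictionary(dict_csv)
converted = convert_word_by_word(pairs[0][0], word_map)
```

With the real files, the `io.StringIO` objects would be replaced by files opened with `open(path, encoding="utf-8", newline="")`, assuming the CSVs are UTF-8 encoded. Keeping only the first mapping per word mirrors the stated goal of preferring one term over alternative synonyms, though how the dictionary file actually handles duplicates is not specified here.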