BRWDS: A Multipurpose Dataset For Bangla Regional Word Detection

Published: 2 December 2024 | Version 3 | DOI: 10.17632/6pd2c48m66.3
Contributors:
…, Hana Sultan Chowdhury, …

Description

The BRWDS (Bangla Regional Word Dataset) is a collection of commonly used Bengali words that highlights the linguistic diversity across the 8 divisions of Bangladesh: Dhaka, Chittagong, Mymensingh, Sylhet, Rajshahi, Khulna, Barishal, and Rangpur. It addresses the communication barriers created by regional accents and variations in Bengali. In total, it includes 347 Bengali words that are frequently used in daily conversation across these regions. Although Bengali is spoken in every division, each region has its own accent, which leads to the variations in pronunciation and word usage captured in this dataset.
To create the dataset, 12 native speakers from the 8 divisions, as well as one additional district, contributed word samples. The data is stored in XLSX format, making it easily accessible for further research.
The dataset has several potential applications. It can support the development of systems that automatically detect regional variations in Bengali text, enabling better localization and understanding of regional dialects. It can help reduce communication barriers caused by accent differences within Bangladesh by offering a more standardized understanding of regional variations. It can also be used to translate regional words into standard Bengali (Chaste Bengali), making it easier for people to understand each other. More broadly, it supports research into linguistic diversity and provides a foundation for future work in speech and text processing.
The dataset has been reviewed and evaluated by 9 native speakers from each division to ensure its accuracy and its proper representation of regional language variations.
Looking forward, the dataset can be further enriched by adding voice data, which would support more advanced research in areas such as speech recognition, accent detection, and machine translation for regional language variants. The data is stored in the file Bangla RDS.xlsx. The first sheet, named "Region wise", contains the collected data. A second sheet, named "categorize data", contains the evaluated data, categorized and organized according to the common Chaste Bengali words.
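As a rough illustration of the detection and translation uses described above, the following Python sketch builds a regional-to-standard lookup and applies it to a list of tokens. It is a minimal sketch, not the contributors' method. The three-column row layout (division, regional word, standard word) is an assumption about how the workbook could be flattened, and the words are ASCII placeholders rather than entries from the dataset. The function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows in (division, regional_word, standard_word) form.
# The words are ASCII placeholders, NOT entries from the dataset, and the
# three-column layout is an assumption about the workbook's contents.
SAMPLE_ROWS = [
    ("Sylhet", "regional_a", "standard_1"),
    ("Chittagong", "regional_b", "standard_1"),
    ("Rangpur", "regional_c", "standard_2"),
    ("Sylhet", "regional_c", "standard_2"),
]


def build_lookup(rows):
    """Return (regional -> standard word, regional -> set of divisions)."""
    to_standard = {}
    divisions = defaultdict(set)
    for division, regional, standard in rows:
        to_standard[regional] = standard
        divisions[regional].add(division)
    return to_standard, dict(divisions)


def normalize(tokens, to_standard):
    """Replace known regional words with their standard form; keep the rest."""
    return [to_standard.get(token, token) for token in tokens]


def detect_divisions(tokens, divisions):
    """Count, per division, how many tokens are regional words used there."""
    counts = defaultdict(int)
    for token in tokens:
        for division in divisions.get(token, ()):
            counts[division] += 1
    return dict(counts)


to_standard, divisions = build_lookup(SAMPLE_ROWS)
tokens = ["regional_a", "unknown_word", "regional_c"]
print(normalize(tokens, to_standard))
print(detect_divisions(tokens, divisions))
```

A word shared by several divisions counts toward each of them, so the per-division counts are evidence rather than a definitive label. With real data, the rows would come from the workbook's sheets instead of the placeholder list.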

Files

Institutions

Independent University

Categories

Artificial Intelligence, Natural Language Processing, Speech Recognition, Word Recognition

Licence