BRADS: A Multipurpose Audio Dataset For Bangla Regional Word Detection
Description
RESEARCH HYPOTHESIS
This dataset tests the hypothesis that Bangla ASR performance is affected by regional dialects and pronunciation variations. Despite Bangla’s widespread use, speech recognition models struggle with dialect diversity. This dataset is intended to support the development of more accurate and inclusive ASR models.

DATA SUMMARY
The dataset contains 298 Bangla words (233 regional, 65 chaste Bangla) recorded by 85 native speakers from the eight divisions of Bangladesh, resulting in 2,439 high-quality audio samples. Data formats are .wav (audio files) and .xlsx (text data). Words were recorded using recommended apps and verified manually, and some recordings include background noise to support real-world ASR training.

NOTABLE FINDINGS
Pronunciation varies significantly across regions. For example, the word "আমি" (Ami) [I] differs by region:
Chittagong (Chattogram): "আই" (Ayi)
Barisal: "মুই" (Mui)
Rajshahi: "আমাক" (Amak)
Rangpur: "হামি" (Hami)
Chittagong has the most dialectal variance, while Rangpur is closest to chaste Bangla. Most contributions came from speakers aged 23-27, so the data may reflect generational trends in usage.

HOW TO USE THE DATA
The dataset can be used to train ASR models based on CNN, RNN, and Transformer architectures. It can also support NLP applications, chatbots, and speech-to-text systems. Linguists can use it to study phonetic variation, and it may benefit Bangla-English machine translation.

VALUE OF THE DATA
This dataset fills a gap in Bangla ASR research, supporting inclusive AI development and the preservation of linguistic diversity. It can serve as a benchmark for Bangla speech-based AI, making voice technology more accessible.
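As a starting point for the model-training uses described above, the following is a minimal sketch of reading one of the .wav files into normalized samples using only the Python standard library. It assumes 16-bit PCM audio, which this page does not confirm; the function name `load_wav_mono16` is hypothetical and not part of the dataset.

```python
import array
import sys
import wave


def load_wav_mono16(source):
    """Read a 16-bit PCM .wav file (path or file object).

    Returns (sample_rate, samples), with samples scaled to [-1.0, 1.0).
    Assumption: the recordings are 16-bit PCM; other widths are rejected.
    """
    with wave.open(source, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM audio")
        rate = wav.getframerate()
        channels = wav.getnchannels()
        raw = wav.readframes(wav.getnframes())

    pcm = array.array("h")
    pcm.frombytes(raw)
    if sys.byteorder == "big":
        # WAV data is little-endian; swap bytes on big-endian machines.
        pcm.byteswap()

    # Samples are interleaved by channel; keep only the first channel.
    first_channel = pcm[::channels]
    return rate, [s / 32768.0 for s in first_channel]
```

The returned list can then be converted to whatever array or tensor type the chosen training framework expects.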
Steps to reproduce
DATA COLLECTION & REPRODUCTION PROCESS
Data collection followed a structured, reproducible methodology covering participant selection, word collection, recording procedures, data processing, and annotation.

The dataset was recorded by 85 native Bangla speakers from the eight divisions of Bangladesh: Dhaka, Chattogram, Barisal, Mymensingh, Rajshahi, Sylhet, Rangpur, and Khulna. Words were selected based on local surveys, Google Forms, and email responses to ensure linguistic authenticity. A total of 298 frequently used Bangla words were recorded, comprising 233 regional words and 65 chaste Bangla words. The majority of participants were aged 23-27, so this age group dominates the recordings.

Participants recorded words using Easy Recorder (Android), Hokusai 2 (iOS), and Raw Recorder (Web Application), producing high-quality audio in .wav format. They were instructed to speak clearly and pause for one second between words. To support real-world ASR training, background noise was deliberately included in select recordings.

Once collected, the data underwent a rigorous filtering and annotation process. Of the 2,980 initially recorded samples, 2,439 (about 82%) were retained after filtering out mispronounced words, poor-quality recordings, and incorrect word splits. A custom audio segmentation algorithm was used to separate words from continuous speech, followed by manual verification by university students to ensure accuracy and dialect correctness.

To reproduce this dataset, researchers should first identify regional words through surveys or interviews with native speakers. The same recording process should be followed using the recommended applications while maintaining consistent quality in .wav format. The collected speech data must then be processed using silence-based segmentation techniques to extract individual words, and manually verified for accuracy.
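The segmentation step above relies on the pauses speakers leave between words. The authors' custom algorithm is not described on this page, so the following is only a generic energy-threshold sketch of the idea, not their method. The function names, the 20 ms frame size, the 0.02 RMS threshold, and the 300 ms minimum pause are all assumptions chosen for illustration; samples are assumed to be floats in [-1.0, 1.0).

```python
import math


def frame_rms(samples, frame_len):
    """Yield the root-mean-square energy of each consecutive frame."""
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        yield math.sqrt(sum(s * s for s in frame) / len(frame))


def segment_words(samples, rate, frame_ms=20, threshold=0.02, min_silence_ms=300):
    """Split a recording into (start, end) sample ranges separated by silence.

    A frame counts as voiced when its RMS energy reaches `threshold`.
    A segment is closed once at least `min_silence_ms` of consecutive
    silent frames follow it. All default values are illustrative guesses.
    """
    frame_len = max(1, rate * frame_ms // 1000)
    min_silent_frames = max(1, min_silence_ms // frame_ms)

    segments = []
    seg_start = None    # index of the first frame in the open segment
    last_voiced = None  # index of the most recent voiced frame
    silent_run = 0      # consecutive silent frames since last_voiced

    for i, rms in enumerate(frame_rms(samples, frame_len)):
        if rms >= threshold:
            if seg_start is None:
                seg_start = i
            last_voiced = i
            silent_run = 0
        elif seg_start is not None:
            silent_run += 1
            if silent_run >= min_silent_frames:
                # Pause is long enough: close the segment at the last voiced frame.
                segments.append((seg_start * frame_len, (last_voiced + 1) * frame_len))
                seg_start = None

    if seg_start is not None:
        # The recording ended while a segment was still open.
        end = min(len(samples), (last_voiced + 1) * frame_len)
        segments.append((seg_start * frame_len, end))
    return segments
```

Because some recordings deliberately contain background noise, a fixed threshold like this one would likely need tuning per recording, which is one reason the manual verification pass remains necessary.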
The final dataset should be structured with audio files organized by region and word, and supporting text-based metadata stored in .xlsx format. By following these steps, researchers can replicate, validate, and expand this dataset for future ASR, NLP, and linguistic studies.
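One possible way to lay out the files by region and word is sketched below. The directory scheme, the identifier formats (`w001`, `s07`), and the column names are hypothetical, since this page does not specify them; the rows produced here would still need to be written to .xlsx with a spreadsheet library of your choice.

```python
from pathlib import PurePosixPath


def clip_path(root, region, word_id, speaker_id, take):
    """Build a region/word/clip path, e.g. BRADS/Barisal/w001/s07_01.wav.

    The layout is an assumption for illustration, not the published structure.
    """
    return PurePosixPath(root) / region / word_id / f"{speaker_id}_{take:02d}.wav"


def metadata_row(root, region, word_id, word, gloss, speaker_id, take):
    """Return one metadata record describing a single audio clip."""
    return {
        "region": region,
        "word_id": word_id,
        "word": word,
        "gloss": gloss,
        "speaker_id": speaker_id,
        "take": take,
        "file": str(clip_path(root, region, word_id, speaker_id, take)),
    }


# Example record for the Barisal form of "I" mentioned in the description.
row = metadata_row("BRADS", "Barisal", "w001", "মুই", "I", "s07", 1)
```

Keeping the file path inside each metadata row makes it straightforward to check that every audio clip has a matching text entry, and vice versa.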