RBVD: Regional Bangla Voice Dataset

Published: 29 July 2025 | Version 1 | DOI: 10.17632/cwz5fgxznr.1
Contributors:
Mushfiqur Rahman

Description

This dataset was collected through direct in-person voice recordings in eight districts of Bangladesh: Chottogram, Barishal, Rangpur, Noakhali, Mymensingh, Tangail, Jamalpur, and Natore. It contains speech samples that capture regional language variation. A set of 119 sentences was used to capture diverse pronunciation patterns. The dataset is organized into 9 folders, one per district, except Rangpur, which has two: one for female speakers (Rongpur_Female) and one for male speakers (Rongpur_Male).

Folder contents:

Chottogram: 119 sentences recorded; total file size 16,196 KB; total duration 421 sec.
Mymensingh: 117 sentences recorded; total file size 12,296 KB; total duration 389 sec.
Rongpur_Male: 119 sentences recorded; total file size 12,661 KB; total duration 378 sec.
Rongpur_Female: 118 sentences recorded; total file size 8,520 KB; total duration 263 sec.
Jamalpur: 119 sentences recorded; total file size 9,790 KB; total duration 308 sec.
Noakhali: 119 sentences recorded; total file size 10,048 KB; total duration 274 sec.
Tangail: 119 sentences recorded; total file size 11,862 KB; total duration 304 sec.
Barishal: 119 sentences recorded; total file size 14,706 KB; total duration 378 sec.
Natore: 118 sentences recorded; total file size 10,627 KB; total duration 331 sec.

This dataset is designed for analyzing and recognizing regional variation in the Bangla language, including accents, pronunciation, and dialects. In the future, it can be used to:

- Improve speech recognition systems by making them more accurate for different Bangla dialects.
- Develop regional voice assistants and translation tools that understand local speech patterns.
- Support linguistic research to preserve and study Bangla dialect diversity.
- Train AI models for accent classification, speaker identification, and voice-based authentication systems.

Data Source: 1,067 data samples were collected for this study to ensure a diverse representation of regional speech patterns.

Total Samples: 1,067
Total Duration: 3,046 sec
Total File Size: 106,706 KB
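
The per-folder figures in the description can be cross-checked against the stated totals with a few lines of Python. This is a minimal sketch, not part of the dataset: the dictionary name `FOLDERS` and the helper functions are hypothetical, the numbers are copied by hand from the description, and no audio files are read.

```python
# Per-folder figures copied from the description above:
# (number of recorded sentences, total file size in KB, total duration in sec).
# The names FOLDERS, totals, and mean_clip_seconds are illustrative only.
FOLDERS = {
    "Chottogram": (119, 16196, 421),
    "Mymensingh": (117, 12296, 389),
    "Rongpur_Male": (119, 12661, 378),
    "Rongpur_Female": (118, 8520, 263),
    "Jamalpur": (119, 9790, 308),
    "Noakhali": (119, 10048, 274),
    "Tangail": (119, 11862, 304),
    "Barishal": (119, 14706, 378),
    "Natore": (118, 10627, 331),
}


def totals(folders):
    """Return (total samples, total size in KB, total duration in sec)."""
    samples = sum(count for count, _, _ in folders.values())
    size_kb = sum(size for _, size, _ in folders.values())
    duration = sum(secs for _, _, secs in folders.values())
    return samples, size_kb, duration


def mean_clip_seconds(folders):
    """Return the average duration of one recording per folder, in seconds."""
    return {name: secs / count for name, (count, _, secs) in folders.items()}


if __name__ == "__main__":
    samples, size_kb, duration = totals(FOLDERS)
    print(f"samples={samples} size_kb={size_kb} duration_sec={duration}")
    for name, avg in mean_clip_seconds(FOLDERS).items():
        print(f"{name}: {avg:.2f} sec per recording")
```

Run as a script, the sums come out to 1,067 samples, 106,706 KB, and 3,046 sec, matching the totals stated in the description. The per-folder averages give a rough sense of clip length, which may help when choosing a fixed input window for an accent classifier.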

Files

Institutions

Daffodil International University

Categories

Linguistics, Computer Science, Sociolinguistics Variation

Licence