Bangla Voice Dataset: Simple, Complex, and Compound Structures

Published: 2 September 2025 | Version 3 | DOI: 10.17632/2wn7c48dtp.3
Contributors:
Md Abdullah-Al-Kafi Kafi

Description

The dataset is a resource for linguistic analysis, natural language processing (NLP), and speech recognition tasks in the Bangla language. It has the following key features:

Textual Data:
Sentence Types: The corpus includes a balanced collection of simple, complex, and compound sentences, curated to represent diverse syntactic structures and real-world language usage in Bangla.
Diversity: Sentences cover a wide range of topics and contexts, ensuring linguistic richness and variety.

Voice Data:
Audio Recordings: Each sentence is paired with high-quality voice recordings by native Bangla speakers, ensuring accurate pronunciation, intonation, and regional linguistic nuances.

Annotation:
Sentence Labeling: Each sentence is tagged as simple, complex, or compound, aiding syntactic analysis and supervised learning applications.

Applications:
Speech Recognition and Synthesis: Suitable for training and evaluating speech-to-text and text-to-speech systems for Bangla.
Language Modeling: Supports NLP tasks such as machine translation, sentiment analysis, and syntactic parsing.
Educational Use: Useful for linguistic research, Bangla grammar teaching, and phonetic studies.

Compliance:
The dataset adheres to ethical guidelines, ensuring informed consent from all contributors.

This dataset is intended for researchers, developers, and educators working on technologies and studies involving the Bangla language.
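As a rough illustration of how the sentence labels could be checked before training, the sketch below counts sentences per label in a CSV manifest. The manifest layout (columns audio_path, text, label), the sample rows, and the function name label_counts are assumptions made for illustration only; the actual file layout of this dataset is not described here.

```python
import csv
import io
from collections import Counter

# The three sentence types named in the description.
LABELS = {"simple", "complex", "compound"}


def label_counts(manifest_file):
    """Count sentences per label in a CSV manifest.

    Assumes (hypothetically) the columns audio_path, text, label.
    Raises ValueError if a row carries an unknown label.
    """
    counts = Counter()
    for row in csv.DictReader(manifest_file):
        label = row["label"].strip().lower()
        if label not in LABELS:
            raise ValueError(f"unexpected label: {label!r}")
        counts[label] += 1
    return counts


# Placeholder rows standing in for a real manifest (not dataset content).
SAMPLE_CSV = (
    "audio_path,text,label\n"
    "audio/0001.wav,sentence one,simple\n"
    "audio/0002.wav,sentence two,complex\n"
    "audio/0003.wav,sentence three,compound\n"
    "audio/0004.wav,sentence four,Simple\n"
)

counts = label_counts(io.StringIO(SAMPLE_CSV))
for label in sorted(LABELS):
    print(label, counts[label])
```

With a real manifest, the same function could be called on an open file handle; roughly equal counts across the three labels would be consistent with the balance described above.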

Files

Institutions

Daffodil International University

Categories

Linguistics, Computer Science, Natural Language Processing, Audio Analysis

Licence