BengVoice: A Stratified Dataset of Code-Mixed Bengali-English Voice Commands for Intent Classification in Conversational AI Systems

Published: 26 February 2026 | Version 1 | DOI: 10.17632/sr99ryf4ns.1
Contributor: Md Shahriar Hossain

Description

BengVoice is a curated benchmark collection of 1,200 Bengali voice assistant utterances for intent classification research in conversational AI systems. It addresses a gap in Natural Language Understanding resources for Bengali, one of the world's most widely spoken languages with over 230 million speakers, yet significantly underrepresented in publicly available language technology datasets.
The dataset comprises utterances across 10 voice assistant intent categories: weather queries, time queries, alarm setting, news requests, music playback, phone calls, messaging, translation, calculations, and general knowledge questions. Each intent category contains exactly 120 samples, so the classes are perfectly balanced. All 1,200 utterances are unique, with zero duplicates.
A distinguishing feature is authentic code-mixing behaviour, that is, the natural integration of English words within Bengali speech. In total, 290 samples (24.2%) contain code-mixed content, and the rate varies by domain: technical domains such as alarm setting show 71.7% code-mixing, while traditional domains show minimal mixing (0.8%). This reflects the natural speech patterns of urban Bengali speakers in Bangladesh. The utterances also reference Bangladeshi locations (Dhaka, Chittagong, Sylhet), local media (Prothom Alo, Kaler Kantho), and other cultural elements specific to Bangladesh, so that they represent real-world usage scenarios for Bengali-speaking populations.
For robust evaluation, the dataset provides stratified 5-fold cross-validation splits. Each fold contains exactly 240 samples, with 24 per intent. These fixed, balanced splits enable fair model comparison and support multiple evaluation methodologies, including traditional machine learning, deep learning, retrieval-augmented generation (RAG), and few-shot prompting.
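
The fold arithmetic described above (5 folds, 240 samples per fold, 24 per intent) can be illustrated with a minimal sketch. It is not the author's splitting procedure: the intent label strings, the round-robin assignment, and the random seed are all assumptions chosen for illustration, and the labels are placeholders, not dataset content.

```python
import random
from collections import Counter, defaultdict

# Hypothetical label strings; the released files may name the intents differently.
INTENTS = [
    "weather", "time", "alarm", "news", "music",
    "call", "message", "translation", "calculation", "general_knowledge",
]

def assign_stratified_folds(labels, n_folds=5, seed=0):
    """Return one fold index (0..n_folds-1) per sample, balanced within each label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    folds = [None] * len(labels)
    for indices in by_label.values():
        rng.shuffle(indices)  # randomise order inside each intent
        for position, idx in enumerate(indices):
            folds[idx] = position % n_folds  # deal samples out round-robin
    return folds

# Placeholder labels only: 120 samples for each of the 10 intents.
labels = [intent for intent in INTENTS for _ in range(120)]
folds = assign_stratified_folds(labels)

fold_sizes = Counter(folds)                    # samples per fold
per_fold_intent = Counter(zip(folds, labels))  # samples per (fold, intent) pair
```

Because 120 is divisible by 5, dealing each intent's samples out round-robin gives every fold exactly 24 samples per intent and 240 in total, which matches the balance stated in the description.
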
Baseline validation experiments using TF-IDF vectorization with character-level n-grams and Logistic Regression achieved a mean accuracy of 93.92% (±0.50%) across 5-fold cross-validation, with fold accuracies ranging from 93.33% to 94.58%. Per-intent accuracy ranged from 81.67% (news requests) to 100% (translation). These results establish reference benchmarks and support the quality of the dataset. The data is provided in JSON and CSV formats: complete datasets (with and without fold labels) and individual fold files for pre-separated evaluation. No proprietary software is required. This resource supports Bengali voice assistant development, intent classification benchmarking, code-mixing research, cross-lingual transfer learning, multilingual NLU systems, and low-resource language processing. It is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
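
To make "character-level n-grams" concrete, the sketch below counts the raw n-grams that such a vectorizer would start from. It is only an illustration under stated assumptions: the range of 2 to 4 characters is taken from the reproduction steps, the example string is a hypothetical romanised utterance and not a dataset sample, and the TF-IDF weighting and classifier of the actual baseline are not reproduced here.

```python
from collections import Counter

def char_ngrams(text, n_min=2, n_max=4):
    """Count every character n-gram of length n_min..n_max in text."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for start in range(len(text) - n + 1):
            counts[text[start:start + n]] += 1
    return counts

# Hypothetical romanised utterance used purely for illustration.
example = "alarm set koro"
features = char_ngrams(example)
```

The source does not name the library used for the baseline. In scikit-learn, a comparable setup would likely be `TfidfVectorizer(analyzer="char", ngram_range=(2, 4))` followed by `LogisticRegression`, but that pairing is an assumption about the implementation, and its preprocessing may differ in detail from the plain counting shown here.
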

Steps to reproduce

This dataset was developed through systematic collection and validation of Bengali voice assistant utterances.
Step 1 - Intent Taxonomy: Ten voice assistant intent categories were identified: weather queries, time queries, alarm setting, news requests, music playback, phone calls, messaging, translation, calculations, and general knowledge questions.
Step 2 - Utterance Generation: A native Bengali speaker generated natural utterances (120 per intent) with authentic linguistic patterns, Bangladeshi cultural references, and realistic code-mixing behaviour.
Step 3 - Quality Assurance: Utterances were validated for grammatical correctness, intent alignment, duplicates (zero duplicates verified), length constraints (10-54 characters), and code-mixing authenticity (290 code-mixed samples, 24.2%).
Step 4 - Stratification: The data was divided into 5 cross-validation folds using stratified sampling. Each fold contains 240 samples, with 24 samples per intent.
Step 5 - Baseline Validation: TF-IDF vectorization (character-level n-grams, n = 2 to 4) with Logistic Regression. Mean accuracy: 93.92% (±0.50%); fold accuracies: 93.33% to 94.58%; per-intent accuracy: 81.67% to 100%.
Step 6 - File Preparation: The data was exported in JSON and CSV formats (complete dataset and individual folds). No proprietary software is required.
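
The mechanical parts of Step 3 (duplicate detection, length constraints, and per-intent counts) can be sketched as follows. This is a hedged illustration, not the author's validation script: the `(utterance, intent)` row layout, the function name `quality_report`, and the choice to measure length in Unicode code points are assumptions, and the sample rows are invented placeholders.

```python
from collections import Counter

MIN_CHARS, MAX_CHARS = 10, 54  # length bounds stated in Step 3

def quality_report(rows):
    """rows: list of (utterance, intent) pairs. Returns simple QA findings."""
    utterances = [utterance.strip() for utterance, _ in rows]
    # Utterances that occur more than once after trimming whitespace.
    duplicates = [u for u, count in Counter(utterances).items() if count > 1]
    # Utterances outside the stated bounds (length counted in code points).
    bad_length = [u for u in utterances if not MIN_CHARS <= len(u) <= MAX_CHARS]
    per_intent = Counter(intent for _, intent in rows)
    return {"duplicates": duplicates, "bad_length": bad_length, "per_intent": per_intent}

# Invented placeholder rows, deliberately containing one duplicate and one short item.
rows = [
    ("ajker weather kemon", "weather"),
    ("alarm set koro", "alarm"),
    ("alarm set koro", "alarm"),
    ("hi", "general_knowledge"),
]
report = quality_report(rows)
```

On the released data, a check of this kind would be expected to return empty `duplicates` and `bad_length` lists and a count of 120 for every intent, given the figures stated above. Grammatical correctness, intent alignment, and code-mixing authenticity require human judgement and are not covered by this sketch.
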

Categories

Computer Science, Artificial Intelligence, Computational Linguistics, Data Science, Natural Language Processing, Artificial Intelligence Applications, Deep Learning, Few-Shot Learning, Voice Assistant, Retrieval-Augmented LLM

Licence