Karakalpak Speech Corpus

Published: 30 December 2025 | Version 1 | DOI: 10.17632/2th8jvft8f.1
Contributors: Kabul Khudaybergenov

Description

The Karakalpak Speech Corpus is the first large-scale, publicly available speech-to-text dataset for the Karakalpak language, designed to support the development, evaluation, and benchmarking of automatic speech recognition (ASR) systems for this low-resource Turkic language.

Research hypothesis

The core hypothesis behind this dataset is that high-quality, carefully curated speech–text pairs, even at moderate scale, can enable state-of-the-art self-supervised models (such as Wav2Vec 2.0) to achieve strong recognition performance for low-resource languages. By providing sufficient phonetic, lexical, and speaker diversity, the corpus aims to bridge the data gap that has historically limited Karakalpak speech technology.

What the data contains

The dataset consists of:
- Speech recordings in WAV format (16 kHz, 16-bit PCM)
- Manually verified transcriptions in standard Karakalpak Latin orthography
- Speaker-independent splits for training, validation, and testing

Each audio file corresponds to a single utterance, making the corpus suitable for end-to-end ASR, forced alignment, pronunciation modeling, and acoustic analysis. The recordings include:
- Read speech
- Conversational and narrative sentences
- Phonetically rich word sequences
- Numbers, commands, and daily expressions

This ensures broad coverage of Karakalpak phonology, morphology, and vocabulary.

How the data was gathered

The corpus was collected from native Karakalpak speakers under controlled recording conditions. All recordings were made in quiet indoor environments using consumer-grade microphones and laptops at 16 kHz. Speakers were instructed to read predefined texts clearly and naturally. All transcriptions were manually checked and normalized to remove spelling inconsistencies, Unicode artifacts, and non-Karakalpak characters, resulting in a clean and reproducible linguistic representation of spoken Karakalpak.

What the data shows

The dataset demonstrates that:
- Karakalpak phonemes and special letters (á, ó, ú, ı, ń, ś, ǵ) can be reliably captured and modeled
- A consistent orthography and vocabulary can be established for ASR training
- Speaker-independent evaluation is feasible

When used to fine-tune Wav2Vec 2.0 models, the corpus yields low word error rates (WER) and character error rates (CER), confirming that it contains sufficient acoustic and linguistic information for high-quality speech recognition.
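As a quick sanity check of the stated audio format, Python's standard-library wave module can confirm the 16 kHz sampling rate and 16-bit sample width. A minimal sketch, with a placeholder file name:

```python
import wave

def check_format(path: str) -> None:
    """Verify that a corpus WAV file matches the stated 16 kHz, 16-bit PCM format."""
    with wave.open(path, "rb") as w:
        assert w.getframerate() == 16000, f"{path}: expected 16 kHz, got {w.getframerate()} Hz"
        assert w.getsampwidth() == 2, f"{path}: expected 16-bit samples"
        duration = w.getnframes() / w.getframerate()
        print(f"{path}: {w.getnchannels()} channel(s), {duration:.2f} s")

check_format("utterance_0001.wav")  # hypothetical file name
```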

Steps to reproduce

This dataset was created through a controlled speech data collection, processing, and validation pipeline designed to support automatic speech recognition (ASR) research for the Karakalpak language.

1. Speaker Recruitment and Text Preparation

Native Karakalpak speakers were recruited from different age groups and regions to ensure linguistic diversity. A balanced set of sentences was prepared, covering:
- Common vocabulary
- Phonetically rich words
- Formal and informal speech
- Numbers, commands, and conversational phrases

The texts were manually reviewed to ensure correct Karakalpak orthography and consistency.

2. Audio Recording Protocol

Speech was recorded with high-quality USB microphones and laptop microphones in quiet indoor environments, using the following settings:
- Sampling rate: 16 kHz
- Bit depth: 16-bit
- Format: WAV (PCM)

Each speaker was instructed to:
- Read each sentence clearly at a natural speaking rate
- Avoid background noise and microphone distortion
- Repeat the sentence if pronunciation errors occurred

Each utterance was recorded as a separate WAV file.

3. File Organization and Transcription

Each audio file was paired with a corresponding text transcription, following the structure filename.wav, transcription (see the manifest-loading sketch below). All transcriptions were manually verified by native speakers to ensure:
- Correct spelling
- Consistent orthography
- No mismatches between audio and text

4. Text Normalization

Before model training, all transcriptions were normalized using a custom Python preprocessing pipeline (see the normalization sketch below):
- Lower-casing
- Removal of non-Karakalpak symbols
- Removal of hidden Unicode characters
- Consistent representation of special Karakalpak letters (e.g., á, ó, ú, ı, ń, ś, ǵ)

This guarantees that the vocabulary is stable and reproducible.

5. Dataset Splitting

The corpus was split into training, validation, and test sets. Speakers were kept disjoint between splits to prevent speaker leakage and to ensure a realistic evaluation of ASR performance (see the splitting sketch below).

6. Reproducible Model Training

To reproduce the reported ASR results, the following pipeline was used (see the fine-tuning sketch below):
- Base model: Wav2Vec 2.0
- Framework: PyTorch + HuggingFace Transformers
- Fine-tuning using CTC loss
- Sampling rate: 16 kHz
- Token set derived directly from corpus characters

Training scripts and configuration files are provided together with the dataset.

7. Evaluation

Model performance was evaluated using Word Error Rate (WER) and Character Error Rate (CER), computed on the held-out test set only (see the evaluation sketch below).
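Step 3 pairs each WAV file with its transcription. A minimal sketch of reading such a manifest with Python's csv module; the file name metadata.csv is an assumption, as the page only specifies the filename.wav, transcription pairing:

```python
import csv

# "metadata.csv" is a hypothetical name; the dataset page does not
# state what the manifest file is actually called.
with open("metadata.csv", encoding="utf-8", newline="") as f:
    pairs = [(row[0].strip(), row[1].strip())
             for row in csv.reader(f) if len(row) >= 2]

for wav_path, transcript in pairs[:3]:
    print(wav_path, "->", transcript)
```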
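Step 4 describes a custom Python normalization pipeline; the scripts shipped with the dataset are authoritative. As an illustration only, a minimal sketch assuming the working alphabet is the ASCII letters plus the seven special letters named above:

```python
import re
import unicodedata

# Working alphabet: ASCII letters plus the special Karakalpak letters
# named on this page (á, ó, ú, ı, ń, ś, ǵ). The exact inventory used
# by the authors' pipeline is an assumption.
KARAKALPAK_CHARS = set("abcdefghijklmnopqrstuvwxyzáóúıńśǵ' ")

def normalize(text: str) -> str:
    # Canonical composition so that "a + combining acute" and "á"
    # become one consistent code point.
    text = unicodedata.normalize("NFC", text)
    text = text.lower()
    # Strip hidden/zero-width Unicode characters.
    text = re.sub(r"[\u200b\u200c\u200d\ufeff]", "", text)
    # Drop anything outside the working alphabet.
    text = "".join(ch for ch in text if ch in KARAKALPAK_CHARS)
    # Collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Sálem,\u200b Álem!"))  # -> "sálem álem"
```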
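Step 5's speaker-disjoint splitting can be sketched as below, assuming each record carries a speaker identifier; the actual metadata layout and split ratios are not specified on this page:

```python
import random
from collections import defaultdict

def speaker_disjoint_split(records, train=0.8, valid=0.1, seed=42):
    """Split (speaker_id, wav_path, transcript) records so that no
    speaker appears in more than one split (no speaker leakage)."""
    by_speaker = defaultdict(list)
    for rec in records:
        by_speaker[rec[0]].append(rec)

    # Shuffle whole speakers, not individual utterances.
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)

    n_train = int(len(speakers) * train)
    n_valid = int(len(speakers) * valid)
    groups = {
        "train": speakers[:n_train],
        "valid": speakers[n_train:n_train + n_valid],
        "test": speakers[n_train + n_valid:],
    }
    return {name: [r for s in spk_ids for r in by_speaker[s]]
            for name, spk_ids in groups.items()}
```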
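Step 6 names the Wav2Vec 2.0 + PyTorch + HuggingFace Transformers + CTC setup; the training scripts provided with the dataset are the definitive reference. A minimal sketch of the processor and model construction, where the vocab.json file name and the facebook/wav2vec2-xls-r-300m base checkpoint are assumptions (the page does not name a specific checkpoint):

```python
from transformers import (Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor,
                          Wav2Vec2ForCTC, Wav2Vec2Processor)

# vocab.json: character-level vocabulary built from the normalized
# corpus transcripts (the file name here is an assumption).
tokenizer = Wav2Vec2CTCTokenizer("vocab.json", unk_token="[UNK]",
                                 pad_token="[PAD]", word_delimiter_token="|")
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1, sampling_rate=16000, padding_value=0.0,
    do_normalize=True, return_attention_mask=True)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor,
                              tokenizer=tokenizer)

# The page states only "Wav2Vec 2.0"; this multilingual checkpoint is
# a plausible stand-in, not necessarily the one the authors used.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
model.freeze_feature_encoder()  # common practice for small fine-tuning corpora
```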
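Step 7's WER and CER can be computed with the third-party jiwer package; whether the authors used jiwer or another implementation is not stated, and the example strings below are placeholders:

```python
from jiwer import cer, wer  # third-party: pip install jiwer

references = ["sálem álem"]   # hypothetical ground-truth transcript
hypotheses = ["salem alem"]   # hypothetical model output

print(f"WER: {wer(references, hypotheses):.3f}")
print(f"CER: {cer(references, hypotheses):.3f}")
```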

Institutions

Yeoju Technical Institute in Tashkent

Categories

Speech Recognition

Licence