A Comprehensive Kurdish Speech Corpus for Speaker Identification and Verification
Description
Abstract / General Description: This dataset is a proprietary acoustic corpus developed for text-independent Speaker Identification and Verification (SIV) in a low-resource language, Central Kurdish. It contains 86,505 audio utterances recorded from 200 demographically diverse native speakers. It is intended to address the shortage of data for underrepresented languages in computational linguistics and to provide an empirical foundation for training deep learning biometric architectures. The data is organized for researchers who extract 2D log Mel-spectrograms and learn spatial features from them with Convolutional Neural Networks (CNNs).

Data Collection and Preprocessing Metrics:
- Audio Format: Raw audio stored in .ogg format.
- Utterance Duration: Normalized to 1.0 second per clip, which captures phonetic variation while keeping the model input small.
- Volume: 86,505 audio samples in total.
- Density: At least 400 audio samples per participant.

Dataset Partitioning (Train/Validation/Test Split): The test data is kept separate from the training pipeline to prevent data leakage. The dataset is structured into three subsets:
- Training and Validation Sets (73,538 files): This aggregate subset is divided into an 85% training set and a 15% validation set. The split uses stratified sampling to preserve the proportional representation of each speaker class in both subsets.
- Isolated Test Set (12,967 files): A separate directory of unseen audio samples, reserved for final model assessment and cross-dataset evaluation protocols.

Demographic Distribution:
- Total Participants: 200 native speakers.
- Gender Split: 101 male, 99 female.
- Age Cohorts:
  - Under 18: 6 participants
  - 18–25: 47 participants
  - 26–40: 82 participants
  - 41–60: 62 participants
  - Over 60: 3 participants

Recommended Usage & Technical Implementation: This corpus is designed for audio-to-image classification tasks. It supports the extraction of log Mel-spectrograms (64 Mel-frequency bins by 44 temporal frames) for training 2D-CNN architectures. The dataset structure supports cross-dataset evaluation for both multi-class closed-set speaker identification and open-set, threshold-based speaker verification.
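The stratified 85%/15% training/validation split described above can be illustrated with a minimal sketch. It assumes the 73,538 training and validation files have already been listed as (path, speaker_id) pairs. The function name, the file naming in the example, and the fixed random seed are hypothetical and are not part of the dataset description.

```python
import random
from collections import defaultdict


def stratified_split(samples, val_fraction=0.15, seed=0):
    """Split (path, speaker_id) pairs per speaker into train and validation.

    Each speaker contributes roughly `val_fraction` of its clips to the
    validation set, so speaker proportions match in both subsets.
    """
    by_speaker = defaultdict(list)
    for path, speaker in samples:
        by_speaker[speaker].append(path)

    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    train, val = [], []
    for speaker in sorted(by_speaker):
        paths = sorted(by_speaker[speaker])
        rng.shuffle(paths)
        n_val = round(len(paths) * val_fraction)
        val += [(p, speaker) for p in paths[:n_val]]
        train += [(p, speaker) for p in paths[n_val:]]
    return train, val


# Toy example with hypothetical file names: 3 speakers, 20 clips each.
samples = [
    (f"speaker_{s:03d}/clip_{i:03d}.ogg", s)
    for s in range(3)
    for i in range(20)
]
train, val = stratified_split(samples)
print(len(train), len(val))
```

Splitting within each speaker, instead of shuffling the whole file list once, is what keeps every speaker present in both subsets at the same ratio.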
Files
Steps to reproduce
The following pipeline converts the raw audio into the CNN-ready format described above:
1. Load the raw .ogg audio files with a library such as Librosa, at a sample rate of 22.05 kHz.
2. Standardize the audio duration to exactly 1.0 second: zero-pad shorter clips and truncate longer ones.
3. Extract the Mel-spectrogram using 64 Mel-frequency filter banks (n_mels=64).
4. Convert the resulting power spectrogram to a logarithmic decibel scale, which compresses the dynamic range and makes significant spectral patterns easier to distinguish.
5. Standardize the temporal axis to exactly 44 frames, giving a final 2D feature matrix of shape 64x44.
6. Apply global normalization (zero mean, unit variance), with statistics computed from the training subset only, before passing the tensor to a Convolutional Neural Network (CNN).
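The steps above can be sketched in Python with Librosa and NumPy. This is a minimal illustration under stated assumptions, not the authors' exact code. The FFT size and hop length are not given in the source, so the sketch uses Librosa's defaults (n_fft=2048, hop_length=512). The decibel reference (the per-clip maximum), the padding value for short spectrograms, and all function names are likewise assumptions.

```python
import numpy as np

SAMPLE_RATE = 22050   # 22.05 kHz, from the steps above
CLIP_SECONDS = 1.0    # from the steps above
N_MELS = 64           # from the steps above
N_FRAMES = 44         # from the steps above
N_FFT = 2048          # assumption: librosa default, not stated in the source
HOP_LENGTH = 512      # assumption: librosa default, not stated in the source


def fix_duration(y, sr=SAMPLE_RATE, seconds=CLIP_SECONDS):
    """Zero-pad or truncate a mono waveform to exactly `seconds`."""
    target = int(round(sr * seconds))
    if len(y) >= target:
        return y[:target]
    return np.pad(y, (0, target - len(y)))


def fix_frames(spec, n_frames=N_FRAMES):
    """Pad or truncate the time axis (axis 1) to `n_frames` columns."""
    if spec.shape[1] >= n_frames:
        return spec[:, :n_frames]
    pad = n_frames - spec.shape[1]
    # Assumption: pad with the quietest value present in the spectrogram.
    return np.pad(spec, ((0, 0), (0, pad)), constant_values=spec.min())


def log_mel(path):
    """Load one audio file and return a 64x44 log Mel-spectrogram in dB."""
    import librosa  # imported here so the NumPy helpers work without it

    y, _ = librosa.load(path, sr=SAMPLE_RATE, mono=True)
    y = fix_duration(y)
    power = librosa.feature.melspectrogram(
        y=y, sr=SAMPLE_RATE, n_fft=N_FFT, hop_length=HOP_LENGTH, n_mels=N_MELS
    )
    db = librosa.power_to_db(power, ref=np.max)  # assumption: per-clip reference
    return fix_frames(db)


def fit_normalizer(train_features):
    """Global mean and standard deviation over the training features only."""
    stacked = np.stack(train_features)
    return float(stacked.mean()), float(stacked.std())


def normalize(features, mean, std, eps=1e-8):
    """Apply the training-set statistics to any subset."""
    return (features - mean) / (std + eps)
```

With these assumed defaults, a 1.0-second clip at 22,050 Hz yields 1 + 22050 // 512 = 44 frames under Librosa's centered framing, so `fix_frames` acts only as a safeguard. The mean and standard deviation returned by `fit_normalizer` should be reused unchanged for the validation and test subsets, so that no test statistics leak into training.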
Institutions
- University of Halabja, Sulaymaniyah, Halabja