A Curated Crowdsourced Dataset of Luganda and Swahili Speech for Text-to-Speech Synthesis
Description
This dataset contains curated and preprocessed speech recordings in Luganda and Kiswahili for text-to-speech (TTS) research. The audio and transcripts were sourced from Mozilla Common Voice (Luganda v12.0 and Kiswahili v15.0) and curated for voice consistency and quality. The dataset is designed for training and evaluating end-to-end TTS systems in low-resource African languages.
The data is organized into two folders, Luganda and Kiswahili, each containing:
wavs.zip: A ZIP archive of .wav audio files from six selected female speakers per language. All audio files have been silence-trimmed, denoised using a causal DEMUCS model, and filtered using WV-MOS to retain only clips with a predicted Mean Opinion Score (MOS) ≥ 3.5.
metadata.csv: A CSV file with two columns, filename and transcript. Each row corresponds to one audio file in the wavs.zip archive and gives the sentence spoken in that clip.
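After downloading one language folder, it can be useful to confirm that metadata.csv and wavs.zip describe the same set of clips. The following is a minimal sketch of such a check, not part of the published dataset tooling. It assumes that metadata.csv is comma-delimited UTF-8 with a header row naming the columns filename and transcript, and it compares file stems so that entries with or without a .wav extension, or inside a subfolder of the archive, still match. The function name compare_metadata_to_archive is hypothetical.

```python
import csv
import zipfile
from pathlib import PurePosixPath


def compare_metadata_to_archive(metadata_csv, wavs_zip):
    """Return (missing_audio, missing_transcript) as sorted lists of clip stems."""
    # Assumption: the CSV has a header row with "filename" and "transcript".
    with open(metadata_csv, newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))

    # Compare on the stem so "clip_1" and "clip_1.wav" are treated alike.
    listed = {PurePosixPath(row["filename"]).stem for row in rows}

    with zipfile.ZipFile(wavs_zip) as archive:
        archived = {
            PurePosixPath(name).stem
            for name in archive.namelist()
            if name.lower().endswith(".wav")
        }

    missing_audio = sorted(listed - archived)       # listed in CSV, absent from ZIP
    missing_transcript = sorted(archived - listed)  # present in ZIP, absent from CSV
    return missing_audio, missing_transcript
```

Two empty lists indicate that every transcript row has an audio file and every audio file has a transcript row.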
Steps to reproduce
1. Download the speech data for Luganda (v12.0) and Kiswahili (v15.0) from Mozilla Common Voice.
2. For each language, identify the top 20 female speakers by utterance count.
3. Manually review a sample of five clips per speaker and select six speakers per language based on similarity in pitch, rhythm, and speaking style. Optionally, extract acoustic features such as pitch and mel-frequency cepstral coefficients (MFCCs) using a Python library such as librosa, and use k-means clustering to confirm acoustic similarity.
4. Preprocess each audio clip: trim silences using webrtcvad, then denoise with a causal waveform-based DEMUCS model (available at https://github.com/facebookresearch/denoiser).
5. Score each clip with the WV-MOS model (https://github.com/AndreevP/wvmos) and retain only clips with a predicted Mean Opinion Score (MOS) of 3.5 or higher.
6. Save the cleaned .wav files in a language-specific folder and create a metadata.csv file with two columns, filename and transcript, ensuring that every entry matches a file in the final audio set.
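The bookkeeping parts of this procedure (ranking speakers, applying the MOS threshold, and writing metadata.csv) can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' script. It assumes each Common Voice row is a dictionary with the keys client_id, path, sentence, and gender, as read from a release TSV file with csv.DictReader, and that female speakers are labelled "female" in these releases. It also assumes WV-MOS scores have already been computed and stored in a dictionary keyed by clip path. Manual speaker review, silence trimming, and denoising are not shown, and all function names are hypothetical.

```python
import csv
from collections import Counter


def top_female_speakers(rows, n=20):
    """Rank speakers labelled female by number of utterances; return the top n IDs."""
    counts = Counter(
        row["client_id"]
        for row in rows
        # Assumption: the gender field holds the literal value "female".
        if (row.get("gender") or "").strip().lower() == "female"
    )
    return [speaker for speaker, _ in counts.most_common(n)]


def build_metadata(rows, selected_speakers, mos_scores, threshold=3.5):
    """Keep clips from the selected speakers whose predicted MOS meets the threshold."""
    selected = set(selected_speakers)
    kept = []
    for row in rows:
        # Assumption: mos_scores maps a clip path to its precomputed WV-MOS score.
        score = mos_scores.get(row["path"])
        if row["client_id"] in selected and score is not None and score >= threshold:
            kept.append((row["path"], row["sentence"]))
    return kept


def write_metadata(pairs, handle):
    """Write (filename, transcript) pairs with the two-column header."""
    writer = csv.writer(handle)
    writer.writerow(["filename", "transcript"])
    writer.writerows(pairs)
```

In practice the six speakers passed as selected_speakers would be chosen by listening to the top 20 candidates, and the filenames written to metadata.csv would need to match the names of the converted .wav files rather than the original Common Voice clip names.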