A Kiswahili Dataset for Development of Text-To-Speech System

Published: 30 November 2021 | Version 1 | DOI: 10.17632/vbvj6j6pm9.1
Kiptoo Rono


The dataset contains 7,108 Kiswahili text files with matching audio files. It was created from open-source, non-copyrighted material: a Kiswahili audio Bible. The authors permit use for non-profit, educational, and public benefit purposes. Each downloaded audio file was longer than 12.5 s, so the files were programmatically split into short clips at silences. The clips were then recombined to random lengths so that each final audio file is between 1 s and 12.5 s long. This was done using Python 3. The audio files were saved as single-channel, 16-bit PCM WAVE files with a sampling rate of 22.05 kHz. The dataset contains approximately 106,000 Kiswahili words. The audio was transcribed, giving a mean of 14.96 words per text file, and the transcriptions were saved in CSV format. Each text file has three parts: a unique ID, the transcribed words, and the normalized words. The unique ID is a number assigned to each text file. The transcribed words are the text spoken by the reader. The normalized words are the transcribed words with abbreviations and numbers expanded into full words. Each audio clip was assigned the same unique ID as its text file.
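The recombination and saving steps described above can be sketched in Python 3 as follows. This is a minimal illustration under stated assumptions, not the authors' script: the description does not say which library or exact algorithm was used, so the function names `group_segments` and `save_clip`, the greedy merging strategy, and the use of the standard-library `wave` module are all assumptions. The sketch takes the durations of silence-delimited segments as given and only enforces the 12.5 s upper bound; groups shorter than 1 s can still occur and would need to be merged or discarded separately.

```python
import random
import wave

MIN_LEN_S = 1.0       # lower clip length from the description
MAX_LEN_S = 12.5      # upper clip length from the description
SAMPLE_RATE = 22050   # 22.05 kHz


def group_segments(durations, rng=None):
    """Greedily merge consecutive segment durations (seconds) into clips.

    Returns a list of clips, each a list of segment indices. A clip is
    closed once it reaches a randomly drawn target length, or earlier if
    adding the next segment would push it past MAX_LEN_S.
    """
    if rng is None:
        rng = random.Random()
    clips, current, total = [], [], 0.0
    target = rng.uniform(MIN_LEN_S, MAX_LEN_S)
    for index, duration in enumerate(durations):
        if duration > MAX_LEN_S:
            raise ValueError(f"segment {index} is longer than {MAX_LEN_S} s")
        if current and (total + duration > MAX_LEN_S or total >= target):
            clips.append(current)
            current, total = [], 0.0
            target = rng.uniform(MIN_LEN_S, MAX_LEN_S)
        current.append(index)
        total += duration
    if current:
        clips.append(current)
    return clips


def save_clip(target, pcm_bytes):
    """Write raw 16-bit PCM bytes as a mono WAVE file at 22.05 kHz.

    `target` may be a file path or a writable, seekable file object.
    """
    with wave.open(target, "wb") as wav_file:
        wav_file.setnchannels(1)          # single channel
        wav_file.setsampwidth(2)          # 2 bytes per sample = 16-bit
        wav_file.setframerate(SAMPLE_RATE)
        wav_file.writeframes(pcm_bytes)
```

For example, `group_segments([0.4, 3.0, 5.5, 2.0], random.Random(0))` returns index groups whose summed durations never exceed 12.5 s; passing a seeded `random.Random` makes the grouping repeatable.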


Steps to reproduce

1. The audio files for each Bible chapter were downloaded.
2. The downloaded files were then split into clips of 1 s to 12.5 s using the Python programming language.
3. Each audio file was assigned a unique ID and saved as a WAVE file.
4. A matching text file was transcribed for each audio file.
5. A unique ID was assigned to each text file, and non-standard Kiswahili words, including abbreviations, monetary units, and numbers, were expanded into their full forms. The text file therefore has 3 parts: unique ID, transcribed words, and normalized words. The text file was saved in CSV format.
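The three-part text record from the last step can be sketched in Python as follows. This is an illustration only: the steps do not specify the CSV delimiter, the ID format, or how normalization was carried out, so the helper names `normalize` and `format_row`, the comma delimiter, the sample ID, and the abbreviation entry are all hypothetical. The sketch expands only single digits and abbreviations supplied by the caller; multi-digit numbers and monetary amounts are left unchanged.

```python
import csv
import io

# Kiswahili words for the digits 0-9 (illustrative subset only).
DIGIT_WORDS = {
    "0": "sifuri", "1": "moja", "2": "mbili", "3": "tatu", "4": "nne",
    "5": "tano", "6": "sita", "7": "saba", "8": "nane", "9": "tisa",
}


def normalize(text, abbreviations):
    """Expand known abbreviations and single digits, token by token.

    `abbreviations` is a caller-supplied mapping (hypothetical here);
    tokens found in neither mapping are kept unchanged.
    """
    expanded = []
    for token in text.split():
        if token in abbreviations:
            expanded.append(abbreviations[token])
        else:
            expanded.append(DIGIT_WORDS.get(token, token))
    return " ".join(expanded)


def format_row(unique_id, transcription, abbreviations):
    """Return one CSV line: unique ID, transcribed words, normalized words."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow([unique_id, transcription,
                     normalize(transcription, abbreviations)])
    return buffer.getvalue().rstrip("\r\n")
```

Using `csv.writer` rather than joining strings by hand means a transcription that itself contains a comma is quoted correctly.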


Dedan Kimathi University of Technology


Data Science, Machine Learning, Deep Learning