Speech Recognition Datasets for Congolese Languages

Published: 22 September 2023 | Version 1 | DOI: 10.17632/28x8tc9n9k.1


This dataset contains two new benchmark corpora for low-resource languages spoken in the Democratic Republic of the Congo: the Lingala Read Speech Corpus (LRSC), with 4.3 hours of labelled audio, and the Congolese Speech Radio Corpus (CSRC), with 741 hours of unlabelled audio spanning four significant low-resource languages of the region (Lingala, Tshiluba, Kikongo and Congolese Swahili). The speech and audio were collected through two processes: (1) for LRSC, 32 Congolese adult participants were instructed to sit in a relaxed manner within centimetres of an audio recording device or smartphone and read the text utterances aloud; (2) for CSRC, recordings from the archives of a broadcast station were pre-processed and curated. Congolese languages tend to fall into the “low-resource” category: in contrast to “high-resource” languages, they have fewer accessible datasets, which limits the development of Conversational Artificial Intelligence. This scarcity motivated the creation of these speech recognition datasets for low-resource Congolese languages. The dataset has two sections. The first is intended for training a supervised speech recognition model, while the second is intended for pre-training a self-supervised model. Both sections feature a wide variety of speech and audio recorded in various environments: in the first section, each speech recording is paired with its transcription, and the second section is a collection of pre-processed raw audio.
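As a rough illustration of how the two sections might be consumed, the sketch below builds one list of (audio, transcription) pairs for supervised training and one list of audio-only files for self-supervised pre-training. The directory arguments, the audio extensions, the convention that a transcript is a `.txt` file sharing the audio file's name, and the function name `build_manifest` are all assumptions made for this example; they are not taken from the dataset's documented layout and should be adapted to the actual files.

```python
from pathlib import Path

# Assumed audio extensions; adjust to whatever the downloaded files actually use.
AUDIO_EXTENSIONS = {".wav", ".mp3", ".flac"}


def build_manifest(labelled_dir, unlabelled_dir):
    """Return (labelled, unlabelled) file listings.

    labelled   : list of {"audio": path, "text": transcription} dicts
    unlabelled : list of audio paths with no transcription

    Hypothetical layout: each labelled audio file has a sibling .txt
    transcript with the same stem. This is an assumption, not the
    dataset's confirmed structure.
    """
    labelled = []
    for audio in sorted(Path(labelled_dir).rglob("*")):
        if audio.suffix.lower() not in AUDIO_EXTENSIONS:
            continue
        transcript = audio.with_suffix(".txt")
        # Skip recordings whose transcript is missing rather than guessing.
        if transcript.is_file():
            text = transcript.read_text(encoding="utf-8").strip()
            labelled.append({"audio": str(audio), "text": text})

    unlabelled = [
        str(path)
        for path in sorted(Path(unlabelled_dir).rglob("*"))
        if path.suffix.lower() in AUDIO_EXTENSIONS
    ]
    return labelled, unlabelled
```

Calling `build_manifest("path/to/LRSC", "path/to/CSRC")` with the real download locations would give the two listings; keeping them separate mirrors the split between the labelled read speech and the unlabelled radio audio described above.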



Jomo Kenyatta University of Agriculture and Technology


Speech Processing, Natural Language Processing, Deep Learning, Self-Supervised Learning


Pan African University