Speech Recognition Datasets for Congolese Languages
This dataset contains two new benchmark corpora designed for low-resource languages spoken in the Democratic Republic of the Congo: the Lingala Read Speech Corpus (LRSC), with 4.3 hours of labelled audio, and the Congolese Speech Radio Corpus (CSRC), which offers 741 hours of unlabelled audio spanning four significant low-resource languages of the region (Lingala, Tshiluba, Kikongo and Congolese Swahili). The speech and audio were collected through two separate processes: (1) for LRSC, 32 Congolese adult participants were instructed to sit in a relaxed manner within centimetres of an audio recording device or smartphone and read the text utterances aloud; (2) for CSRC, recordings from the archives of a broadcast station were pre-processed and curated. Congolese languages tend to fall into the “low-resource” category: compared with “high-resource” languages, fewer datasets are available for them, which limits the development of Conversational Artificial Intelligence. This gap motivated the creation of speech recognition datasets for low-resource Congolese languages. The proposed dataset contains two sections. The first section is intended for training a supervised speech recognition model, while the second is intended for pre-training a self-supervised model. Both sections feature a wide variety of speech and audio recorded in various environments: the first pairs each speech recording with its corresponding transcription, and the second is a collection of pre-processed raw audio data.
Pan African University