Speech Recognition Datasets for Congolese Languages

Name: Speech Recognition Datasets for Congolese Languages
Creator: Ussen Kimanuka
Published: 2023-09-22T14:14:50.611Z
Keywords: Speech Processing, Natural Language Processing, Deep Learning, Self-Supervised Learning

Kimanuka, Ussen; wa Maina, Ciira; Büyük, Osman

doi:10.17632/28x8tc9n9k.1


Published: 22 September 2023 | Version 1 | DOI: 10.17632/28x8tc9n9k.1


Description

This dataset contains two new benchmark corpora for low-resource languages spoken in the Democratic Republic of the Congo: the Lingala Read Speech Corpus (LRSC), with 4.3 hours of labelled audio, and the Congolese Speech Radio Corpus (CSRC), with 741 hours of unlabelled audio spanning four significant low-resource languages of the region (Lingala, Tshiluba, Kikongo and Congolese Swahili). Speech and audio were collected through two processes: (1) for LRSC, 32 Congolese adult participants were instructed to sit in a relaxed manner within centimetres of an audio recording device or smartphone and read text utterances aloud; (2) for CSRC, recordings from the archives of a broadcast station were pre-processed and curated. Congolese languages tend to fall into the “low-resource” category: compared with “high-resource” languages, they have fewer accessible datasets, which limits the development of Conversational Artificial Intelligence. This gap motivated the creation of speech recognition datasets for low-resource Congolese languages. The dataset has two sections. The first is intended for training a supervised speech recognition module, while the second is intended for pre-training a self-supervised model. Both sections feature a wide variety of speech and audio recorded in various environments: in the first, each speech recording is paired with its transcription; the second is a collection of pre-processed raw audio data.
