YembaEGRA

Published: 7 November 2024| Version 1 | DOI: 10.17632/74p9d5frg3.1

Description

Yemba is a Bantu language spoken in the western region of Cameroon. It is one of the ten languages spoken by the Bamileke peoples.

The basic education system in Cameroon comprises three levels: level 1 (SIL-CEP), level 2 (CE1-CE2), and level 3 (CM1-CM2). Lessons in “National Languages and Cultures” (introduced in 2019) are guided by a curriculum that, for each level, contains all the teaching units as well as eight thematic fields (or centers of interest) around which learning takes place. This corpus was built for training automatic speech recognition models that can be used to facilitate the learning and assessment of national languages in the basic education system in Cameroon.

To build the word corpus, an educational facilitator proposed a set of words for each center of interest, and a linguist specializing in the Yemba language translated them, yielding a corpus of 60 words. Each word was then pronounced twice by 69 native speakers, all level 3 students (36 girls and 33 boys). The recordings were made in classrooms and quiet rooms close to the public schools of Melah and Toudjoua (located in the village of Bamendou in the Menoua department, West region, Cameroon).

In the metadata folder, the word corpus is provided in a CSV file named words_corpus. Information about each speaker (gender, age, class) is grouped in the speakers_description file, also in CSV format. The audio folder is divided into eight subfolders named CI1 to CI8, each corresponding to a center of interest. Each of these contains three subfolders named 1 to 3, one per level. Each level subfolder holds the audio files of the words belonging to that center of interest and level, grouped in subfolders named W1 to Wx (where x is the number of words of the center of interest). Each word folder contains audio files in WAV format.
Each audio file is named as follows: spkr_<speaker id>_word_<word id>_occ_<occurrence number>_ci_<center of interest id>_l_<level id>.wav. For example, the files spkr_2_word_40_occ_1_ci_5_l_3.wav and spkr_2_word_40_occ_2_ci_5_l_3.wav correspond respectively to occurrences 1 and 2 of word 40, belonging to center of interest 5, pronounced by speaker number 2 of level 3.
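
The naming scheme above lends itself to automatic parsing when preparing labels for a speech recognition model. The following is a minimal Python sketch, not part of the dataset: the names Recording, parse_recording_name and level_folder are hypothetical, and the top-level folder name "audio" is an assumption based on the description. The word subfolder (W1 to Wx) is deliberately not reconstructed, because the description does not state how word ids map to W indices.

```python
import re
from pathlib import PurePosixPath
from typing import NamedTuple, Optional


class Recording(NamedTuple):
    """Fields encoded in one audio file name (hypothetical helper type)."""
    speaker: int
    word: int
    occurrence: int
    center_of_interest: int
    level: int


# Pattern taken from the naming scheme in the description:
# spkr_<speaker id>_word_<word id>_occ_<occurrence number>_ci_<ci id>_l_<level id>.wav
_NAME_RE = re.compile(r"spkr_(\d+)_word_(\d+)_occ_(\d+)_ci_(\d+)_l_(\d+)\.wav")


def parse_recording_name(filename: str) -> Optional[Recording]:
    """Return the fields encoded in a file name, or None if it does not match."""
    match = _NAME_RE.fullmatch(filename)
    if match is None:
        return None
    return Recording(*(int(group) for group in match.groups()))


def level_folder(rec: Recording, audio_root: str = "audio") -> PurePosixPath:
    """Build the <audio_root>/CI<n>/<level> folder for a recording.

    The root folder name "audio" is an assumption; adjust it to the
    actual name in the downloaded archive.
    """
    return PurePosixPath(audio_root, f"CI{rec.center_of_interest}", str(rec.level))


if __name__ == "__main__":
    rec = parse_recording_name("spkr_2_word_40_occ_1_ci_5_l_3.wav")
    print(rec)
    if rec is not None:
        print(level_folder(rec))
```

Applied to the first example file name above, the parser yields speaker 2, word 40, occurrence 1, center of interest 5 and level 3; any name that does not follow the scheme returns None, which makes stray files easy to filter out.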

Institutions

Universite de Yaounde I

Categories

Speech Recognition

Funding

Sorbonne Université - IRD - UMMISCO - F-93143, Bondy, France

European project H2020-MSCA-RISE-202 Esperanto

Licence