YembaTones: An Annotated Dataset for Tonal and Syllabic Analysis of the Yemba Language

Published: 25 October 2023 | Version 3 | DOI: 10.17632/cx268tmrwn.3


YembaTones is an annotated dataset of tonal and syllabic variation in the Yemba language. It was created to support automatic tone detection and to expand the resources available for speech recognition and synthesis in this tonal language.

The dataset is derived from a dictionary of 344 Yemba/French words selected from commonly used phrases in the language. The words are grouped according to how their spellings differ in tone. Each word was recorded by 11 native Yemba speakers, most of them linguistics specialists with a strong command of the language's sounds. Recordings were made in various locations, including speakers' homes, university campuses, and workplaces. They were then cleaned and segmented in Audacity into individual audio files, one per isolated word pronunciation.

The resulting dataset consists of 3420 audio files, each annotated at the syllabic and tonal levels in Praat. It can be used to train and evaluate automatic tone detection models, and also for automatic speech recognition, speech synthesis in tonal and low-resource languages, and research in prosody, Yemba phonetics, speech acoustics, and phonetic linguistics. By addressing the scarcity of resources for Yemba, the dataset is intended to support the development of more accurate speech processing applications for tonal languages.
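Since the syllable and tone annotations were produced in Praat, they are presumably distributed as TextGrid files. The sketch below shows one way such a file could be read into Python to pair syllables with tone labels. It assumes Praat's long text format with interval tiers; the tier names `syllable` and `tone` and the labels in the sample are hypothetical placeholders, not taken from the dataset, so check the actual files before relying on it.

```python
import re

# Matches the start of each tier ("item [1]:"), but not the "item []:" header.
TIER_RE = re.compile(r'item \[\d+\]:')
# The first name = "..." inside a tier chunk is the tier's own name.
NAME_RE = re.compile(r'name = "([^"]*)"')
# One interval: start time, end time, label (long text format only).
INTERVAL_RE = re.compile(
    r'intervals \[\d+\]:\s*xmin = ([\d.eE+-]+)\s*xmax = ([\d.eE+-]+)\s*text = "([^"]*)"'
)


def parse_textgrid(text):
    """Return {tier_name: [(xmin, xmax, label), ...]} for interval tiers."""
    tiers = {}
    for chunk in TIER_RE.split(text)[1:]:  # chunk 0 is the file header
        name_match = NAME_RE.search(chunk)
        if name_match is None:
            continue
        tiers[name_match.group(1)] = [
            (float(start), float(end), label)
            for start, end, label in INTERVAL_RE.findall(chunk)
        ]
    return tiers


# Hypothetical two-syllable example; tier names and labels are placeholders.
SAMPLE = '''File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 0.9
tiers? <exists>
size = 2
item []:
    item [1]:
        class = "IntervalTier"
        name = "syllable"
        xmin = 0
        xmax = 0.9
        intervals: size = 2
        intervals [1]:
            xmin = 0
            xmax = 0.4
            text = "syl1"
        intervals [2]:
            xmin = 0.4
            xmax = 0.9
            text = "syl2"
    item [2]:
        class = "IntervalTier"
        name = "tone"
        xmin = 0
        xmax = 0.9
        intervals: size = 2
        intervals [1]:
            xmin = 0
            xmax = 0.4
            text = "H"
        intervals [2]:
            xmin = 0.4
            xmax = 0.9
            text = "L"
'''

tiers = parse_textgrid(SAMPLE)
# Pair each syllable with the tone label at the same position in the tone tier.
pairs = [(s[2], t[2]) for s, t in zip(tiers["syllable"], tiers["tone"])]
```

For a real file, the text would be read first, for example with `pathlib.Path(path).read_text(encoding="utf-8")`; Praat can also write UTF-16, so the encoding may need adjusting. This regex approach ignores point tiers, the short text format, and labels containing escaped quotes.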



Universite de Yaounde I


Computer Science, Artificial Intelligence, Speech Processing, Machine Learning