The Swahili Digraph Corpus

Name: The Swahili Digraph Corpus
Creator: Tirus Muya
Published: 2024-11-08T12:50:07.928Z
Keywords: Computational Linguistics, Phonetics, Transcription, Natural Language Processing, Speech Recognition, Machine Learning, Swahili Language

doi:10.17632/pttfc9cyrt.2

Published: 8 November 2024 | Version 2 | DOI: 10.17632/pttfc9cyrt.2

Description

The Swahili Digraph Corpus is a dataset built to capture the phonetic range of the Swahili language, intended as a resource for natural language processing (NLP) and machine learning research. The corpus covers nine Swahili digraphs: “ch,” “dh,” “gh,” “kh,” “ng’,” “ny,” “sh,” “th,” and “ng.” Each digraph is annotated with its frequency across the five vowels “a,” “e,” “i,” “o,” and “u,” which gives a basis for model training, testing, and validation. The distribution, which includes 9,483 instances of “ch” and 11,604 instances of “ng,” is balanced across vowel contexts so that machine learning models can generalize across them, which matters for robust digraph recognition. The corpus comprises 31,197 annotated words and also includes rarer digraphs such as “kh” and “ng’,” so models can learn both common and less frequent Swahili sounds. By covering a wide range of Swahili phonetic patterns, the corpus supports the development of context-sensitive Swahili language processing models and research in Swahili NLP.
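As a rough illustration of the per-vowel frequency annotation described above, the following minimal sketch tallies each digraph together with the vowel that follows it. It works purely on spelling, the function name `count_digraph_vowel_pairs` and the sample word list are hypothetical, and nothing here is taken from the corpus files or the author's tooling.

```python
from collections import Counter

# "ng'" is listed first so that it is matched before the shorter "ng".
DIGRAPHS = ("ng'", "ch", "dh", "gh", "kh", "ny", "sh", "th", "ng")
VOWELS = "aeiou"


def count_digraph_vowel_pairs(words):
    """Count (digraph, following vowel) pairs across an iterable of words."""
    counts = Counter()
    for word in words:
        # Lowercase and normalize the curly apostrophe used in "ng’".
        w = word.lower().replace("’", "'")
        i = 0
        while i < len(w):
            for dg in DIGRAPHS:
                if w.startswith(dg, i):
                    nxt = w[i + len(dg): i + len(dg) + 1]
                    # Only count the digraph when a vowel follows it.
                    if nxt and nxt in VOWELS:
                        counts[(dg, nxt)] += 1
                    i += len(dg)
                    break
            else:
                # No digraph starts at this position; move on one character.
                i += 1
    return counts


# Illustrative Swahili words (assumed sample, not drawn from the corpus).
sample = ["chakula", "ng'ombe", "nyumba", "shule", "ngoma", "thelathini"]
counts = count_digraph_vowel_pairs(sample)
for (digraph, vowel), n in sorted(counts.items()):
    print(digraph, vowel, n)
```

Because the matching is orthographic only, a real pipeline would likely need extra rules for letter sequences that look like a digraph but span a syllable or morpheme boundary.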

Steps to reproduce

Corpus Development Methodology

1. Identification and Categorization of Acoustic and Phonetic Features of Swahili Digraphs

The study systematically identified and categorized key Swahili digraphs, specifically "ch," "dh," "gh," "kh," "ng’," "ny," "sh," "th," and "ng." These digraphs were analysed based on their distinctive phonetic and acoustic properties, which are essential for accurate recognition in speech data. The analysis emphasized the frequency of these digraphs in relation to the five primary vowels of Swahili: "a," "e," "i," "o," and "u." Each digraph was classified according to its voicing, articulatory features, and phonetic context to clarify its role in the phonetic structure of Swahili.

2. Development of a CNN-Based Digraph Extraction Model

To extract and recognize Swahili digraphs from the secondary corpus, a Convolutional Neural Network (CNN) model was developed. The model comprised several key components:

i. Convolutional Layers: These layers capture patterns in speech signals linked to digraphs.
ii. Dense Layers: These layers help in recognizing complex patterns and classifying digraphs.
iii. Normalization Layers: These layers adjust input features to keep the model stable during training and use.
iv. One-Hot Encoding: This method converts digraph labels into binary vectors, making them usable as training targets for the CNN.

3. Evaluation of Model Performance

The performance of the CNN-based digraph extraction model was evaluated using several key metrics:

i. Mean Absolute Error (MAE): This metric assessed the average magnitude of prediction errors.
ii. Root Mean Squared Error (RMSE): This measured accuracy while emphasizing larger deviations.
iii. R-squared Value: This quantified the proportion of variance explained by the model.
iv. Test Loss: This evaluated the model's ability to generalize to unseen data.

The model achieved an R-squared value of 0.89, with low MAE and RMSE values and minimal test loss, indicating that it recognized and classified Swahili digraphs accurately.

4. Creation and Annotation of the Swahili Digraph Corpus

A Swahili digraph corpus was created to support the training, validation, and testing of a CNN-based model. The corpus includes 31,197 annotated words, which provide:

i. The frequency of occurrence of each digraph.
ii. The distribution of the vowels (“a,” “e,” “i,” “o,” and “u”) for each digraph.

Structured to ensure balanced representation across vowels, the corpus serves as a resource for Swahili digraph recognition tasks and a foundation for further work in Swahili speech recognition.
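The one-hot encoding in step 2 and the error metrics in step 3 can be sketched in plain Python as follows. This is a minimal illustration using the standard definitions of MAE, RMSE, and R-squared; the label order, the helper names, and the example scores are assumptions, and the sketch does not reproduce how the author's model or evaluation was actually implemented.

```python
import math

# Assumed label order; the source does not specify one.
DIGRAPHS = ["ch", "dh", "gh", "kh", "ng'", "ny", "sh", "th", "ng"]


def one_hot(label, classes=DIGRAPHS):
    """Return a binary vector with a single 1 at the index of `label`."""
    vec = [0] * len(classes)
    vec[classes.index(label)] = 1
    return vec


def mae(y_true, y_pred):
    """Mean absolute error: average magnitude of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes larger deviations more."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


def r_squared(y_true, y_pred):
    """Proportion of variance in y_true explained by y_pred."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical example: the true label is "sh"; the scores are made up.
y_true = one_hot("sh")
y_pred = [0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.8, 0.1, 0.0]

print("MAE:", mae(y_true, y_pred))
print("RMSE:", rmse(y_true, y_pred))
print("R-squared:", r_squared(y_true, y_pred))
```

In practice these metrics would be computed over the whole held-out test set, not over a single prediction vector.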

Institutions

Murang'a University School of Computing and Information Technology, Murang'a University of Technology
