Swahili Verb Conjugation Dataset: A Comprehensive Analysis of Agglutination and Verb Structure Across Tenses and Persons

Published: 15 January 2025 | Version 3 | DOI: 10.17632/rvt89578g5.3
Contributors:

Description

The Swahili Verb Conjugation Dataset is an extensive resource containing over 319,156 meticulously compiled verb forms, designed to capture the intricate agglutinative morphology of Swahili. This Bantu language, widely spoken across East Africa, features a highly developed inflectional system in which verbs are modified through prefixes and suffixes to encode grammatical categories such as tense, aspect, mood, person, and number.

Dataset Overview

The dataset is provided as a single CSV file, with each row representing a unique verb root (mzizi_wa_neno) and its corresponding conjugated forms across various linguistic dimensions:

Tenses: The dataset covers five fundamental tenses (past, perfect, present, future, and simple present), each essential for understanding the temporal structure of Swahili. These tenses exhibit significant differences from their English counterparts, making the dataset particularly valuable for natural language processing (NLP) tasks requiring precise tense handling.

Persons and Numbers: Conjugations are provided for the 1st, 2nd, and 3rd persons in both singular (umoja) and plural (wingi) forms. Each person is conjugated across all five tenses, offering a comprehensive representation of subject-verb agreement in Swahili.

Moods: The dataset incorporates a range of moods, including the habitual mood (hali_ya_mazoea) and various auxiliary and hypothetical forms. These include modal constructs like kum (ability), kuwa (to be), and conditional forms such as ninge, unge, ange.

Dataset Structure

The dataset includes the following columns:

Verb Root (mzizi_wa_neno): The base form from which all conjugated forms are derived.

Conjugated Forms: Columns detailing conjugations for the 1st, 2nd, and 3rd persons in singular and plural forms across all tenses. For example, nafsi_ya_kwanza_umoja_wakati_uliopita specifies the 1st person singular in the past tense.
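
The prefix-based agglutination described above can be illustrated with a minimal Python sketch. It assumes the regular pattern of standard Swahili grammar, in which a subject prefix and a tense marker are attached to the verb stem (for example ni + li + soma gives nilisoma, "I read" in the past). The function name conjugate, the dictionary keys, and the stem soma are illustrative choices and are not taken from the dataset. The sketch leaves out the simple present, the habitual, conditional forms, and monosyllabic verbs, and it is not the procedure used to build the dataset, which was conjugated manually.

```python
# Regular Swahili pattern: subject prefix + tense marker + verb stem.
# Assumption: the stem is given with its final vowel, e.g. "soma" (read).
SUBJECT_PREFIXES = {
    ("1st", "umoja"): "ni",   # I
    ("2nd", "umoja"): "u",    # you (singular)
    ("3rd", "umoja"): "a",    # he / she
    ("1st", "wingi"): "tu",   # we
    ("2nd", "wingi"): "m",    # you (plural)
    ("3rd", "wingi"): "wa",   # they
}

TENSE_MARKERS = {
    "past": "li",
    "perfect": "me",
    "present": "na",
    "future": "ta",
}


def conjugate(stem, person, number, tense):
    """Concatenate the three morphemes for a regular verb stem."""
    return SUBJECT_PREFIXES[(person, number)] + TENSE_MARKERS[tense] + stem


print(conjugate("soma", "1st", "umoja", "past"))    # nilisoma
print(conjugate("soma", "3rd", "wingi", "future"))  # watasoma
```

Forms generated this way could serve as a rough cross-check against the manually entered columns, with mismatches pointing either to irregular verbs or to entry errors.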

Applications

This dataset is an invaluable resource for both computational and theoretical linguistic research:

Natural Language Processing: The morphological richness of Swahili verbs makes the dataset particularly suited for NLP tasks, including tokenization, lemmatization, syntactic parsing, and machine translation.

Linguistic Analysis: Researchers can use the dataset to study Swahili’s verb morphology, tense-aspect systems, and comparative analyses with other agglutinative languages.

The dataset’s comprehensive coverage of conjugations, including auxiliary and hypothetical forms, ensures its utility for a wide range of applications, from building robust language models to exploring cross-linguistic phenomena in morphology and syntax.
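
As one illustration of the lemmatization use case, the sketch below builds a lookup table from conjugated form to verb root. It assumes the CSV has a header row containing mzizi_wa_neno and conjugation columns such as nafsi_ya_kwanza_umoja_wakati_uliopita, the two column names given in the description. The sample rows and the function name build_lemma_lookup are hypothetical and are not copied from the dataset.

```python
import csv
import io

# Illustrative rows only; the values are not copied from the dataset.
SAMPLE_CSV = """mzizi_wa_neno,nafsi_ya_kwanza_umoja_wakati_uliopita
soma,nilisoma
andika,niliandika
"""


def build_lemma_lookup(csv_file, root_column="mzizi_wa_neno"):
    """Map every non-empty conjugated form in the file to its verb root."""
    lookup = {}
    for row in csv.DictReader(csv_file):
        root = row[root_column]
        for column, form in row.items():
            if column != root_column and form:
                lookup[form] = root
    return lookup


lookup = build_lemma_lookup(io.StringIO(SAMPLE_CSV))
print(lookup["niliandika"])  # andika
```

If two roots share a surface form, this sketch keeps only the last one read. A fuller version might map each form to a set of candidate roots.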

Files

Steps to reproduce

1. Data Collection: Collect verb roots from Swahili literary sources and linguistic resources.
2. Manual Conjugation: Manually conjugate each verb across five tenses (past, present, perfect, future, habitual) and three persons (1st, 2nd, 3rd) in both singular and plural forms.
3. Data Entry: Organize the conjugated forms into a CSV file with one column for the verb root and separate columns for each conjugation form.
4. Data Validation: Review the conjugations against Swahili grammar rules to ensure accuracy.
5. Automation with Python: Use Python scripts to automate data consistency checks and proper formatting of the CSV file.
6. Final Review: Conduct a manual check of the data to ensure the final dataset is error-free.
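
The scripts used for the automated consistency checks are not included in this record. The following is a minimal sketch of what such a check might look like, assuming a header row with the root column mzizi_wa_neno. The function name check_rows, the three rules it applies (missing root, duplicate root, empty or surplus cells), and the sample rows are all hypothetical.

```python
import csv
import io

# Illustrative rows only; the values are not copied from the dataset.
SAMPLE_CSV = """mzizi_wa_neno,nafsi_ya_kwanza_umoja_wakati_uliopita
soma,nilisoma
andika,
soma,nilisoma
"""


def check_rows(csv_file, root_column="mzizi_wa_neno"):
    """Return (line_number, message) pairs for rows that fail basic checks."""
    problems = []
    seen_roots = set()
    reader = csv.DictReader(csv_file)
    # The header is line 1; this numbering assumes no multi-line cells.
    for line_number, row in enumerate(reader, start=2):
        root = (row.get(root_column) or "").strip()
        if not root:
            problems.append((line_number, "missing verb root"))
        elif root in seen_roots:
            problems.append((line_number, f"duplicate verb root: {root}"))
        else:
            seen_roots.add(root)
        for column, value in row.items():
            if column is None:
                # DictReader stores surplus cells under the key None.
                problems.append((line_number, "more cells than header columns"))
            elif column != root_column and (value is None or not value.strip()):
                problems.append((line_number, f"empty cell in {column}"))
    return problems


for line_number, message in check_rows(io.StringIO(SAMPLE_CSV)):
    print(line_number, message)
```

Reporting problems with their line numbers, rather than correcting them automatically, keeps the final decision with the manual review in step 6.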

Categories

Artificial Intelligence, Natural Language Processing, Swahili Language

Licence