Swahili Verb Conjugation Dataset: A Comprehensive Analysis of Agglutination and Verb Structure Across Tenses and Persons

Published: 22 October 2024 | Version 1 | DOI: 10.17632/rvt89578g5.1
Contributors:

Description

This Swahili Verb Conjugation Dataset is a collection of more than 319,000 conjugated verb forms, compiled to capture the complexity of Swahili's agglutinative verb morphology. Swahili has a rich inflectional system in which verbs take prefixes and suffixes that encode grammatical information such as tense, aspect, person, number, and mood.

The dataset consists of a single CSV file. Each row holds one unique verb root (mzizi_wa_neno) together with its conjugated forms along the following dimensions:

Tenses: The dataset captures five core tenses: past, perfect, present, future, and simple present. These tenses are central to Swahili verb conjugation and differ significantly from English tense structures, which makes the dataset useful for handling tense markers in NLP tasks.

Persons and Numbers: Conjugations are provided for the 1st, 2nd, and 3rd persons, in both singular (umoja) and plural (wingi). Each person is conjugated across the five tenses, showing the morphological changes that depend on the subject.

Moods: The dataset includes the habitual mood (hali_ya_mazoea), as well as other modal forms and auxiliary verbs that are part of Swahili's verb system, such as kum (ability), kuwa (to be), and various hypothetical forms (e.g., ninge, unge, ange for the conditional).

The columns in the dataset include:

Verb Root (mzizi_wa_neno): The base form of the verb from which all conjugated forms are derived.

Conjugated Forms: One column per combination of person, number, and tense. For example, nafsi_ya_kwanza_umoja_wakati_uliopita holds the 1st person singular in the past tense.

Beyond the standard conjugations, the dataset also covers various auxiliary and hypothetical forms.
The extensive collection of forms makes this dataset an invaluable resource for researchers interested in Swahili Natural Language Processing (NLP), as it offers the morphological richness needed for tasks like tokenization, lemmatization, and syntactic parsing. Additionally, this dataset is adaptable for linguistic research beyond computational applications. It can be used to study Swahili verb morphology, tense-aspect systems, and cross-linguistic comparisons with other agglutinative languages.
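
As a rough illustration of the lemmatization use case described above, the sketch below builds a reverse lookup from conjugated forms to verb roots. It is not part of the dataset's tooling. Only the two column names are taken from the description; the function name build_lemma_lookup, the in-memory sample, and the example rows (soma, cheza) are assumptions made for illustration, and the real file contains many more columns whose names are not listed here.

```python
import csv
import io

# Column name taken from the dataset description.
ROOT_COLUMN = "mzizi_wa_neno"

# Illustrative two-row sample (an assumption, not copied from the dataset).
# The second column name is the example given in the description.
SAMPLE_CSV = """mzizi_wa_neno,nafsi_ya_kwanza_umoja_wakati_uliopita
soma,nilisoma
cheza,nilicheza
"""


def build_lemma_lookup(csv_file):
    """Map each conjugated form to the set of verb roots it belongs to."""
    lookup = {}
    for row in csv.DictReader(csv_file):
        root = row[ROOT_COLUMN].strip()
        for column, form in row.items():
            # Skip the root column itself, overflow fields, and empty cells.
            if column in (None, ROOT_COLUMN) or not form:
                continue
            # A set is used because one surface form may map to several roots.
            lookup.setdefault(form.strip(), set()).add(root)
    return lookup


lookup = build_lemma_lookup(io.StringIO(SAMPLE_CSV))
print(lookup["nilisoma"])  # {'soma'}
```

To run this against the real file, the io.StringIO sample would be replaced by an opened file handle, for example open(path, newline="", encoding="utf-8"), where path is wherever the CSV was saved.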

Steps to reproduce

1. Data Collection: Collect verb roots from Swahili literary sources and linguistic resources.
2. Manual Conjugation: Manually conjugate each verb across five tenses (past, present, perfect, future, habitual) and three persons (1st, 2nd, 3rd) in both singular and plural forms.
3. Data Entry: Organize the conjugated forms into a CSV file with one column for the verb root and separate columns for each conjugation form.
4. Data Validation: Review the conjugations against Swahili grammar rules to ensure accuracy.
5. Automation with Python: Use Python scripts to automate data consistency checks and proper formatting of the CSV file.
6. Final Review: Conduct a manual check of the data to ensure the final dataset is error-free.
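
The authors' Python scripts for step 5 are not included in this description. The following is only a minimal sketch of what such a consistency check could look like, assuming the checks of interest are a consistent column count, no empty cells, no stray whitespace, and no duplicated verb roots. The function name find_issues and the sample rows are hypothetical; only the column names come from the description.

```python
import csv
import io

# Column name taken from the dataset description.
ROOT_COLUMN = "mzizi_wa_neno"

# Hypothetical sample with deliberate problems on lines 3, 4, and 5.
SAMPLE_CSV = """mzizi_wa_neno,nafsi_ya_kwanza_umoja_wakati_uliopita
soma,nilisoma
cheza,
soma, nilisoma
imba
"""


def find_issues(csv_file):
    """Return (line_number, message) pairs for basic consistency problems."""
    issues = []
    seen_roots = {}  # root -> line where it first appeared
    reader = csv.reader(csv_file)
    header = next(reader)
    root_index = header.index(ROOT_COLUMN)
    for row in reader:
        line_number = reader.line_num
        if len(row) != len(header):
            issues.append((line_number, "wrong number of columns"))
            continue
        for name, cell in zip(header, row):
            if not cell.strip():
                issues.append((line_number, f"empty cell in {name}"))
            elif cell != cell.strip():
                issues.append((line_number, f"stray whitespace in {name}"))
        root = row[root_index].strip()
        if root and root in seen_roots:
            first = seen_roots[root]
            issues.append((line_number, f"duplicate root, first seen on line {first}"))
        elif root:
            seen_roots[root] = line_number
    return issues


issues = find_issues(io.StringIO(SAMPLE_CSV))
for line_number, message in issues:
    print(f"line {line_number}: {message}")
```

Checks of this kind catch formatting slips only. Whether a conjugation is grammatically correct still depends on the manual review in steps 4 and 6.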

Categories

Artificial Intelligence, Natural Language Processing, Swahili Language
