Swahili Corpus
Published: 24 January 2024 | Version 2 | DOI: 10.17632/d4yhn5b9n6.2
Contributors:
Noel Masasi, Bernard Masua
Description
The repository contains several text files, each corresponding to one category of the Swahili Corpus. The categories are Health (AFYA), Business and Industries (BIASHARA), Parliament (BUNGE), Religion (DINI), Education (ELIMU), News (HABARI), Agriculture (KILIMO), Social Media (MITANDAO), Non-Governmental Organizations (MASHIRIKA YA KIRAIA), Government (SERIKALI), Laws (SHERIA) and Politics (SIASA). There is also a combined text file for the Swahili Corpus, generated from all of the categories listed above.
Steps to reproduce
1. Identification of Categories
2. Data Collection from Official Sources
3. Download PDF and DOCX Documents
4. Python Script for Merging Documents
5. Cleaning Script
6. Generate output category statistics
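The record does not include the merging, cleaning, or statistics scripts, so the following is only a minimal sketch of what the last three steps might look like once each document has already been converted to plain text. The function names (`clean_text`, `merge_documents`, `category_stats`) and the cleaning rules (lowercasing, keeping only the letters a-z and the apostrophe) are assumptions made for illustration, not the authors' method.

```python
import re
from collections import Counter


def clean_text(raw: str) -> str:
    """Hypothetical cleaning rule: lowercase the text, replace anything that is
    not a-z, an apostrophe, or whitespace with a space, then collapse whitespace."""
    text = raw.lower()
    text = re.sub(r"[^a-z'\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def merge_documents(documents: list[str]) -> str:
    """Clean each document and join the non-empty results, one per line."""
    cleaned = (clean_text(doc) for doc in documents)
    return "\n".join(doc for doc in cleaned if doc)


def category_stats(text: str) -> dict:
    """Count whitespace-separated tokens and distinct word types in a category."""
    tokens = text.split()
    counts = Counter(tokens)
    return {
        "tokens": len(tokens),
        "types": len(counts),
        "most_common": counts.most_common(3),
    }


if __name__ == "__main__":
    # Two made-up snippets standing in for extracted AFYA (Health) documents.
    afya_docs = ["Afya  ni MUHIMU, 2024!", "Afya bora kwa wote."]
    afya_text = merge_documents(afya_docs)
    print(afya_text)
    print(category_stats(afya_text))
```

Keeping the apostrophe preserves Swahili spellings such as ng'ombe. Extracting text from the PDF and DOCX sources would need a third-party library and is left out of this sketch.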
Categories
Natural Language Processing, Machine Learning, Swahili Language, Text Processing, Corpus Analysis