Swahili Corpus

Published: 24 January 2024 | Version 2 | DOI: 10.17632/d4yhn5b9n6.2
Noel Masasi, Bernard Masua


The repository contains several text files, each corresponding to one category of the Swahili Corpus. The categories are Health (AFYA), Business and Industries (BIASHARA), Parliament (BUNGE), Religion (DINI), Education (ELIMU), News (HABARI), Agriculture (KILIMO), Social Media (MITANDAO), Non-Governmental Organizations (MASHIRIKA YA KIRAIA), Government (SERIKALI), Laws (SHERIA) and Politics (SIASA). There is also a combined text file for the Swahili Corpus, generated from all of the categories listed above.
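As a rough illustration of how the per-category files relate to the combined file, the sketch below concatenates one text file per category into a single corpus file. The file naming (`<CATEGORY>.txt`), the output name `combined.txt`, and the function `combine_categories` are assumptions made for this example; the repository's actual file names and the authors' own merging script may differ.

```python
from pathlib import Path

# Category labels taken from the description (Swahili label -> English gloss).
CATEGORIES = {
    "AFYA": "Health",
    "BIASHARA": "Business and Industries",
    "BUNGE": "Parliament",
    "DINI": "Religion",
    "ELIMU": "Education",
    "HABARI": "News",
    "KILIMO": "Agriculture",
    "MITANDAO": "Social Media",
    "MASHIRIKA YA KIRAIA": "Non-Governmental Organizations",
    "SERIKALI": "Government",
    "SHERIA": "Laws",
    "SIASA": "Politics",
}


def combine_categories(corpus_dir, output_name="combined.txt"):
    """Concatenate every <CATEGORY>.txt found in corpus_dir into one file.

    The <CATEGORY>.txt naming is hypothetical. Missing files are skipped.
    Returns the list of category labels whose files were found.
    """
    corpus_dir = Path(corpus_dir)
    found = []
    with open(corpus_dir / output_name, "w", encoding="utf-8") as out:
        for label in CATEGORIES:
            path = corpus_dir / f"{label}.txt"
            if not path.is_file():
                continue
            # Strip surrounding whitespace so files are separated by one newline.
            out.write(path.read_text(encoding="utf-8").strip() + "\n")
            found.append(label)
    return found
```

Calling `combine_categories("path/to/corpus")` would write `combined.txt` into that directory and return the labels that were merged, in the order the categories are listed above.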


Steps to reproduce

1. Identification of Categories
2. Data Collection from Official Sources
3. Download PDF and DOCX Documents
4. Python Script for Merging Documents
5. Cleaning Script
6. Generate output category statistics
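The cleaning and statistics steps are not described in detail here, so the following is only a minimal sketch of what they might look like for text that has already been extracted from the PDF and DOCX documents. The function names `clean_text` and `category_stats`, the specific cleaning rules (lowercasing, removing URLs, keeping only Latin letters and the apostrophe used in forms such as ng'), and the chosen statistics are all assumptions for illustration, not the authors' scripts.

```python
import re
from collections import Counter


def clean_text(raw):
    """Lowercase, drop URLs and non-letter characters, collapse whitespace.

    These rules are illustrative assumptions, not the dataset's actual pipeline.
    """
    text = raw.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"[^a-z'\s]", " ", text)      # keep letters, apostrophes, whitespace
    return re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace


def category_stats(cleaned):
    """Return simple whitespace-token statistics for one category's cleaned text."""
    tokens = cleaned.split()
    counts = Counter(tokens)
    return {
        "tokens": len(tokens),          # total number of tokens
        "types": len(counts),           # number of distinct tokens
        "top": counts.most_common(3),   # three most frequent tokens
    }
```

For example, `category_stats(clean_text(raw_text))` could be run once per category file to produce the per-category figures. Extracting text from PDF and DOCX files (steps 3 and 4) would need third-party libraries and is deliberately left out of this sketch.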


Natural Language Processing, Machine Learning, Swahili Language, Text Processing, Corpus Analysis