Swahili Corpus

Published: 24 January 2024 | Version 2 | DOI: 10.17632/d4yhn5b9n6.2
Contributors:
Noel Masasi, Bernard Masua

Description

The repository contains several text files, each corresponding to one category of the Swahili Corpus. The categories are Health (AFYA), Business and Industries (BIASHARA), Parliament (BUNGE), Religion (DINI), Education (ELIMU), News (HABARI), Agriculture (KILIMO), Social Media (MITANDAO), Non-Governmental Organizations (MASHIRIKA YA KIRAIA), Government (SERIKALI), Laws (SHERIA) and Politics (SIASA). There is also a combined text file for the Swahili Corpus, generated from all of the categories listed.
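As a rough illustration of how the per-category files might be read, the sketch below maps each Swahili category label to its English name and loads whichever files are present. The file naming scheme (`<LABEL>.txt`), the `load_corpus` helper and the `CATEGORIES` name are assumptions made for this example; the actual file names in the repository may differ.

```python
from pathlib import Path

# Category labels as given in the description, mapped to their English names.
CATEGORIES = {
    "AFYA": "Health",
    "BIASHARA": "Business and Industries",
    "BUNGE": "Parliament",
    "DINI": "Religion",
    "ELIMU": "Education",
    "HABARI": "News",
    "KILIMO": "Agriculture",
    "MITANDAO": "Social Media",
    "MASHIRIKA YA KIRAIA": "Non-Governmental Organizations",
    "SERIKALI": "Government",
    "SHERIA": "Laws",
    "SIASA": "Politics",
}


def load_corpus(directory, suffix=".txt"):
    """Read every category file found in `directory`.

    Assumes (hypothetically) one UTF-8 file per category named
    `<LABEL><suffix>`; categories without a matching file are skipped.
    """
    corpus = {}
    for label in CATEGORIES:
        path = Path(directory) / f"{label}{suffix}"
        if path.is_file():
            corpus[label] = path.read_text(encoding="utf-8")
    return corpus
```

Calling `load_corpus("path/to/download")` would return a dictionary keyed by category label, which can then be inspected or concatenated as needed.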

Files

Steps to reproduce

1. Identification of Categories
2. Data Collection from Official Sources
3. Download PDF and DOCX Documents
4. Python Script for Merging Documents
5. Cleaning Script
6. Generate output category statistics
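The merging, cleaning and statistics steps listed above could look roughly like the following. This is a minimal sketch, not the contributors' scripts: it assumes the text has already been extracted from the PDF and DOCX documents into plain strings, and the function names, the cleaning rules and the choice of statistics (token and distinct-word counts) are all assumptions made for illustration.

```python
import re

# Letters only, allowing internal apostrophes as in Swahili "ng'ombe".
WORD_PATTERN = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def clean_text(raw):
    """Assumed cleaning rule: drop unusual symbols, collapse whitespace."""
    text = re.sub(r"[^\w\s.,;:!?'\-]", " ", raw)
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def merge_documents(texts):
    """Clean each non-empty document and join them, one per line."""
    return "\n".join(clean_text(t) for t in texts if t.strip())


def category_statistics(corpus_by_category):
    """Count tokens and distinct word forms for each category."""
    stats = {}
    for name, text in corpus_by_category.items():
        tokens = WORD_PATTERN.findall(text.lower())
        stats[name] = {"tokens": len(tokens), "types": len(set(tokens))}
    return stats


if __name__ == "__main__":
    # Toy input standing in for text extracted from downloaded documents.
    documents = {"AFYA": ["Afya  ni   muhimu.", "", "Ng'ombe wana afya."]}
    merged = {name: merge_documents(texts) for name, texts in documents.items()}
    print(category_statistics(merged))
```

Keeping the cleaning step separate from the merge makes it easy to swap in stricter rules (for example, removing page headers) without touching the statistics code.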

Categories

Natural Language Processing, Machine Learning, Swahili Language, Text Processing, Corpus Analysis

Licence