Toxicity Detection Dataset in Twi Language

Published: 9 January 2026 | Version 2 | DOI: 10.17632/pvrdx7hwhz.2
Contributors:
Gifty Suuk Ali, Patrick Mensah Kwabena, Mighty Abra Ayidzoe, Ben Belklisi Kwame Ayawli

Description

This dataset contains 2,001 text entries labeled for toxicity classification. Each entry is a user-generated comment paired with an assigned toxicity label.

The dataset has two columns:

  • COMMENT: A text field containing comments written primarily in Akan (Twi). The comments include expressions of gratitude, feedback, conversational messages, and general communication typical of social or online interactions.
  • LABEL: A categorical variable indicating whether the comment is 'toxic' or 'non-toxic'. Labels currently present in the dataset: 'non-toxic' (and any others present in the full file, if applicable).

Key features:

  • Total records: 2,001
  • Language: Primarily Akan (Twi)
  • Classification type: Binary toxicity classification
  • Missing values: None (both columns have 2,001 non-null entries)
  • Data types: 'COMMENT' is a string; 'LABEL' is a string

This dataset can support research in:

  • Toxic language detection in low-resource languages
  • Natural Language Processing (NLP) for African languages
  • Machine learning model training for text classification
  • Sociolinguistic analysis of online conversational content

File format: CSV. The file Toxicity_dataset.csv contains two columns: 'COMMENT' and 'LABEL'.
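The properties stated in the description (two columns, 2,001 records, no missing values, a binary label) can be checked before the data is used. The following is a minimal sketch in Python using only the standard library. The function names `summarize` and `summarize_path` are hypothetical, and the UTF-8 encoding is an assumption that the description does not state.

```python
import csv
from collections import Counter

# Column names as given in the dataset description.
EXPECTED_COLUMNS = ["COMMENT", "LABEL"]


def summarize(lines):
    """Summarize CSV content given as an iterable of text lines.

    Returns (row_count, label_counts, rows_with_missing_values).
    """
    reader = csv.DictReader(lines)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")

    row_count = 0
    missing = 0
    label_counts = Counter()
    for row in reader:
        row_count += 1
        # DictReader yields None for fields absent from a short row.
        comment = (row["COMMENT"] or "").strip()
        label = (row["LABEL"] or "").strip()
        if not comment or not label:
            missing += 1
        if label:
            label_counts[label] += 1
    return row_count, label_counts, missing


def summarize_path(path, encoding="utf-8-sig"):
    """Open a CSV file and summarize it.

    The encoding is an assumption; "utf-8-sig" reads UTF-8 with or
    without a byte-order mark.
    """
    with open(path, newline="", encoding=encoding) as handle:
        return summarize(handle)
```

With the file downloaded locally, `summarize_path("Toxicity_dataset.csv")` returns the row count, a per-label count, and the number of rows with an empty field. If the file matches the description, the row count should be 2,001 and the missing count 0. The label counts show which of 'toxic' and 'non-toxic' actually occur, which is worth confirming before training a classifier.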

Files

Institutions

  • University of Energy and Natural Resources

Categories

Toxicity

Licence