Camer-Hate-FR: An Annotated Dataset for Hate Speech Detection in Cameroonian French.

Published: 23 April 2026 | Version 3 | DOI: 10.17632/rjwttgp23m.3
Contributors: Donald Onana

Description

Camer-Hate-FR is a dataset for detecting hate speech in Cameroonian French. It consists of 46,825 messages collected between January and June 2025 from public Cameroonian social media sources, including Facebook pages, YouTube channels, and WhatsApp groups. Existing hate speech detection models, trained primarily on standard European French, perform poorly on Cameroonian data because of local slang, code-switching with English and indigenous languages (Camfranglais), and culture-specific context. This dataset was created to address that gap. Each message was manually annotated by three native speakers as either hateful or non-hateful, and the final label was determined by majority vote. Each entry includes the original text, the annotation counts, the final vote, and the justifications provided by annotators. All data has been fully anonymized to protect user privacy.

The dataset is provided in three versions:
camer_hate_fr_dataset.csv: original Cameroonian French version, with labels haineux / non_haineux
cameroon_hate_speech_UK_English.csv: full translation in British English (spelling: recognise, offence, cancelled), with labels hateful / non_hateful
cameroon_hate_speech_US_English.csv: full translation in American English (spelling: recognize, offense, canceled), with labels hateful / non_hateful

This resource is designed to train, validate, and benchmark machine learning models for content moderation, to support sociolinguistic analysis, and to encourage the development of more inclusive and effective NLP technologies for Francophone Africa.
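As a minimal sketch of how one of the CSV files might be loaded and sanity-checked, the snippet below reads the five columns listed for this dataset and flags any row whose final label disagrees with the majority of its three votes. It uses only Python's standard library. The sample rows are invented placeholders rather than dataset content, and the exact label spellings and file encoding are assumptions to verify against the real files.

```python
import csv
import io

# Column names as listed in the dataset description.
COLUMNS = ["Text", "hateful_vote_count", "non_hateful_vote_count",
           "final_vote", "Final_User_Explanation"]

# Invented placeholder rows standing in for a real file such as
# cameroon_hate_speech_UK_English.csv (this content is NOT from the dataset).
SAMPLE_CSV = """Text,hateful_vote_count,non_hateful_vote_count,final_vote,Final_User_Explanation
placeholder message A,2,1,hateful,placeholder justification
placeholder message B,0,3,non_hateful,placeholder justification
"""


def expected_label(hateful, non_hateful):
    """Majority label for three annotations (a tie is impossible with 3 votes)."""
    if hateful + non_hateful != 3:
        raise ValueError("expected exactly three annotations per message")
    return "hateful" if hateful > non_hateful else "non_hateful"


def find_inconsistent_rows(handle):
    """Return 0-based indices of rows whose final_vote contradicts the vote counts."""
    reader = csv.DictReader(handle)
    missing = set(COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    bad = []
    for i, row in enumerate(reader):
        label = expected_label(int(row["hateful_vote_count"]),
                               int(row["non_hateful_vote_count"]))
        if row["final_vote"].strip() != label:
            bad.append(i)
    return bad


inconsistent = find_inconsistent_rows(io.StringIO(SAMPLE_CSV))
print(inconsistent)
```

For a real file, the in-memory sample would be replaced by something like `open(path, newline="", encoding="utf-8")`, assuming UTF-8 encoding. The French file uses the labels haineux / non_haineux, so `expected_label` would need the corresponding strings there.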

Steps to reproduce

The dataset was created following a multi-step process:

1. Data Collection: Messages were collected from public Cameroonian social media platforms (Facebook, YouTube, WhatsApp) between January and June 2025. Python scripts using libraries such as BeautifulSoup, Selenium, and the YouTube Data API were employed.

2. Data Cleaning: The raw text was preprocessed to remove irrelevant artifacts, including emojis, URLs, user mentions, and excessive special characters. The data was then fully anonymized to protect user privacy.

3. Annotation: A custom web application was used for manual annotation. A team of 20 students from the University of Yaoundé I and volunteer contributors annotated the messages. Each message received three separate annotations, and a textual justification was collected from each annotator.

4. Label Aggregation: The final label (hateful or non_hateful) for each message was determined by a majority vote among the three annotations. The textual justifications from the majority voters were aggregated.

5. Translation into English: To address reviewer requirements and improve accessibility, the full dataset, including messages, labels, and justifications, was translated into two variants of English:
British English (UK): following UK spelling conventions (e.g., recognise, offence, cancelled, behaviour)
American English (US): following US spelling conventions (e.g., recognize, offense, canceled, behavior)

Each translated version is provided as a separate CSV file with the same five-column structure as the original: Text, hateful_vote_count, non_hateful_vote_count, final_vote, Final_User_Explanation.
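The cleaning and label aggregation steps can be illustrated with a short sketch. This is not the authors' pipeline: the regular expressions, the emoji ranges, the `clean_message` and `aggregate` helper names, and the " | " separator used to join justifications are all assumptions chosen for illustration. Only the output column names come from the dataset description.

```python
import re
from collections import Counter

# Illustrative patterns (assumptions, not the authors' actual rules).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# A rough emoji class covering common pictograph blocks; not exhaustive.
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\u2600-\u27BF\U0001F1E6-\U0001F1FF\uFE0F\u200D]"
)
SPACE_RE = re.compile(r"\s+")

LABELS = {"hateful", "non_hateful"}


def clean_message(text):
    """Strip URLs, user mentions, and emojis, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def aggregate(annotations):
    """Majority vote over three (label, justification) pairs.

    Returns the vote counts, the final label, and the justifications
    of the majority voters joined with " | " (hypothetical separator).
    """
    if len(annotations) != 3:
        raise ValueError("expected exactly three annotations")
    for label, _ in annotations:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")
    counts = Counter(label for label, _ in annotations)
    # With two labels and three votes there is always a strict majority.
    final_vote, _ = counts.most_common(1)[0]
    majority_notes = [note for label, note in annotations if label == final_vote]
    return {
        "hateful_vote_count": counts["hateful"],
        "non_hateful_vote_count": counts["non_hateful"],
        "final_vote": final_vote,
        "Final_User_Explanation": " | ".join(majority_notes),
    }


# Invented example input, not taken from the dataset.
cleaned = clean_message("Regarde ça 😂 https://example.com/x @user123 !")
result = aggregate([
    ("hateful", "justification 1"),
    ("non_hateful", "justification 2"),
    ("hateful", "justification 3"),
])
print(cleaned)
print(result)
```

Because each message has exactly three annotations and only two possible labels, a tie cannot occur, which is why no tie-breaking rule is needed in this sketch.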

Categories

Computer Science, Computational Linguistics, Social Media, Natural Language Processing, Machine Learning, Cameroon, Africa Culture, Sentiment Analysis
