Moroccan Darija Offensive Language Detection Dataset

Published: 20 September 2023 | Version 1 | DOI: 10.17632/2y4m97b7dc.1
Contributors:
Anass Ibrahimi, Asmaa Mourhir

Description

The Moroccan Darija offensive language detection dataset is a human-labeled collection of 20,402 Moroccan Darija sentences, each paired with a binary label: 0 for a non-offensive sentence and 1 for an offensive sentence. The sentences were gathered from Twitter and YouTube comments and are written in both Latin and Arabic scripts. Non-offensive sentences account for 62.2% of the dataset (12,685 sentences), while offensive sentences account for 37.8% (7,717 sentences). This contribution addresses the scarcity of labeled datasets for Moroccan Darija and provides a resource for natural language processing researchers working on the language, particularly on offensive language detection and sentiment analysis tasks.
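
The class balance and the mixed-script property described above can be checked with a few lines of Python. The following is a minimal sketch, not the contributors' own tooling: the function names `label_distribution` and `dominant_script` are hypothetical, the two sample sentences are illustrative and not taken from the dataset, and the script check assumes that characters in the basic Arabic Unicode block (U+0600 to U+06FF) are enough to identify Arabic-script text.

```python
from collections import Counter


def label_distribution(labels):
    """Return {label: (count, percentage)} with percentages rounded to one decimal."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, round(100 * n / total, 1)) for label, n in counts.items()}


def dominant_script(sentence):
    """Classify a sentence as 'arabic', 'latin', or 'other' by counting letters.

    Assumption: the basic Arabic block (U+0600..U+06FF) covers Arabic-script
    Darija; digits used in Latin-script Darija (e.g. '3') are ignored.
    """
    arabic = sum(1 for ch in sentence if "\u0600" <= ch <= "\u06FF")
    latin = sum(1 for ch in sentence if ch.isascii() and ch.isalpha())
    if arabic == 0 and latin == 0:
        return "other"
    return "arabic" if arabic >= latin else "latin"


# Rebuild the label counts stated in the description (0 = non-offensive, 1 = offensive).
labels = [0] * 12685 + [1] * 7717
print(label_distribution(labels))  # {0: (12685, 62.2), 1: (7717, 37.8)}

# Illustrative sentences only; they are not drawn from the dataset.
print(dominant_script("salam, labas 3lik?"))  # latin
print(dominant_script("السلام عليكم"))  # arabic
```

With the real data, the same functions could be applied to the label and sentence columns once the files are loaded; the column names are not given in this record, so they are left out here.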

Institutions

Al Akhawayn University in Ifrane

Categories

Natural Language Processing, Arabic Language, Sentiment Analysis