Moroccan Darija Offensive Language Detection Dataset

Published: 24 October 2023 | Version 2 | DOI: 10.17632/2y4m97b7dc.2
Anass Ibrahimi, Asmaa Mourhir


The Moroccan Darija offensive language detection dataset is a human-labeled collection of Moroccan Darija sentences. It contains 20,402 sentences, each with a binary label: 0 for a non-offensive sentence and 1 for an offensive sentence. The sentences were gathered from Twitter and YouTube comments and are written in both Latin and Arabic scripts. Non-offensive sentences account for 62.2% of the dataset (12,685 sentences), and offensive sentences account for 37.8% (7,717 sentences). This contribution addresses the scarcity of labeled datasets for Moroccan Darija and provides a resource for natural language processing researchers working on Moroccan Darija, particularly on offensive language detection and sentiment analysis tasks.

Note: This dataset contains profanity and offensive sentences. Users should be aware of its content before accessing it and should exercise caution when exploring it, especially in educational or sensitive contexts.
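The class balance stated in the description can be checked with a few lines of Python. The following is a minimal sketch, not part of the dataset itself: the helper `label_shares` is a hypothetical name, and the label column is rebuilt from the reported counts rather than read from the published files, whose names and layout are not given here.

```python
from collections import Counter

# Counts as stated in the dataset description.
# 0 = non-offensive, 1 = offensive
REPORTED_COUNTS = {0: 12_685, 1: 7_717}


def label_shares(labels):
    """Return {label: fraction of total} for an iterable of 0/1 labels.

    Hypothetical helper for illustration; not shipped with the dataset.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Stand-in for the real label column, rebuilt from the reported counts.
labels = [0] * REPORTED_COUNTS[0] + [1] * REPORTED_COUNTS[1]
shares = label_shares(labels)

total_sentences = len(labels)
non_offensive_pct = round(100 * shares[0], 1)
offensive_pct = round(100 * shares[1], 1)

print(total_sentences, non_offensive_pct, offensive_pct)
# prints: 20402 62.2 37.8
```

With the actual data, the same helper could be applied to the label column once it is loaded. A split of roughly 62/38 is only moderately imbalanced, so reporting per-class metrics alongside accuracy is a reasonable precaution when evaluating a classifier on it.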



Al Akhawayn University in Ifrane


Natural Language Processing, Arabic Language, Sentiment Analysis