DarNERcorp: a Named Entity Recognition Corpus in the Moroccan Dialect

Published: 2 May 2023 | Version 4 | DOI: 10.17632/286sss4k9v.4
Contributors:
Hanane Nour Mousa, Asmaa Mourhir

Description

DarNERcorp is a manually annotated corpus for Named Entity Recognition (NER) in the Moroccan dialect, Darija. The corpus contains more than 65K tokens, 13.8% of which belong to named entities. Named entities are annotated using the BIO tagging scheme with one of four tags: person (PER), location (LOC), organization (ORG), and miscellaneous (MISC). The named entities are distributed as follows: PER (15.3%), LOC (38.1%), ORG (15.5%), MISC (31.1%). The corpus is provided in the Data folder and is split into two sets: DarNERcorp_train, which holds 80% of the data, and DarNERcorp_test, which holds the remaining 20%. The Python scripts used for data collection and formatting are provided in the Code folder.
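As an illustration of how BIO-tagged data of this kind is typically consumed, the following is a minimal sketch, not the authors' code. It assumes a CoNLL-style layout (one whitespace-separated "token tag" pair per line, with blank lines between sentences); the actual file layout of DarNERcorp_train and DarNERcorp_test is not stated here and should be checked against the Data folder. The function names and the sample lines are hypothetical and are not taken from the corpus.

```python
from collections import Counter


def read_bio(lines):
    """Parse "token tag" lines into sentences (assumed CoNLL-style layout).

    Blank lines are treated as sentence boundaries.
    """
    sentences, current = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        # The tag is assumed to be the last whitespace-separated field.
        token, tag = line.rsplit(maxsplit=1)
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences


def extract_entities(sentence):
    """Group B-/I- tags into (entity_type, text) spans.

    An I- tag that does not continue an open entity of the same type
    is ignored in this sketch.
    """
    entities, tokens, etype = [], [], None
    for token, tag in sentence:
        if tag.startswith("I-") and etype == tag[2:]:
            tokens.append(token)
            continue
        if tokens:
            # Any other tag closes the entity that was being built.
            entities.append((etype, " ".join(tokens)))
            tokens, etype = [], None
        if tag.startswith("B-"):
            tokens, etype = [token], tag[2:]
    if tokens:
        entities.append((etype, " ".join(tokens)))
    return entities


# Made-up sample for illustration only; these lines are NOT from the corpus.
sample = [
    "Asmaa B-PER",
    "Mourhir I-PER",
    "f O",
    "Al B-ORG",
    "Akhawayn I-ORG",
    "University I-ORG",
    "",
    "Ifrane B-LOC",
]

sentences = read_bio(sample)
counts = Counter(
    etype for sent in sentences for etype, _ in extract_entities(sent)
)
print(counts)
```

To apply the sketch to the real files, pass an open file handle to `read_bio` in place of `sample`, after confirming the separator and the tag column. The resulting counter gives a per-type entity count that can be compared against the distribution reported above.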

Institutions

Al Akhawayn University in Ifrane

Categories

Data Mining, Natural Language Processing, Information Extraction, Arabic Language

Licence