DarNERcorp: a Named Entity Recognition Corpus in the Moroccan Dialect

Published: 2 May 2023 | Version 4 | DOI: 10.17632/286sss4k9v.4
Contributors:
Hanane Nour Mousa, Asmaa Mourhir

Description

DarNERcorp is a manually annotated corpus for Named Entity Recognition (NER) in the Moroccan dialect, also known as Darija. The corpus contains more than 65K tokens, 13.8% of which belong to named entities. Named entities are annotated using the BIO tagging scheme with one of four tags: person (PER), location (LOC), organization (ORG), or miscellaneous (MISC). The distribution of named entities across these tags is: PER (15.3%), LOC (38.1%), ORG (15.5%), MISC (31.1%). The corpus is provided in the Data folder and is split into two sets: DarNERcorp_train, which holds 80% of the data, and DarNERcorp_test, which holds the remaining 20%. The Python scripts used for data collection and formatting are provided in the Code folder.
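To illustrate how BIO-tagged annotations are typically read, the following is a minimal sketch that groups a sequence of tags into entity spans. It is not taken from the scripts in the Code folder. The function name bio_to_spans is hypothetical, and the label strings (for example B-PER and I-PER, with a hyphen separator) are an assumption about how the tags are written in the corpus files.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group BIO tags into (entity_type, start, end) spans; end is exclusive.

    Assumes labels of the form "B-TYPE", "I-TYPE", and "O" (an assumption
    about the corpus label format, not confirmed by the description).
    """
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        # The open span continues only on an I- tag of the same type.
        inside = tag.startswith("I-") and current == tag[2:]
        if not inside and current is not None:
            spans.append((current, start, i))
            start, current = None, None
        if tag.startswith("B-"):
            start, current = i, tag[2:]
        elif tag.startswith("I-") and current is None:
            # Lenient handling: an I- tag with no open span starts a new one.
            start, current = i, tag[2:]
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


# Tags only; no corpus text is reproduced here.
example_tags = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-ORG", "I-ORG", "I-ORG"]
example_spans = bio_to_spans(example_tags)
```

With the example tags above, the sketch yields one PER span covering positions 0 to 1, one LOC span at position 3, and one ORG span covering positions 5 to 7. Treating a stray I- tag as the start of a new span is a lenient design choice; a stricter reader could raise an error instead.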

Institutions

  • Al Akhawayn University in Ifrane

Categories

Data Mining, Natural Language Processing, Information Extraction, Arabic Language

Licence