DarNERcorp: a Named Entity Recognition Corpus in the Moroccan Dialect

Name: DarNERcorp: a Named Entity Recognition Corpus in the Moroccan Dialect
Creator: Hanane Nour Mousa
Published: 2023-05-02T06:56:26.993Z
Keywords: Data Mining, Natural Language Processing, Information Extraction, Arabic Language

Mousa, Hanane Nour; Mourhir, Asmaa

doi:10.17632/286sss4k9v.4

Published: 2 May 2023 | Version 4 | DOI: 10.17632/286sss4k9v.4

Description

DarNERcorp is a manually annotated corpus for Named Entity Recognition (NER) in Darija, the Moroccan dialect of Arabic. The corpus contains more than 65K tokens, 13.8% of which are named entities. Each named entity is annotated under the BIO tagging scheme with one of four tags: person (PER), location (LOC), organization (ORG), or miscellaneous (MISC). Named entities are distributed across these tags as follows: PER (15.3%), LOC (38.1%), ORG (15.5%), MISC (31.1%). The corpus is provided in the Data folder and is split into two sets: DarNERcorp_train, which holds 80% of the data, and DarNERcorp_test, which holds the remaining 20%. The Python scripts used for data collection and formatting are provided in the Code folder.
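As an illustration of how BIO tags group tokens into entities, here is a minimal Python sketch. It is not taken from the Code folder: the function names `extract_entities` and `type_distribution` are hypothetical, the tag spelling (`B-PER`, `I-PER`, `O`) is the conventional BIO form and is assumed rather than confirmed for this corpus, and the example sentence is invented for the demonstration.

```python
from collections import Counter


def extract_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) pairs.

    Assumes conventional tags such as "B-PER", "I-PER", and "O";
    the exact tag spelling in the corpus files is an assumption.
    """
    entities = []
    current_type, current_tokens = None, []

    def close_span():
        # Store the open entity, if any, and reset the state.
        nonlocal current_type, current_tokens
        if current_tokens:
            entities.append((current_type, " ".join(current_tokens)))
        current_type, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity.
            close_span()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity,
            # closes the span; the stray token itself is ignored.
            close_span()
    close_span()
    return entities


def type_distribution(entities):
    """Return the percentage of entity mentions per type."""
    counts = Counter(entity_type for entity_type, _ in entities)
    total = sum(counts.values())
    return {t: 100.0 * c / total for t, c in counts.items()}


# Invented example sentence, not drawn from the corpus.
tokens = ["Hanane", "studies", "at", "Al", "Akhawayn", "University", "in", "Ifrane"]
tags = ["B-PER", "O", "O", "B-ORG", "I-ORG", "I-ORG", "O", "B-LOC"]

entities = extract_entities(tokens, tags)
print(entities)
print(type_distribution(entities))
```

This sketch counts entity mentions. Whether the percentages reported above are computed per mention or per token is not stated in the description, so the output of `type_distribution` should not be expected to reproduce them exactly.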

Institutions

Al Akhawayn University in Ifrane
