DarNERcorp: a Named Entity Recognition Corpus in the Moroccan Dialect

Published: 16 March 2023 | Version 2 | DOI: 10.17632/286sss4k9v.2
Contributors: Hanane Nour Mousa, Asmaa Mourhir

Description

DarNERcorp is a manually annotated corpus for Named Entity Recognition (NER) in the Moroccan dialect of Arabic, known as Darija. The corpus contains more than 65,000 tokens, 13.8% of which belong to named entities. Each named entity is annotated with one of four tags, using the BIO tagging scheme: person (PER), location (LOC), organization (ORG), or miscellaneous (MISC). Named entities are distributed across these types as follows: PER (15.3%), LOC (38.1%), ORG (15.5%), MISC (31.1%).
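To illustrate how BIO tags map onto entities, here is a minimal Python sketch that groups a tag sequence into entity spans and computes a per-type distribution. The function names `bio_spans` and `type_distribution` and the sample tag sequence are hypothetical and not taken from the dataset. The sketch also assumes tags are written as `B-PER`, `I-PER`, `O`, and so on, which is the common convention but is not confirmed by this description. It counts whole entities; the description does not say whether the published percentages are counted per entity or per token.

```python
from collections import Counter


def bio_spans(tags):
    """Group a BIO tag sequence into (start, end, type) spans; end is exclusive."""
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags):
        # An I- tag continues the current entity only if the types match.
        if tag.startswith("I-") and etype == tag[2:]:
            continue
        # Any other tag closes the entity that was open, if there was one.
        if etype is not None:
            spans.append((start, i, etype))
            start, etype = None, None
        # B- opens a new entity; a stray I- is treated as an opening tag too.
        if tag.startswith("B-") or tag.startswith("I-"):
            start, etype = i, tag[2:]
    if etype is not None:
        spans.append((start, len(tags), etype))
    return spans


def type_distribution(tags):
    """Return the percentage of entities of each type, rounded to one decimal."""
    counts = Counter(etype for _, _, etype in bio_spans(tags))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {etype: round(100 * n / total, 1) for etype, n in counts.items()}


# Hypothetical tag sequence, for illustration only.
sample_tags = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-ORG", "I-ORG", "B-LOC"]
sample_spans = bio_spans(sample_tags)
sample_distribution = type_distribution(sample_tags)
```

On the sample sequence, `sample_spans` holds four entities (one PER, one ORG, two LOC), so `sample_distribution` reports LOC at 50.0 and PER and ORG at 25.0 each.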

Institutions

Al Akhawayn University in Ifrane

Categories

Data Mining, Natural Language Processing, Information Extraction, Arabic Language

License