DarNERcorp: a Named Entity Recognition Corpus in the Moroccan Dialect
Published: 16 March 2023 | Version 2 | DOI: 10.17632/286sss4k9v.2
Contributors:
Hanane Nour Mousa,
Asmaa Mourhir
Description
DarNERcorp is a manually annotated corpus for Named Entity Recognition (NER) in the Moroccan Dialect (Darija). The corpus contains more than 65K tokens, 13.8% of which are named entities. Each named entity is annotated under the BIO tagging scheme with one of four tags: person (PER), location (LOC), organization (ORG), or miscellaneous (MISC). The named entities are distributed across these tags as follows: PER (15.3%), LOC (38.1%), ORG (15.5%), MISC (31.1%).
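To illustrate how BIO tags map to entities, the following Python sketch groups a tag sequence into entity spans and computes the share of each entity type. It is a minimal example, not the authors' tooling: the function names `bio_to_spans` and `type_distribution` are hypothetical, the sample tag sequence is invented for illustration and not taken from the corpus, and the on-disk file format of DarNERcorp is not assumed here.

```python
from collections import Counter


def bio_to_spans(tags):
    """Group a BIO tag sequence into (entity_type, start, end) spans.

    `end` is exclusive. A stray I- tag with no open entity of the same
    type is treated leniently as the start of a new entity (an assumption
    of this sketch, not a rule stated by the dataset).
    """
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        is_begin = tag.startswith("B-")
        is_inside = tag.startswith("I-")
        # An O tag, a B- tag, or an I- tag of a different type closes the open span.
        if tag == "O" or is_begin or (is_inside and tag[2:] != current):
            if current is not None:
                spans.append((current, start, i))
            start, current = None, None
        # Open a new span on B-, or on an I- tag when nothing is open.
        if is_begin or (is_inside and current is None):
            start, current = i, tag[2:]
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


def type_distribution(spans):
    """Return the percentage of entities per type, rounded to one decimal."""
    counts = Counter(entity_type for entity_type, _, _ in spans)
    total = sum(counts.values())
    return {t: round(100 * n / total, 1) for t, n in counts.items()}


# Illustrative tags only (not real corpus data).
example_tags = ["B-PER", "I-PER", "O", "B-LOC", "O",
                "B-ORG", "I-ORG", "I-ORG", "B-MISC"]
example_spans = bio_to_spans(example_tags)
example_dist = type_distribution(example_spans)
```

Applied to the full corpus, the same counting would be expected to yield shares comparable to those reported above, provided the percentages there are computed per entity rather than per token, which the description does not specify.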
Institutions
Al Akhawayn University in Ifrane
Categories
Data Mining, Natural Language Processing, Information Extraction, Arabic Language