DarNERcorp: a Named Entity Recognition Corpus in the Moroccan Dialect
Description
DarNERcorp is a manually annotated corpus for Named Entity Recognition (NER) in the Moroccan Dialect or Darija. The corpus contains more than 65K tokens, 13.8% of which are named entities. Named entities in the dataset are annotated with one of the following tags, using the BIO tagging scheme: person (PER), location (LOC), organization (ORG), miscellaneous (MISC). The distribution of named entities in the dataset is as follows: PER (15.3%), LOC (38.1%), ORG (15.5%), MISC (31.1%). The corpus is presented in the Data folder and it is split into two sets: DarNERcorp_train and DarNERcorp_test. The first set represents 80% of the data and the second represents 20%. In addition to the data, the Python scripts used in the collection and data formatting are provided in the Code folder.