Dataset of Named Entity Recognition for Uzbek language

Published: 22 October 2024 | Version 1 | DOI: 10.17632/xf7pyvhb2v.1
Contributor:
Davlatyor Mengliev

Description

As part of the study, an annotated corpus of the Uzbek language was created for training and evaluating named entity recognition models. The corpus includes 2,000 sentences (25,865 words) collected from the following sources:
• Part of the data (sentences 1 to 154 in the dataset) was extracted from the publicly available lex.uz database, which contains official texts written in a formal, carefully edited register.
• To increase the number of named entities per sentence and ensure diversity, the author composed additional sentences, each containing several entities of different types. This enriched the corpus with more complex structures and is intended to make model training more effective.
Data annotation was carried out manually using the BIOES scheme (Begin, Inside, Outside, End, Single), which marks both the boundaries and the types of named entities. All annotated sentences were reviewed by Uzbek language experts to ensure accuracy and consistency of the data.
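To illustrate how BIOES tags encode entity boundaries, the following is a minimal Python sketch that converts a tag sequence into entity spans. It is not taken from the dataset: the function name `bioes_to_spans`, the example sentence, and the entity labels `PER` and `LOC` are assumptions made for illustration, and the actual file layout and label set of the corpus may differ.

```python
from typing import List, Tuple


def bioes_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Convert BIOES tags to (start, end_inclusive, entity_type) spans.

    Malformed sequences (for example an I- or E- tag with no open B- tag)
    are skipped rather than repaired.
    """
    spans = []
    start = None         # index of the currently open B- tag, if any
    current_type = None  # entity type of the currently open span
    for i, tag in enumerate(tags):
        prefix, _, etype = tag.partition("-")
        if prefix == "S":
            # Single-token entity: opens and closes on the same token.
            spans.append((i, i, etype))
            start, current_type = None, None
        elif prefix == "B":
            # Beginning of a multi-token entity.
            start, current_type = i, etype
        elif prefix == "I":
            # Inside: valid only if it continues an open span of the same type.
            if start is None or etype != current_type:
                start, current_type = None, None
        elif prefix == "E":
            # End: closes the open span if the types match.
            if start is not None and etype == current_type:
                spans.append((start, i, etype))
            start, current_type = None, None
        else:
            # "O" (outside) or anything unrecognized discards any open span.
            start, current_type = None, None
    return spans


# Hypothetical example sentence ("Alisher Navoiy was born in Herat").
tokens = ["Alisher", "Navoiy", "Hirotda", "tug'ilgan"]
tags = ["B-PER", "E-PER", "S-LOC", "O"]
for start, end, etype in bioes_to_spans(tags):
    print(etype, " ".join(tokens[start:end + 1]))
```

Compared with the simpler BIO scheme, the explicit E- and S- tags let a model learn where an entity ends and when it consists of a single token, which is the boundary detail the description refers to.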

Categories

Natural Language Processing, Uzbekistan, Database, Language

Licence