UzNER-5Style Corpus
Description
As part of the study, an annotated corpus of the Uzbek language was created for training and evaluating named entity recognition (NER) models. The corpus consists of 5,000 sentences (65,608 words) compiled from the following sources:
• Legislative and official documents: part of the data was extracted from the publicly available lex.uz database, which contains official and normative legal texts.
• Mass media sources: to enrich the corpus, materials were collected from the online news platform kun.uz and from transcripts of youtube.com videos covering various topics.
• Literary sources: excerpts from the novel “Kecha va kunduz” were selected to represent the literary style.
• Scientific sources: texts in the scientific style were obtained from conference proceedings and academic collections.
• Synthetic data: synthetic sentences in the scientific style were generated to further diversify the corpus.

The collected data were organized according to five major functional styles: colloquial, official, scientific, literary, and publicistic. This organization ensures thematic and stylistic diversity in the corpus and supports more effective model training.

Data annotation was performed manually using the BIOES tagging scheme (Begin, Inside, Outside, End, Single), which marks both the boundaries and the types of named entities. All annotated data were reviewed and validated by Uzbek language experts to ensure accuracy and consistency.
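
To illustrate how the BIOES scheme encodes entity boundaries, the following minimal Python sketch converts entity spans to BIOES tags and back. It is not the tooling used to build the corpus. The function names, the example sentence, and the entity labels `PER` and `LOC` are assumptions made for illustration; the corpus's actual label set and file format are not described here.

```python
def spans_to_bioes(n_tokens, spans):
    """Convert (start, end, label) spans, end exclusive, into BIOES tags."""
    tags = ["O"] * n_tokens  # every token starts outside any entity
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"S-{label}"  # single-token entity
        else:
            tags[start] = f"B-{label}"  # first token of the entity
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"  # interior tokens
            tags[end - 1] = f"E-{label}"  # last token of the entity
    return tags


def bioes_to_spans(tags):
    """Recover (start, end, label) spans from a BIOES tag sequence.

    Malformed sequences (for example an E- tag without a matching B- tag)
    are skipped rather than repaired.
    """
    spans = []
    open_start, open_label = None, None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":
            spans.append((i, i + 1, label))
            open_start, open_label = None, None
        elif prefix == "B":
            open_start, open_label = i, label
        elif prefix == "E":
            if open_start is not None and open_label == label:
                spans.append((open_start, i + 1, label))
            open_start, open_label = None, None
        elif prefix == "O":
            open_start, open_label = None, None
        # "I" tags simply continue the currently open entity
    return spans


# Hypothetical example sentence and labels, not taken from the corpus.
tokens = ["Abdulla", "Qodiriy", "Toshkent", "shahrida", "tug'ilgan"]
entity_spans = [(0, 2, "PER"), (2, 3, "LOC")]

tags = spans_to_bioes(len(tokens), entity_spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

In this sketch the two-token name receives `B-PER` and `E-PER`, and the one-token place name receives `S-LOC`. Because entity ends and single-token entities are marked explicitly, the original spans can be recovered from the tags alone with `bioes_to_spans(tags)`.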