25000-UzNER-5Style Corpus
Description
As part of this study, an annotated Uzbek-language corpus was developed for training and evaluating named entity recognition (NER) models. The final corpus consists of 25,000 sentences and 285,736 tokens collected from different Uzbek-language sources and organized according to five functional text styles.
The corpus was compiled from the following sources:
• Legislative and official documents: A portion of the data was extracted from the publicly available Lex.uz database, which contains official and normative legal texts.
• Mass media sources: To enrich the corpus with publicistic and news-style texts, materials were collected from the Kun.uz online news platform and from transcripts of YouTube videos covering various social, political, educational, and cultural topics.
• Literary sources: Excerpts from the novel “Kecha va kunduz” and other literary-style texts were selected to represent the literary style of the Uzbek language.
• Scientific sources: Scientific-style texts were obtained from conference proceedings, academic collections, research papers, and educational materials.
• Colloquial sources: Conversational-style sentences were collected from spoken-language materials, podcast transcripts, and informal communication contexts in order to reflect natural everyday Uzbek usage.
• Synthetic data: In addition, quality-controlled synthetic sentences were included where necessary to preserve style balance, diversify entity occurrences, and improve the coverage of rare named entity types.
The collected data were organized into five major functional styles: official, colloquial, scientific, publicistic, and literary. Each style contains 5,000 sentences, resulting in a balanced multi-style corpus. This structure ensures both thematic and stylistic diversity and makes the dataset suitable for evaluating NER models across different types of Uzbek texts.
Data annotation was performed using the BIOES tagging scheme, which allows more precise identification of named entity boundaries compared with the standard BIO scheme. The corpus includes a wide range of entity categories, such as person names, organizations, geopolitical units, locations, dates, laws, numerical expressions, products, facilities, ranks, and other domain-specific entities. All annotated data were checked and validated to ensure consistency, correctness, and compliance with the annotation guidelines. The final version of the corpus was prepared as a balanced gold-standard dataset for training, testing, and comparing Uzbek-language NER models.
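To illustrate how BIOES marks entity boundaries, the following minimal sketch converts entity spans into BIOES tags. The example sentence, the label names PER and LOC, and the function name spans_to_bioes are assumptions made for illustration only; they are not taken from the corpus or its annotation guidelines.

```python
from typing import List, Tuple

# (start index, end index exclusive, entity label); labels here are hypothetical
Span = Tuple[int, int, str]


def spans_to_bioes(n_tokens: int, spans: List[Span]) -> List[str]:
    """Encode non-overlapping entity spans as BIOES tags."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if end - start == 1:
            # Single-token entity
            tags[start] = f"S-{label}"
        else:
            # Multi-token entity: Begin, Inside..., End
            tags[start] = f"B-{label}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"
            tags[end - 1] = f"E-{label}"
    return tags


# Illustrative sentence (not taken from the corpus)
tokens = ["Abdulla", "Qodiriy", "Toshkentda", "tug'ilgan", "."]
spans = [(0, 2, "PER"), (2, 3, "LOC")]

for token, tag in zip(tokens, spans_to_bioes(len(tokens), spans)):
    print(f"{token}\t{tag}")
```

Unlike BIO, the E- and S- prefixes mark the last token of an entity and single-token entities explicitly, so an entity's right boundary can be read from its own tag rather than inferred from the tag that follows.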
Steps to reproduce
The dataset was created using a structured workflow that included data collection, preprocessing, annotation, validation, and evaluation.
Data collection: Uzbek-language texts were collected from official documents, news websites, scientific publications, literary works, spoken-language materials, and quality-controlled synthetic examples. The corpus was organized into five functional styles (official, colloquial, scientific, publicistic, and literary) with 5,000 sentences in each style, giving a balanced corpus of 25,000 sentences.
Preprocessing: The collected texts were cleaned, normalized, segmented into sentences, and tokenized. During preprocessing, duplicate sentences, incomplete fragments, inconsistent apostrophes, and mixed Cyrillic-Latin characters were removed or corrected.
Annotation and validation: Named entities were annotated manually using the BIOES tagging scheme, which allows accurate identification of both entity boundaries and entity types. The annotated data were then reviewed and validated according to predefined annotation guidelines. The final corpus was prepared in spreadsheet and CSV formats.
Evaluation: For experimental evaluation, the dataset was split into 20,000 training sentences and 5,000 testing sentences while preserving the balance of the five styles. Baseline experiments were conducted using a gazetteer, a CRF model, and a context-based lookup algorithm. The models were evaluated using entity-level precision, recall, and F1-score.
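The entity-level precision, recall, and F1-score mentioned above could be computed from BIOES tag sequences along the following lines. This is a sketch under stated assumptions, not the evaluation script used in the study: it assumes exact-match scoring (an entity counts as correct only if both its boundaries and its type match) and micro-averaging over all entities, neither of which is specified in the description. The function names and the PER/LOC labels are hypothetical.

```python
from typing import List, Set, Tuple

Entity = Tuple[int, int, str]  # (start, end exclusive, label)


def bioes_to_spans(tags: List[str]) -> Set[Entity]:
    """Extract well-formed entity spans from one BIOES tag sequence."""
    spans: Set[Entity] = set()
    start, label = None, None
    for i, tag in enumerate(tags):
        prefix, _, lab = tag.partition("-")
        if prefix == "S":
            spans.add((i, i + 1, lab))
            start = None
        elif prefix == "B":
            start, label = i, lab
        elif prefix == "I":
            # An I- tag that does not continue the open entity invalidates it
            if start is None or lab != label:
                start = None
        elif prefix == "E":
            if start is not None and lab == label:
                spans.add((start, i + 1, lab))
            start = None
        else:  # "O" closes any unfinished entity
            start = None
    return spans


def entity_prf(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Micro-averaged exact-match precision, recall, and F1 (assumed scoring)."""
    tp = n_gold = n_pred = 0
    for g_tags, p_tags in zip(gold, pred):
        g_spans, p_spans = bioes_to_spans(g_tags), bioes_to_spans(p_tags)
        tp += len(g_spans & p_spans)
        n_gold += len(g_spans)
        n_pred += len(p_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: the prediction finds the person but misses the location
gold = [["B-PER", "E-PER", "S-LOC", "O", "O"]]
pred = [["B-PER", "E-PER", "O", "O", "O"]]
precision, recall, f1 = entity_prf(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Scoring whole entities rather than individual tokens means that a prediction with a wrong boundary or a wrong type earns no partial credit, which is the stricter and more common convention for NER evaluation.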