UzThemeLex Dataset: An Uzbek Thematic Lexicon for Domain Terminology and Weakly Supervised NER
Description
UzThemeLex is a curated Uzbek-language thematic lexicon designed for domain terminology mining and weakly supervised named entity recognition (NER). The release contains 4,945 unique terminological entries organized into 3 top-level domains (Agronomy, Economics and Business, Law and Governance) and 30 subcategories. Each entry provides the Uzbek term in Latin script, a normalized form for matching, a paraphrased Uzbek definition, domain and subcategory labels, provenance pointers to authoritative sources, and lightweight quality-control signals (a heuristic confidence score, a review flag, and an ambiguity flag). Optional fields include aliases and example sentences.
The dataset is distributed in multiple formats to support both manual inspection and machine processing. It includes a flat CSV file and a multi-sheet Excel workbook, together with a data dictionary that documents all columns and label sets. For training and pipeline integration, the release also provides JSON/JSONL exports, taxonomy metadata, and ready-to-use pattern files for dictionary-based tagging and weak supervision (e.g., spaCy EntityRuler patterns). A validation script is included to help users verify schema consistency and detect formatting issues (e.g., residual Cyrillic characters and inconsistently normalized apostrophes).
UzThemeLex can be used as (i) a domain dictionary for keyword-based classification and information extraction in Uzbek texts and (ii) a gazetteer for generating weak labels to train or fine-tune NER models. The resource is intended to support Uzbek NLP research and applied text analytics in the agriculture, economics, and legal/governance domains.
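The gazetteer use case (ii) can be illustrated with a minimal sketch of dictionary-based weak labeling. Everything specific in it is an assumption made for illustration: the four lexicon entries are toy examples and are not taken from the release, the label names (AGRONOMY, ECONOMICS, LAW) are placeholders, and the normalization rule (lowercasing plus mapping apostrophe variants to an ASCII apostrophe) is only one plausible choice, not necessarily the one used to build the dataset's normalized form.

```python
import re

# Apostrophe-like characters often seen in Uzbek Latin text (as in "oʻ", "gʻ").
APOSTROPHES = "\u02bb\u02bc\u2018\u2019`'"
_APOS_RE = re.compile("[" + re.escape(APOSTROPHES) + "]")


def normalize(text):
    """Lowercase and map every apostrophe variant to a plain ASCII apostrophe."""
    return _APOS_RE.sub("'", text).lower()


# Toy entries for illustration only; they are NOT taken from the release.
# Keys stand in for the normalized form, values for the domain label.
LEXICON = {
    normalize("sugʻorish"): "AGRONOMY",
    normalize("qoʻshilgan qiymat soligʻi"): "ECONOMICS",
    normalize("soliq"): "ECONOMICS",
    normalize("qonun"): "LAW",
}


def weak_label(text, lexicon):
    """Return sorted (start, end, label) spans over the normalized text.

    Longer terms are matched first, and a shorter term is skipped
    wherever it would overlap a span that has already been claimed.
    """
    norm = normalize(text)
    taken = [False] * len(norm)
    spans = []
    for term in sorted(lexicon, key=len, reverse=True):
        # Require non-word characters (or string edges) around the term.
        pattern = re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)")
        for m in pattern.finditer(norm):
            if not any(taken[m.start():m.end()]):
                spans.append((m.start(), m.end(), lexicon[term]))
                for i in range(m.start(), m.end()):
                    taken[i] = True
    return sorted(spans)


# Made-up example sentence; offsets index the normalized string.
text = "Yangi qonun qoʻshilgan qiymat soligʻi va soliq imtiyozlarini belgilaydi."
for start, end, label in weak_label(text, LEXICON):
    print(label, normalize(text)[start:end])
```

Because the sketch matches whole words only, inflected forms such as "qonunlar" are not tagged. Uzbek is agglutinative, so a practical pipeline would likely need suffix-aware matching or lemmatization on top of this.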
Steps to reproduce
1. Download the files from this repository.
2. Open UzThemeLex_Combined_4945_v3_1_clean.csv (UTF-8 with BOM) or the Excel workbook UzThemeLex_Combined_4945_v3_1_clean.xlsx.
3. Use UzThemeLex_data_dictionary_v3_1.xlsx to interpret column meanings, label sets, and required/optional fields.
4. (Optional) Run the validation script with python validate_uzthemelex_v3_1.py to check schema consistency, label validity, and transliteration/normalization constraints.
5. (Optional) For dictionary-based tagging or weak supervision, load the spaCy EntityRuler patterns: spacy_entityruler_patterns_domain.jsonl (3-domain labels) or spacy_entityruler_patterns_subcat.jsonl (30-subcategory labels).
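As a rough illustration of the CSV step and of the kind of formatting checks mentioned above, the following sketch reads a BOM-prefixed CSV and flags residual Cyrillic characters and non-canonical apostrophes. It is not the bundled validation script, whose actual checks are not reproduced here. The column name term_uz is hypothetical (the data dictionary lists the real column names), the sample rows are invented, and treating U+02BB as the canonical apostrophe is an assumption.

```python
import csv
import io
import re

# Basic Cyrillic block; any hit in a Latin-script field is suspicious.
CYRILLIC_RE = re.compile("[\u0400-\u04FF]")
# Apostrophe variants other than U+02BB. Treating U+02BB as the
# canonical form is an assumption, not a documented property of the release.
NONCANONICAL_APOS_RE = re.compile("[\u2018\u2019\u02bc`']")


def read_rows(source):
    """Read CSV rows as dicts; `source` is a file path or an open text stream."""
    if hasattr(source, "read"):
        return list(csv.DictReader(source))
    # "utf-8-sig" strips the BOM so the first header is not prefixed by U+FEFF.
    with open(source, encoding="utf-8-sig", newline="") as f:
        return list(csv.DictReader(f))


def find_issues(rows, column):
    """Return (line_number, issue_kind, value) for every suspicious cell."""
    issues = []
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        value = row.get(column) or ""
        if CYRILLIC_RE.search(value):
            issues.append((line_no, "cyrillic", value))
        if NONCANONICAL_APOS_RE.search(value):
            issues.append((line_no, "apostrophe", value))
    return issues


# Invented sample rows; "term_uz" is a hypothetical column name.
sample = io.StringIO(
    "term_uz,domain\n"
    "sug\u02bborish,Agronomy\n"
    "солиқ,Economics and Business\n"
    "qo'shimcha,Economics and Business\n"
)
issues = find_issues(read_rows(sample), "term_uz")
for line_no, kind, value in issues:
    print(line_no, kind, value)
```

To run the same checks on the release, pass the CSV path instead of the in-memory sample, for example read_rows("UzThemeLex_Combined_4945_v3_1_clean.csv"), and replace term_uz with the actual column name from the data dictionary.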
Institutions
- Novosibirsk State University, Novosibirsk Oblast, Novosibirsk
- Urgench State University, Xorazm Region, Urgench