Dataset for Sentiment and Named Entity Analysis in Uzbek Texts

Published: 23 January 2026 | Version 3 | DOI: 10.17632/y2d5pcyrzz.3
Contributor:
Bobur Saidov

Description

This dataset contains 15,000 synthetically generated Uzbek sentences annotated for sentiment (positive/neutral/negative) and named entities in three categories: PER, ORG, and LOC. It includes two subsets: the Hybrid Synthetic Corpus (12,000 sentences), generated via templates with lexical polarity resources and curated NER gazetteers, and the Manual-Style Synthetic Corpus (3,000 sentences), created using short natural-style patterns with higher emoji frequency to reflect conversational usage. Each record provides: id, text, sentiment, entities (JSON), entity_type (JSON aligned with entities), polarity_score, polarity_source, token_count, and emojis (JSON). Emoji presence is ~30% in the hybrid subset and ~39% in the manual-style subset, with emojis grouped into positive/neutral/negative classes. The dataset is released in CSV, XLSX, and JSONL (UTF-8) and distributed under CC BY 4.0.

Files

Steps to reproduce

Environment

Python 3.10+ recommended. Install dependencies:

pip install pandas numpy openpyxl

Download files

Download the dataset package and extract it. The package contains:

- uz_synthetic_sentiment_ner_12000_v3_final.(csv|xlsx|jsonl)
- uz_manual_style_dataset_3000_v3_final.(csv|xlsx|jsonl)
- schema_v3.json, DATA_DICTIONARY_v3.xlsx
- validate_dataset.py (validation script), README_v3.md

Validate the release

Run:

python validate_dataset.py

The script checks: schema completeness, allowed values, JSON validity for list-typed fields (entities, entity_type, emojis), length consistency between entities and entity_type, and emoji metadata consistency (e.g., emoji_position="none" when no emoji is present).

Reproduce data generation (high-level workflow)

a) Load curated resources (polarity lexicon + NER gazetteers for PER/ORG/LOC).
b) Generate the Hybrid Synthetic Corpus (12,000) using templates with controlled sentiment balancing and entity insertion.
c) Generate the Manual-Style Corpus (3,000) using short natural-style patterns with higher emoji frequency.
d) Compute polarity_score and assign sentiment using lexicon-based scoring plus rule-based modifiers (negation/intensifiers).
e) Export to CSV/XLSX/JSONL following schema_v3.json.

Quick sanity checks

- Confirm total size: 12,000 + 3,000 = 15,000 records.
- Verify the label sets: sentiment ∈ {positive, neutral, negative}; entity labels PER/ORG/LOC.
- Spot-check a small sample to confirm entity spans/labels and emoji fields are consistent.
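The release ships its own validate_dataset.py; the core of the checks it describes (allowed sentiment values, JSON validity of list-typed fields, entities/entity_type length consistency) can be sketched roughly as follows. The toy records and the exact problem messages are made up for illustration and do not reflect the real script's output.

```python
import json
import pandas as pd

ALLOWED_SENTIMENTS = {"positive", "neutral", "negative"}
ALLOWED_ENTITY_LABELS = {"PER", "ORG", "LOC"}

def check_record(record):
    """Return a list of problems found in one record (empty list = clean)."""
    problems = []
    if record["sentiment"] not in ALLOWED_SENTIMENTS:
        problems.append("sentiment outside the allowed label set")
    try:
        # List-typed fields are stored as JSON strings in the CSV export.
        entities = json.loads(record["entities"])
        entity_types = json.loads(record["entity_type"])
        json.loads(record["emojis"])
    except (TypeError, ValueError):
        return problems + ["invalid JSON in a list-typed field"]
    if len(entities) != len(entity_types):
        problems.append("entities / entity_type length mismatch")
    if not set(entity_types) <= ALLOWED_ENTITY_LABELS:
        problems.append("unknown entity label")
    return problems

# Toy rows standing in for the released CSV (values are hypothetical).
df = pd.DataFrame([
    {"sentiment": "positive", "entities": '["Toshkent"]',
     "entity_type": '["LOC"]', "emojis": '["😊"]'},
    {"sentiment": "great", "entities": '["Aziz", "UzAuto"]',
     "entity_type": '["PER"]', "emojis": "[]"},
])
report = {i: p for i, row in df.iterrows() if (p := check_record(row))}
print(report)  # only the second (malformed) record is flagged
```

Running the real script against the full files should simply report zero problematic rows.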
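Step b) of the workflow (template-based generation with entity insertion) can be sketched as below. The templates and gazetteer entries here are invented placeholders; the actual curated resources ship with the dataset package.

```python
import random

# Illustrative templates and gazetteer entries (hypothetical, not the
# curated resources distributed with the release).
TEMPLATES = {
    "positive": ["{PER} {LOC}da ajoyib xizmat ko'rsatdi."],
    "negative": ["{ORG} xizmati juda yomon edi."],
}
GAZETTEERS = {
    "PER": ["Aziz"],
    "ORG": ["UzAuto"],
    "LOC": ["Toshkent"],
}

def generate(sentiment, rng=random):
    """Fill one sentiment-labelled template with gazetteer entities."""
    text = rng.choice(TEMPLATES[sentiment])
    entities, labels = [], []
    for label, names in GAZETTEERS.items():
        placeholder = "{" + label + "}"
        if placeholder in text:
            name = rng.choice(names)
            text = text.replace(placeholder, name)
            # Record the inserted span and its aligned label, as in the
            # released entities / entity_type fields.
            entities.append(name)
            labels.append(label)
    return {"text": text, "sentiment": sentiment,
            "entities": entities, "entity_type": labels}
```

Balancing sentiment classes then reduces to drawing equal numbers of records per sentiment key before export.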
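Step d) (lexicon-based scoring with rule-based negation/intensifier modifiers) could look roughly like this minimal sketch. The lexicon entries, modifier words, and the 0.1 neutrality threshold are assumptions for illustration; the released polarity lexicon and exact scoring rules are the authoritative source.

```python
# Toy polarity lexicon and modifier sets (hypothetical entries).
LEXICON = {"yaxshi": 1.0, "ajoyib": 1.0, "yomon": -1.0}
NEGATORS = {"emas"}           # Uzbek negation follows the word: "yaxshi emas"
INTENSIFIERS = {"juda": 1.5}  # "juda yaxshi" ~ "very good"

def polarity_score(tokens):
    """Sum lexicon polarities, applying intensifier and negation rules."""
    score = 0.0
    for i, tok in enumerate(tokens):
        base = LEXICON.get(tok)
        if base is None:
            continue
        # Intensifier immediately before the polar word scales it.
        if i > 0 and tokens[i - 1] in INTENSIFIERS:
            base *= INTENSIFIERS[tokens[i - 1]]
        # Post-posed negator immediately after flips the sign.
        if i + 1 < len(tokens) and tokens[i + 1] in NEGATORS:
            base = -base
        score += base
    return score

def sentiment(tokens, eps=0.1):
    """Map the score to the three-way label set (threshold is assumed)."""
    s = polarity_score(tokens)
    return "positive" if s > eps else "negative" if s < -eps else "neutral"
```

For example, sentiment(["juda", "yaxshi"]) yields "positive", while the negated sentiment(["yaxshi", "emas"]) yields "negative".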

Categories

Social Sciences, Linguistics, Computer Science, Software Engineering, Computational Linguistics, Data Science

Licence