Dataset for Sentiment and Named Entity Analysis in Uzbek Texts

Published: 23 January 2026 | Version 3 | DOI: 10.17632/y2d5pcyrzz.3
Contributor:
Bobur Saidov

Description

This dataset contains 15,000 synthetically generated Uzbek sentences annotated for sentiment (positive/neutral/negative) and named entities in three categories: PER, ORG, and LOC. It includes two subsets: the Hybrid Synthetic Corpus (12,000 sentences), generated via templates with lexical polarity resources and curated NER gazetteers, and the Manual-Style Synthetic Corpus (3,000 sentences), created using short natural-style patterns with higher emoji frequency to reflect conversational usage. Each record provides: id, text, sentiment, entities (JSON), entity_type (JSON aligned with entities), polarity_score, polarity_source, token_count, and emojis (JSON). Emoji presence is ~30% in the hybrid subset and ~39% in the manual-style subset, with emojis grouped into positive/neutral/negative classes. The dataset is released in CSV, XLSX, and JSONL (UTF-8) and distributed under CC BY 4.0.

Files

Steps to reproduce

Environment

Python 3.10+ recommended. Install dependencies:

pip install pandas numpy openpyxl

Download files

Download the dataset package and extract it. The package contains:

- uz_synthetic_sentiment_ner_12000_v3_final.(csv|xlsx|jsonl)
- uz_manual_style_dataset_3000_v3_final.(csv|xlsx|jsonl)
- schema_v3.json, DATA_DICTIONARY_v3.xlsx
- validate_dataset.py (validation script), README_v3.md

Validate the release

Run:

python validate_dataset.py

The script checks: schema completeness, allowed values, JSON validity for list-typed fields (entities, entity_type, emojis), length consistency between entities and entity_type, and emoji metadata consistency (e.g., emoji_position="none" when no emoji is present).

Reproduce data generation (high-level workflow)

a) Load curated resources (polarity lexicon + NER gazetteers for PER/ORG/LOC).
b) Generate the Hybrid Synthetic Corpus (12,000) using templates with controlled sentiment balancing and entity insertion.
c) Generate the Manual-Style Corpus (3,000) using short natural-style patterns with higher emoji frequency.
d) Compute polarity_score and assign sentiment using lexicon-based scoring plus rule-based modifiers (negation/intensifiers).
e) Export to CSV/XLSX/JSONL following schema_v3.json.

Quick sanity checks

- Confirm total size: 12,000 + 3,000 = 15,000 records.
- Verify the label sets: sentiment ∈ {positive, neutral, negative}; entity labels PER/ORG/LOC.
- Spot-check a small sample to confirm entity spans/labels and emoji fields are consistent.
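The release ships its own validate_dataset.py; the core of the checks it describes (allowed sentiment values, JSON validity of list-typed fields, entities/entity_type length consistency) can be sketched roughly as follows. The toy records and the exact problem messages are made up for illustration and do not reflect the real script's output.

```python
import json
import pandas as pd

ALLOWED_SENTIMENTS = {"positive", "neutral", "negative"}
ALLOWED_ENTITY_LABELS = {"PER", "ORG", "LOC"}

def check_record(record):
    """Return a list of problems found in one record (empty list = clean)."""
    problems = []
    if record["sentiment"] not in ALLOWED_SENTIMENTS:
        problems.append("sentiment outside the allowed label set")
    try:
        # List-typed fields are stored as JSON strings in the CSV export.
        entities = json.loads(record["entities"])
        entity_types = json.loads(record["entity_type"])
        json.loads(record["emojis"])
    except (TypeError, ValueError):
        return problems + ["invalid JSON in a list-typed field"]
    if len(entities) != len(entity_types):
        problems.append("entities / entity_type length mismatch")
    if not set(entity_types) <= ALLOWED_ENTITY_LABELS:
        problems.append("unknown entity label")
    return problems

# Toy rows standing in for the released CSV (values are hypothetical).
df = pd.DataFrame([
    {"sentiment": "positive", "entities": '["Toshkent"]',
     "entity_type": '["LOC"]', "emojis": '["😊"]'},
    {"sentiment": "great", "entities": '["Aziz", "UzAuto"]',
     "entity_type": '["PER"]', "emojis": "[]"},
])
report = {i: p for i, row in df.iterrows() if (p := check_record(row))}
print(report)  # only the second (malformed) record is flagged
```

Running the real script against the full files should simply report zero problematic rows.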
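Step b) of the workflow (template-based generation with entity insertion) can be sketched as below. The templates and gazetteer entries here are invented placeholders; the actual curated resources ship with the dataset package.

```python
import random

# Illustrative templates and gazetteer entries (hypothetical, not the
# curated resources distributed with the release).
TEMPLATES = {
    "positive": ["{PER} {LOC}da ajoyib xizmat ko'rsatdi."],
    "negative": ["{ORG} xizmati juda yomon edi."],
}
GAZETTEERS = {
    "PER": ["Aziz"],
    "ORG": ["UzAuto"],
    "LOC": ["Toshkent"],
}

def generate(sentiment, rng=random):
    """Fill one sentiment-labelled template with gazetteer entities."""
    text = rng.choice(TEMPLATES[sentiment])
    entities, labels = [], []
    for label, names in GAZETTEERS.items():
        placeholder = "{" + label + "}"
        if placeholder in text:
            name = rng.choice(names)
            text = text.replace(placeholder, name)
            # Record the inserted span and its aligned label, as in the
            # released entities / entity_type fields.
            entities.append(name)
            labels.append(label)
    return {"text": text, "sentiment": sentiment,
            "entities": entities, "entity_type": labels}
```

Balancing sentiment classes then reduces to drawing equal numbers of records per sentiment key before export.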
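Step d) (lexicon-based scoring with rule-based negation/intensifier modifiers) could look roughly like this minimal sketch. The lexicon entries, modifier words, and the 0.1 neutrality threshold are assumptions for illustration; the released polarity lexicon and exact scoring rules are the authoritative source.

```python
# Toy polarity lexicon and modifier sets (hypothetical entries).
LEXICON = {"yaxshi": 1.0, "ajoyib": 1.0, "yomon": -1.0}
NEGATORS = {"emas"}           # Uzbek negation follows the word: "yaxshi emas"
INTENSIFIERS = {"juda": 1.5}  # "juda yaxshi" ~ "very good"

def polarity_score(tokens):
    """Sum lexicon polarities, applying intensifier and negation rules."""
    score = 0.0
    for i, tok in enumerate(tokens):
        base = LEXICON.get(tok)
        if base is None:
            continue
        # Intensifier immediately before the polar word scales it.
        if i > 0 and tokens[i - 1] in INTENSIFIERS:
            base *= INTENSIFIERS[tokens[i - 1]]
        # Post-posed negator immediately after flips the sign.
        if i + 1 < len(tokens) and tokens[i + 1] in NEGATORS:
            base = -base
        score += base
    return score

def sentiment(tokens, eps=0.1):
    """Map the score to the three-way label set (threshold is assumed)."""
    s = polarity_score(tokens)
    return "positive" if s > eps else "negative" if s < -eps else "neutral"
```

For example, sentiment(["juda", "yaxshi"]) yields "positive", while the negated sentiment(["yaxshi", "emas"]) yields "negative".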

Categories

Social Sciences, Linguistics, Computer Science, Software Engineering, Computational Linguistics, Data Science

Licence