A Multilingual Dataset of YouTube Comments on Global Conflicts (2024–2025) Across 138 Languages

Name: A Multilingual Dataset of YouTube Comments on Global Conflicts (2024–2025) Across 138 Languages
Creator: Umair Ali Khan
Published: 2026-06-01T15:08:51.887Z
Keywords: Social Media, Natural Language Processing, Machine Learning, Deep Learning, Sentiment Analysis, Large Language Model

doi:10.17632/bms263ms68.1

Published: 1 June 2026 | Version 1 | DOI: 10.17632/bms263ms68.1

Contributor: Umair Ali Khan

Description

This dataset contains 79,051 YouTube comments and replies, collected in 2025 from publicly available videos on seven global conflict topics, including the Israel-Hamas-Palestine conflict and the Ukraine-Russia war. The corpus spans 138 detected languages, with English (55,567 comments), Turkish (4,026), Russian (2,916), and Arabic (395) being the most represented. The data were collected using the YouTube Data API and Selenium-based scraping. The dataset includes both top-level comments and nested replies, each with a relative timestamp, a conflict topic label, and an automatically detected language tag. Author usernames have been removed to comply with privacy regulations (KVKK/GDPR). The corpus is intended as an open resource for multilingual conflict sentiment analysis, misinformation detection, and cross-cultural public opinion research.
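To put the reported language counts in proportion, the short sketch below converts them into shares of the full corpus. It uses only the figures stated in the description; the variable and function names are illustrative and are not part of the dataset, and the "other" figure is simply the remainder after the four listed languages, not a value published with the data.

```python
# Figures quoted in the dataset description.
TOTAL_COMMENTS = 79_051
REPORTED_COUNTS = {
    "English": 55_567,
    "Turkish": 4_026,
    "Russian": 2_916,
    "Arabic": 395,
}


def language_shares(counts: dict[str, int], total: int) -> dict[str, float]:
    """Return each language's fraction of the whole corpus."""
    return {lang: n / total for lang, n in counts.items()}


shares = language_shares(REPORTED_COUNTS, TOTAL_COMMENTS)

# Comments in all languages not listed by name in the description
# (derived here as a remainder; not a figure given by the author).
other_comments = TOTAL_COMMENTS - sum(REPORTED_COUNTS.values())

for lang, share in shares.items():
    print(f"{lang}: {share:.1%}")
print(f"Other languages: {other_comments / TOTAL_COMMENTS:.1%}")
```

English alone accounts for roughly 70% of the corpus, so analyses that compare languages may need to account for this imbalance, for example by stratified sampling or per-language evaluation.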
