A Multilingual Dataset of YouTube Comments on Global Conflicts (2024–2025) Across 138 Languages
Description
This dataset contains 79,051 YouTube comments and replies, collected in 2025, on seven global conflict topics, including the Israel-Hamas-Palestine conflict and the Ukraine-Russia war. The corpus spans 138 detected languages, of which English (55,567), Turkish (4,026), Russian (2,916), and Arabic (395) are the most represented. Data were collected from publicly available YouTube videos using the YouTube Data API and Selenium-based scraping. The dataset includes both top-level comments and nested replies, each with a relative timestamp, a conflict topic label, and an automatically detected language tag. Author usernames have been removed to comply with privacy regulations (KVKK/GDPR). The corpus is intended as an open resource for future research on multilingual conflict sentiment analysis, misinformation detection, and cross-cultural public opinion.
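As a rough illustration of how the fields described above might be used, the following Python sketch tallies records per language and separates top-level comments from replies. The field names (`text`, `language`, `topic`, `is_reply`), the topic label strings, and the sample rows are all assumptions made for this example; they are placeholders, not the dataset's documented schema or actual content.

```python
from collections import Counter

# Placeholder records. Field names, topic labels, and texts are hypothetical:
# the real column names and label values in the dataset may differ.
SAMPLE_ROWS = [
    {"text": "placeholder comment A", "language": "en", "topic": "topic-1", "is_reply": False},
    {"text": "placeholder comment B", "language": "en", "topic": "topic-2", "is_reply": True},
    {"text": "placeholder comment C", "language": "tr", "topic": "topic-1", "is_reply": False},
    {"text": "placeholder comment D", "language": "ru", "topic": "topic-2", "is_reply": False},
]


def count_by(rows, field):
    """Return a Counter mapping each value of `field` to its number of rows."""
    return Counter(row[field] for row in rows)


def split_comments_and_replies(rows):
    """Split rows into (top-level comments, nested replies) using `is_reply`."""
    top_level = [row for row in rows if not row["is_reply"]]
    replies = [row for row in rows if row["is_reply"]]
    return top_level, replies


if __name__ == "__main__":
    # Languages ordered from most to least frequent in the sample.
    print(count_by(SAMPLE_ROWS, "language").most_common())
    # Per-topic record counts.
    print(count_by(SAMPLE_ROWS, "topic"))
    top_level, replies = split_comments_and_replies(SAMPLE_ROWS)
    print(len(top_level), "top-level comments,", len(replies), "replies")
```

With the real files, the same two helpers would apply once the rows are loaded (for instance via `csv.DictReader`) and the assumed field names are replaced with the actual column names.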