PolyglotFakeFacts v2.0: A multilingual dataset of fake and real news across politics, security, and social domains

Published: 17 February 2026 | Version 1 | DOI: 10.17632/8yfrm6z9dx.1
Contributor:
Alexandru Ciobanu

Description

PolyglotFakeFacts v2.0 is an updated and expanded version of the original dataset (V1, DOI: 10.17632/gff8bmr4ff.1), published in February 2026. This version incorporates newly collected data, providing an enriched and more comprehensive multilingual resource for fake and real news detection research. The dataset now comprises 10,206 articles in total (4,912 fake news entries and 5,294 real news entries) collected from online sources across 18 languages: Arabic, Armenian, Azerbaijani, Bulgarian, Czech, English, Finnish, French, Georgian, German, Hungarian, Italian, Lithuanian, Romanian, Russian, Slovak, Spanish, and Swedish. Each article is available both in its original language and in an English-translated version. To ensure balance, real news was collected from sources covering each of the 18 languages represented in the fake news subset.

PolyglotFakeFacts is designed to support research on the detection of fake and real news across diverse domains such as politics, geopolitics, security, social issues, and military affairs. Fake news articles were sourced from outlets flagged by EUvsDisinfo, the flagship project of the East StratCom Task Force within the EEAS (European External Action Service), while real news was curated from official and editorially credible sources in each represented language.

The research hypothesis underpinning this dataset is that linguistic and contextual markers of misinformation can be systematically identified across multiple languages, enabling the development of more robust and generalizable fake news detection models. A key finding is that fake news articles often display recurring linguistic and structural patterns regardless of language, while real news tends to follow more standardized journalistic conventions. This suggests that multilingual approaches to fake news detection can leverage both cross-linguistic similarities and domain-specific features.
Each entry in the dataset is structured around ten fields: gathering date, news date, URL, source name, language, keywords, original headline, original text, English-translated text, and label (fake/non-fake). All samples were pre-processed to ensure consistent formatting and to remove duplicates. Researchers can use the dataset to train and evaluate machine learning and deep learning models for fake news classification, to perform cross-lingual and multilingual comparative studies, and to investigate the linguistic, semantic, and thematic characteristics of misinformation. By providing a curated, multilingual, and domain-diverse resource, PolyglotFakeFacts enables the community to develop more transparent, explainable, and resilient AI models for combating online misinformation.
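As an illustration of this ten-field record structure, the sketch below builds two toy records and computes the usual starting points for a classification experiment (class balance and language coverage). The snake_case field names and all sample values are assumptions for demonstration only, not the dataset's actual column headers or contents.

```python
from collections import Counter

# Assumed snake_case names for the ten fields described above -- check the
# dataset's actual headers before relying on these.
FIELDS = ["gathering_date", "news_date", "url", "source_name", "language",
          "keywords", "original_headline", "original_text",
          "english_translated_text", "label"]

# Two illustrative records (invented for demonstration, not real entries)
records = [
    dict(zip(FIELDS, ["2026-01-10", "2025-12-01", "https://example.org/a",
                      "Example Outlet", "Romanian", "politics", "Titlu",
                      "Text original", "Translated text", "fake"])),
    dict(zip(FIELDS, ["2026-01-11", "2025-12-02", "https://example.org/b",
                      "Example News", "German", "security", "Titel",
                      "Originaltext", "Translated text", "non-fake"])),
]

# Typical first checks before training a classifier:
label_counts = Counter(r["label"] for r in records)   # class balance
languages = {r["language"] for r in records}          # language coverage
print(label_counts, languages)
```

On the full dataset, the same two checks would surface the reported 4,912/5,294 fake/real split and the 18 covered languages.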

Files

Steps to reproduce

The dataset was constructed through a two-step data collection process that addressed fake news and real news articles separately.

For the fake news subset, articles were collected from online sources identified as unreliable by the EUvsDisinfo team, the flagship project of the East StratCom Task Force within the EEAS (European External Action Service), which specializes in identifying disinformation campaigns targeting the European Union, its member states, and neighboring countries. Their methodology involves monitoring news services in multiple languages, compiling disinformation cases, and exposing them on their dedicated portal alongside a short summary and a disproof. During collection, the following criteria were applied: online availability of the article, presence of both the headline and the full text, and availability of the publication date.

For the real news subset, articles were curated from official websites considered trustworthy in their respective countries, covering the same thematic domains as the fake news subset (social, political, military, economic, and security). The multilingual criterion was strictly observed, ensuring that real news articles are available in the same 18 languages as the fake news entries: Arabic, Armenian, Azerbaijani, Bulgarian, Czech, English, Finnish, French, Georgian, German, Hungarian, Italian, Lithuanian, Romanian, Russian, Slovak, Spanish, and Swedish.

All collected articles underwent a pre-processing pipeline that included removal of duplicate entries, normalization of text formatting, and translation of each article into English, with the original-language version preserved alongside the English translation.
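The deduplication and normalization steps can be sketched as follows. The exact procedure used to build the dataset is not published, so this is only a minimal illustration that assumes Unicode normalization plus whitespace collapsing as the normalization step and an exact match on normalized headline and body as the duplicate criterion.

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Unicode-normalize (NFC) and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()

def deduplicate(articles):
    """Drop exact duplicates, keyed on the normalized headline + body."""
    seen, unique = set(), []
    for art in articles:
        key = (normalize(art["original_headline"]),
               normalize(art["original_text"]))
        if key not in seen:
            seen.add(key)
            unique.append(art)
    return unique

# Toy input: the first two records differ only in whitespace, so they
# normalize to the same key and one of them is dropped.
sample = [
    {"original_headline": "Breaking  news", "original_text": "Body text."},
    {"original_headline": "Breaking news",  "original_text": "Body  text."},
    {"original_headline": "Other story",    "original_text": "Different body."},
]
print(len(deduplicate(sample)))  # 2
```

A real pipeline would likely add near-duplicate detection (e.g. shingling or embedding similarity), but exact matching on normalized text is the simplest baseline.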
Each entry was annotated with the following ten metadata fields: gathering date, news date, URL, source name, language, keywords, original headline, original text, English-translated text, and label (fake/non-fake). The resulting dataset consists of 4,912 fake news articles and 5,294 real news articles, for a combined total of 10,206 entries across 18 languages.
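A quick arithmetic check confirms that the reported subset sizes are internally consistent with the stated total, and shows the resulting class balance:

```python
# Reported figures: fake + real entries should sum to 10,206 articles.
fake_count, real_count = 4_912, 5_294
total = fake_count + real_count
fake_share = fake_count / total  # proportion of fake entries
print(total, round(fake_share, 3))
```

The near-even split (roughly 48% fake, 52% real) means standard accuracy is a reasonable first metric, though per-language balance should still be checked separately.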

Categories

Computer Science, Artificial Intelligence, Data Mining, Natural Language Processing, Multilingualism, Propaganda

Licence