PolyglotFakeFacts: A multilingual dataset of fake and real news across politics, security, and social domains
Description
PolyglotFakeFacts is a multilingual dataset designed to support research on detecting fake and real news across diverse domains such as politics, geopolitics, security, social issues, and military affairs. The research hypothesis underpinning the dataset is that linguistic and contextual markers of misinformation can be systematically identified across multiple languages, enabling the development of more robust and generalizable fake news detection models.

The dataset comprises a balanced collection of human-labeled fake news articles alongside verified real news drawn from trusted media outlets. Notably, it covers multiple languages and thematic areas, allowing researchers to explore how misinformation manifests in diverse cultural and geopolitical contexts. Among the key findings is that fake news articles often display recurring linguistic and structural patterns regardless of the language, while real news tends to follow more standardized journalistic conventions. This suggests that multilingual approaches to fake news detection could leverage both cross-linguistic similarities and domain-specific features.

The data was gathered through a combination of manual annotation by human experts for the fake news samples and curation of real news from reliable sources. All samples were pre-processed to ensure consistent formatting, remove duplicates, and include metadata such as language, domain, and label (fake/real).

The dataset can be used by researchers aiming to:
- train and evaluate machine learning and deep learning models for fake news classification,
- perform cross-lingual and multilingual comparative studies,
- investigate the linguistic, semantic, and thematic characteristics of misinformation.

By providing a curated, multilingual, and domain-diverse resource, PolyglotFakeFacts enables the community to develop more transparent, explainable, and resilient AI models for combating online misinformation.
Files
Steps to reproduce
1. Access the dataset
Download the PolyglotFakeFacts dataset from the public repository [DOI: 10.17632/gff8bmr4ff.1].

2. Data format
- The dataset is provided in .xlsx format.
- Each entry contains: gathering date, news date, language of the original news article, URL, web domain, keywords, news headline, news original text, English translated version, and label.

3. Environment setup
- Install Python ≥ 3.8 (or an equivalent language/tool).
- Install the required libraries: pandas, openpyxl (needed by pandas to read .xlsx files), scikit-learn, numpy (add more if needed).
- Alternatively, open the dataset in any statistical or ML environment (e.g., R, MATLAB, SPSS).

4. Load the dataset

import pandas as pd
df = pd.read_excel("PolyglotFakeFacts.xlsx")  # the dataset ships as .xlsx
print(df.head())

5. Data preprocessing
- Remove duplicates (already minimized in the provided dataset).
- Tokenize and clean the text (lowercasing, punctuation removal, stopword removal if needed).
- Handle multilingual data (e.g., use language-specific tokenizers).

6. Reproduce analysis
- Split the dataset into train/test sets (a suggested 80/20 split).
- Train classification models (e.g., logistic regression, SVM, transformers).
- Evaluate with accuracy, F1-score, and a confusion matrix.

7. Interpret results
- Compare performance across languages and domains.
- Assess which features or linguistic patterns are most predictive of fake vs. real news.
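The preprocessing, training, and evaluation steps above can be sketched as a minimal TF-IDF + logistic-regression baseline. This is an illustrative sketch, not the authors' pipeline: the helper `train_baseline` and the synthetic demo rows are invented for demonstration, and the actual column headers (e.g., "english translated version", "label") should be checked against the spreadsheet before substituting real data.

```python
# Minimal baseline sketch: TF-IDF features + logistic regression, 80/20 split.
# train_baseline() and the toy rows below are illustrative, not part of the dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def train_baseline(texts, labels, test_size=0.2, seed=42):
    """Fit a TF-IDF + logistic-regression classifier; return (model, test F1)."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=seed, stratify=labels
    )
    model = make_pipeline(
        TfidfVectorizer(lowercase=True, max_features=50_000),  # lowercasing step
        LogisticRegression(max_iter=1000),
    )
    model.fit(X_train, y_train)
    return model, f1_score(y_test, model.predict(X_test), pos_label="fake")


# Toy demonstration rows; with the real data, pass e.g.
# df["english translated version"], df["label"] instead.
texts = ["shocking secret cure revealed"] * 10 + ["parliament passes budget bill"] * 10
labels = ["fake"] * 10 + ["real"] * 10
model, f1 = train_baseline(texts, labels)
print(f"Test F1 (fake class): {f1:.2f}")
```

For cross-lingual comparisons, the same function can be run once per language subset (grouping on the dataset's language column) and the per-language F1 scores compared side by side.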
Institutions
- Universitatea Politehnica din Bucuresti, Facultatea de Electronica, Telecomunicatii si Tehnologia Informatiei, Bucuresti