A Multilingual & Multimodal Text and Image Corpus Dataset for Political Misinformation

Published: 25 April 2025 | Version 1 | DOI: 10.17632/x356jrj2cz.1

Description

This dataset is an annotated multimodal corpus designed to support robust fake-news detection research. It consists of two separate but complementary components: an image directory and a text spreadsheet. The image directory is organized by folder: each top-level folder is named for a topic, and within each topic folder the images are placed in real and fake subdirectories according to expert labeling. This layout makes it straightforward to load and process images for supervised learning or cross-modal testing. The text data are kept in a single Excel sheet in which each record is one piece of news. Four columns hold the title, the source, the full news report, and the real/fake indicator. Together, these modalities cover a broad range of time periods and topics, including social-media posts, mainstream-media news reports, and election-related posts. They allow models to be trained on both linguistic aspects (sensational or objective tone, grammaticality, metadata quality) and visual aspects (original vs. photo-manipulated images). By combining a shallow folder hierarchy for images with an annotated spreadsheet for text, the dataset is well specified, reproducible, and easy to feed into a downstream machine-learning pipeline.
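The layout described above can be read with a few lines of Python. The following is a minimal sketch, not an official loader. It assumes the label subdirectories are literally named `real` and `fake`, that images use common extensions, and that the spreadsheet can be opened with pandas (which needs `openpyxl` for `.xlsx` files). The function names, the extension list, and any file paths are hypothetical; the actual file and column names should be checked against the downloaded data.

```python
from pathlib import Path

# Assumed image extensions; adjust to match the files actually in the dataset.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}


def index_images(root):
    """Walk root/<topic>/<real|fake>/... and return one record per image.

    Each record is a dict with the image path, its topic (the top-level
    folder name), and its label (the name of the real/fake subdirectory).
    """
    root = Path(root)
    records = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in IMAGE_EXTS:
            continue
        parts = path.relative_to(root).parts
        if len(parts) < 3:
            continue  # not inside a <topic>/<label>/ folder
        topic, label = parts[0], parts[1].lower()
        if label in {"real", "fake"}:  # assumed subdirectory names
            records.append({"path": str(path), "topic": topic, "label": label})
    return records


def load_text(xlsx_path):
    """Read the text spreadsheet into a DataFrame, one row per news item.

    Column names are left exactly as they appear in the file, since the
    description names the four fields but not their exact headers.
    """
    import pandas as pd  # imported lazily; reading .xlsx also requires openpyxl

    return pd.read_excel(xlsx_path)
```

Calling `index_images` on the unpacked image directory yields a flat list that can be split by topic or label, while `load_text` returns the text records for separate or joint use.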

Institutions

  • Vishwakarma Institute of Information Technology

Categories

Politics, Natural Language Processing, Machine Learning, Multimodality, Convolutional Neural Network, Public Sentiment, Sentiment Analysis, Large Language Model, Multilingual LLM

Licence