Multi-Fake-DetectiVE
Description
The dataset includes social media posts and news articles, each containing both a textual and a visual component, concerning the Russo-Ukrainian war that started in February 2022. The dataset was collected to support two distinct sub-tasks: Multimodal Fake News Detection, and Cross-modal Relation Classification in fake and real news.

Given a piece of content (e.g., a social media post or a news article) that includes both a visual and a textual component, the first sub-task aims to detect whether the content is real or fake news. The second sub-task aims to understand how the visual and textual components of a news item can influence each other: given a text and an accompanying image, the goal is to determine whether or not the combination of the two is meant to mislead the reader's interpretation of one or the other.

The data for the two sub-tasks are stored in two separate sub-folders. Each sub-folder includes: (i) a training set, which contains data collected from February 2022 to September 2022, (ii) a contemporary test set, which includes data collected in the same time window as the training set, and (iii) a future test set, which contains data collected in a subsequent time window, specifically from October 2022 to December 2022.
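The two time windows described above can be sketched as a small date check. This is only an illustration: the description gives month-level boundaries, so the exact start and end days below are assumptions, and the function name `time_window` and its return strings are hypothetical rather than part of the dataset. The description does not say how training and contemporary test items were separated within the shared window, so the sketch does not attempt it.

```python
from datetime import date
from typing import Optional

# Assumed day-level boundaries; the dataset description only states months.
EARLY_START = date(2022, 2, 1)
EARLY_END = date(2022, 9, 30)
FUTURE_START = date(2022, 10, 1)
FUTURE_END = date(2022, 12, 31)


def time_window(collected_on: date) -> Optional[str]:
    """Return which collection window a date falls into, or None if outside both.

    "training_or_contemporary" covers the window shared by the training set
    and the contemporary test set; "future" covers the future test set.
    """
    if EARLY_START <= collected_on <= EARLY_END:
        return "training_or_contemporary"
    if FUTURE_START <= collected_on <= FUTURE_END:
        return "future"
    return None
```

For example, `time_window(date(2022, 11, 5))` falls in the window of the future test set.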
Steps to reproduce
The dataset was collected via the Twitter APIs and then annotated through crowdsourcing. First, 920,054 tweets and 128,611 news articles were downloaded from Twitter over a time span from February to December 2022, using keywords related to the Russo-Ukrainian war. A manually collected set of already verified fake news and misleading claims was then used to extract from this data a number of news items that were likely to be fake. Second, a human annotation process was carried out through Prolific. For each sub-task, five annotators were given the verified fake news as context and asked to label a few news items. Only those instances for which at least three of the five annotators provided the same label were kept.
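The agreement filter in the last step can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `agreed_label` and the label strings in the example are hypothetical, and only the rule itself (at least three of five annotators giving the same label) comes from the description above.

```python
from collections import Counter
from typing import Optional, Sequence


def agreed_label(labels: Sequence[str], min_agreement: int = 3) -> Optional[str]:
    """Return the label chosen by at least `min_agreement` annotators, else None.

    With five annotators and a threshold of three, at most one label can
    reach the threshold, so the result is unambiguous.
    """
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_agreement else None


# Hypothetical annotations for two instances (five annotators each).
kept = agreed_label(["fake", "fake", "real", "fake", "real"])      # three agree
dropped = agreed_label(["fake", "fake", "real", "real", "other"])  # no label reaches three
```

An instance is kept only when the function returns a label; instances for which it returns `None` would be discarded.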