COVID19 Rumor Detection

Published: 6 December 2022 | Version 1 | DOI: 10.17632/xcz8vb448y.1
Contributors:
Matthias Bogaert

Description

This data set contains tweets about the COVID-19 pandemic. The tweets were collected between February 12, 2020 and June 15, 2020 based on the hashtags #CoronaOutbreak, #CoronaVirus, #CoronaVirusOutbreak, #COVID19, #COVID-19, #COVID2019, and #SARSCoV2. The goal of the data set is to detect whether a tweet is a rumor, as given by the 'label' column: a tweet identified as a rumor is labeled 1, and 0 otherwise.

The tweets were labeled by two independent annotators using the following guidelines. Whether a tweet is a rumor depends on three aspects: (1) A rumor is a piece of information that is unverified or not confirmed by official sources. In other words, it does not matter whether the information later turns out to be true or false. (2) More specifically, a tweet is a rumor if the information is unverified at the time of posting. (3) For a tweet to be a rumor, it should contain an assertion, meaning that the author of the tweet commits to the truth of the message. In sum, the annotators marked a tweet as a rumor if it consisted of an assertion giving information that was unverified at the time of posting. In practice, to check whether the information in a tweet was verified or confirmed by official sources at the moment of tweeting, the annotators used BBC News and Reuters. After all the tweets were labeled, the annotators revisited the tweets they disagreed on to produce the final label.

Besides the label indicating whether a tweet is a rumor (i.e., 'label'), the data set contains the tweet itself (i.e., 'full_text') and additional metadata (e.g., 'created_at', 'favorite_count'). In total, the data set contains 4,612 observations, of which 485 (11%) are identified as rumors. Researchers can use this data set to build rumor detection models (e.g., statistical, machine learning, and deep learning models) using both unstructured (i.e., textual) and structured data.
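As a quick sanity check after downloading, the class balance described above can be recomputed from the 'label' column. The following is a minimal sketch, not part of the published data set: it assumes the file is a UTF-8 CSV with a header row and that 'label' holds the strings "0" and "1". The helper names (`load_rows`, `label_counts`, `rumor_share`) are hypothetical and introduced only for illustration.

```python
import csv
from collections import Counter
from typing import Iterable, Mapping


def load_rows(path: str) -> list[dict[str, str]]:
    """Read every row of the file into a list of dicts keyed by column name."""
    # Assumption: the download is a UTF-8 CSV file with a header row.
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def label_counts(rows: Iterable[Mapping[str, str]]) -> Counter:
    """Count rows per value of the 'label' column (1 = rumor, 0 = not a rumor)."""
    # Assumption: labels are stored as the integer strings "0" and "1".
    return Counter(int(row["label"]) for row in rows)


def rumor_share(counts: Counter) -> float:
    """Fraction of rows labeled as rumor; 0.0 when there are no rows."""
    total = sum(counts.values())
    return counts[1] / total if total else 0.0
```

Calling `label_counts(load_rows(path))` on the downloaded file, with `path` set to wherever it was saved, should report 485 rows with label 1 out of 4,612 if the file matches the description. That is a share of about 10.5%, which the description rounds to 11%. With roughly one rumor per nine non-rumors, plain accuracy is a weak yardstick for models trained on this data, so class-aware metrics such as precision and recall on the rumor class are likely more informative.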

Institutions

Universiteit Gent

Categories

Social Media

Funding

Bijzonder Onderzoeksfonds UGent

BOF/STA/202009/001

Fonds Wetenschappelijk Onderzoek

12ZM923N

License