Annotated Tweet Datasets for Cyberattack Relevance

Published: 10 November 2023 | Version 2 | DOI: 10.17632/6tx3ndy6d9.2
Contributor:
Hyuk-Yoon Kwon

Description

We created a smaller dataset of 2,000 sample tweets from the raw tweet dataset. This curated dataset, called the "Cyberattack Annotated Sample Dataset", is stored in an Excel file with the following columns: "Username", "ID", "Tweet", "Date", "Cyberattack Relevance", and "Rule-Based CA Relevance". The "Cyberattack Relevance" column takes two values: CA-positive, meaning the tweet is related to cyberattacks, and CA-negative, meaning it is not. The values in this column were assigned manually using the Cybersecurity Relevant Term List. Of the 2,000 tweets, 528 are labeled CA-positive and 1,472 are labeled CA-negative. A second label column, "Rule-Based CA Relevance", takes the same two values; we note that the labels in the "Cyberattack Relevance" column are based on manual annotation. Finally, we applied semi-supervised annotation based on the Cyberattack Annotated Sample Dataset to the original full datasets and saved the annotated result as Excel files, called the "Cyberattack Annotated Full Dataset". Of its 8,577,713 tweets, 7,017,392 received the label CA-negative and 1,560,321 received the label CA-positive.
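As a rough illustration of how the two label columns in the sample dataset might be summarized and compared, the sketch below counts labels and measures how often the manual and rule-based labels agree. It is not part of the dataset's own tooling. The helper names `label_counts` and `agreement_rate` are hypothetical, and the rows are made-up placeholders rather than real records; in practice the rows would first have to be read from the Excel file with a spreadsheet library, which is assumed here and not shown.

```python
from collections import Counter

# Column names as given in the dataset description.
MANUAL = "Cyberattack Relevance"
RULE_BASED = "Rule-Based CA Relevance"


def label_counts(rows, column):
    """Count how often each label occurs in the given column."""
    return Counter(row[column] for row in rows)


def agreement_rate(rows, col_a=MANUAL, col_b=RULE_BASED):
    """Fraction of rows where the two label columns hold the same value."""
    if not rows:
        return 0.0
    same = sum(1 for row in rows if row[col_a] == row[col_b])
    return same / len(rows)


# Made-up rows for illustration only (not taken from the dataset).
demo_rows = [
    {MANUAL: "CA-positive", RULE_BASED: "CA-positive"},
    {MANUAL: "CA-negative", RULE_BASED: "CA-positive"},
    {MANUAL: "CA-negative", RULE_BASED: "CA-negative"},
    {MANUAL: "CA-negative", RULE_BASED: "CA-negative"},
]

counts = label_counts(demo_rows, MANUAL)
rate = agreement_rate(demo_rows)

# Consistency check on the counts published in the description.
sample_total = 528 + 1472
full_total = 7_017_392 + 1_560_321
```

The two totals at the end reproduce the figures stated in the description: the sample labels sum to 2,000 and the full-dataset labels sum to 8,577,713.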

Categories

Annotation, Cyber Attack