Annotated Tweet Datasets for Cyberattack Relevance
The Cybersecurity Relevant Term List was compiled to provide a comprehensive and representative set of terms commonly used in the cybersecurity industry, in support of the analysis of cybersecurity-related data. The list consists of 269 phrases, compiled by manually analyzing tweets that contain the term "exploit". We also created a smaller dataset of 2,000 sample tweets drawn from the raw tweet dataset. This curated dataset, called the "Cyberattack Annotated Sample Dataset", is stored in an Excel file with three columns: "ID", "Cyberattack Relevance", and "Rule-Based CA Relevance". Finally, we annotated the original dataset and saved it as a CSV file. Of its 3,304,090 tweets, 1,984,523 (about 60.1%) received the label 'CA-negative', while 1,319,567 (about 39.9%) received the label 'CA-positive'.
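As an illustration only, the sketch below shows one way a rule-based relevance label could be derived from a term list, and checks the reported label counts. The text does not specify the matching rule, so whole-phrase, case-insensitive matching is an assumption here. The three terms, the example tweets, and the function names (build_matcher, label_tweet, label_shares) are hypothetical and are not taken from the dataset.

```python
import re

# Hypothetical stand-ins: the actual list has 269 phrases, not reproduced here.
SAMPLE_TERMS = ["zero-day", "remote code execution", "ransomware"]


def build_matcher(terms):
    """Compile one case-insensitive pattern that matches any term as a whole phrase."""
    # Longest terms first so that longer phrases win over their substrings.
    escaped = sorted((re.escape(t) for t in terms), key=len, reverse=True)
    return re.compile(r"(?<!\w)(?:" + "|".join(escaped) + r")(?!\w)", re.IGNORECASE)


def label_tweet(text, matcher):
    """Assumed rule: a tweet is CA-positive if it contains at least one listed term."""
    return "CA-positive" if matcher.search(text) else "CA-negative"


def label_shares(counts):
    """Convert absolute label counts into fractions of the total."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


matcher = build_matcher(SAMPLE_TERMS)

# Invented example tweets; both contain "exploit", only one is attack-related.
tweets = [
    "New exploit enables remote code execution on unpatched routers",
    "Players exploit a glitch to skip the final boss",
]
labels = [label_tweet(t, matcher) for t in tweets]

# Label counts as reported for the fully annotated dataset.
reported = {"CA-negative": 1_984_523, "CA-positive": 1_319_567}
shares = label_shares(reported)

for tweet, label in zip(tweets, labels):
    print(f"{label}: {tweet}")
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

The two example tweets show why the term "exploit" alone does not settle relevance: both contain it, but only the first also matches a listed phrase. The reported counts sum to the stated total of 3,304,090 tweets.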