Health-related Spam Campaigns

Published: 15 December 2020 | Version 2 | DOI: 10.17632/rgrvt5x4tk.2


The dataset was collected from Arabic trending hashtags using Twitter’s standard search Application Programming Interface (API) for the period from May 2018 to November 2020. The dataset consists of 3000 tweets: 2500 for training and 500 for testing. Fourteen features were extracted for each tweet, and there are two labels: spam (1) and non-spam (2). The dataset was annotated by three annotators, and the Fleiss’s Kappa achieved was 0.96. One of the annotators is a PhD student specialising in pragmatics who has experience in Twitter data annotation. The second annotator is a Computer Science student who has worked on annotating Twitter datasets. The tweets in the dataset were shuffled and our labels were removed. The dataset was collected to analyse the characteristics of health-related advertisement campaigns on Twitter Arabic hashtags. One of the notable findings is that some spam tweets were posted by old accounts that have only a few tweets, unlike regular accounts, where the number of tweets increases as the account gets older. To test this hypothesis, we used Spearman’s rank test to measure the strength of the correlation between account age and number of posts (statuses). The results of the Spearman’s test showed a positive correlation between the two features for non-spam accounts. Thus, a new feature, avg_posts, was designed to improve the detection of spam tweets. This special type of spam tweet was found to originate from hijacked accounts. However, these hijacked accounts are hard to detect with user-behaviour-based detectors, as they do not have enough posts to be analysed. Finally, the dataset was used to build an ML-based detector, which outperforms user-behaviour-based detectors (e.g., COMPA and Nautua).
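The correlation check and the avg_posts feature described above can be sketched as follows. This is a minimal, standard-library-only illustration, not the project's code: the function names are hypothetical, and the definition of avg_posts as posts per day of account age is an assumption, since the description does not state the exact formula.

```python
from datetime import date


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


def avg_posts(statuses_count, created, collected):
    """Assumed definition: number of posts per day of account age."""
    age_days = max((collected - created).days, 1)  # avoid division by zero
    return statuses_count / age_days


# Toy values for illustration only; they are not taken from the dataset.
account_age_days = [100, 400, 900, 2000]
number_of_posts = [50, 300, 1200, 5000]
rho = spearman_rho(account_age_days, number_of_posts)

# An old account with very few posts gets a low avg_posts value.
old_quiet_account = avg_posts(10, date(2012, 1, 1), date(2020, 1, 1))
```

Under this assumed definition, an account that is years old but has posted only a handful of tweets yields an avg_posts value close to zero, which is the pattern the description associates with hijacked accounts.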


Steps to reproduce

The dataset was collected from Arabic trending hashtags using the tweepy Python library. The code was implemented in a Jupyter notebook. The code for the project can be found in ( annotators and I label the datasets. The guidelines provided to the annotators are as follows:
1. spam: tweets that advertise healthcare products, such as diet, weight loss, skin, sex, hair, etc.
2. non-spam: any tweets that do not advertise healthcare products.
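Agreement between annotators who apply these guidelines is reported as Fleiss's Kappa. A minimal sketch of that statistic is given below, assuming every tweet is labelled by the same number of annotators and using the labels 1 (spam) and 2 (non-spam). The function name and the toy labels are hypothetical and are not taken from the project code or the dataset.

```python
def fleiss_kappa(ratings, categories):
    """Fleiss's kappa for items each rated by the same number of raters.

    ratings: list of per-item label lists, e.g. [[1, 1, 2], [2, 2, 2]].
    categories: list of all possible labels, e.g. [1, 2].
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])

    # counts[i][j] = number of raters who gave item i the label categories[j].
    counts = [[item.count(c) for c in categories] for item in ratings]

    # Observed agreement for each item, then averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Expected agreement by chance, from the overall label proportions.
    p_cats = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_e = sum(p * p for p in p_cats)

    return (p_bar - p_e) / (1 - p_e)


# Toy labels from three annotators for four tweets (1 = spam, 2 = non-spam).
toy_labels = [[1, 1, 1], [2, 2, 2], [1, 1, 1], [2, 2, 2]]
kappa = fleiss_kappa(toy_labels, categories=[1, 2])
```

Note that the statistic is undefined when every annotator assigns the same single label to all items, because the chance agreement is then 1 and the denominator becomes zero.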


University of York


Social Media, Machine Learning, Twitter