Dataset of Arabic Spam and Ham Tweets

Published: 22 January 2024 | Version 3 | DOI: 10.17632/86x733xkb8.3
Sanaa Kaddoura, Safaa Henno


The following paper is a descriptor for the dataset; cite it when you use the data: Kaddoura, Sanaa, and Safaa Henno. "Dataset of Arabic spam and ham tweets." Data in Brief 52 (2024): 109904.

The data was analyzed in this article: Kaddoura, S., Alex, S. A., Itani, M., Henno, S., AlNashash, A., & Hemanth, D. J. (2023). Arabic spam tweets classification using deep learning. Neural Computing and Applications, 1-14.

The data were collected from Twitter using the Twitter API between January 27, 2021, and March 10, 2021. The information downloaded for each tweet is: Tweet ID, DateTime, URL, Tweet Text, User Name, Location, Replied Tweet ID, Replied Tweet User ID, Replied Tweet Username, Retweet Count, Favorite Count, and Favorited.

The dataset contains two files.

The first file is "Dataset of Arabic Spam and Ham Tweets.xlsx". This file contains the original collected dataset. The dataset contains 13241 records, each representing a tweet. The tweets are labeled either Ham or Spam, where Ham means a non-spam tweet. There are 1924 Spam tweets and 11299 Ham tweets. The tweets are unique, i.e., there are no repeated tweet records.

The second file is "Augmented_SpamHamTweets.xlsx". Contextual augmentation was applied to this dataset to increase the size of the minority class, which is the "spam" class. This file is intended to help obtain better and more reliable results when applying machine learning to the dataset. It contains 11030 ham tweets and 15128 spam tweets.
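
The class balance implied by the counts above can be checked with a few lines of Python. This is a minimal sketch using only the per-class counts stated in the description; the names `ORIGINAL`, `AUGMENTED`, `class_shares`, and `count_labels` are hypothetical and not part of the dataset. Because the description does not give the name of the label column in the .xlsx files, `count_labels` takes a plain list of label strings rather than reading a file. Note that the stated per-class counts for the first file sum to 13223, while the description gives 13241 records; the sketch uses the per-class counts.

```python
from collections import Counter

# Per-class counts as stated in the dataset description (not read from the files).
ORIGINAL = {"Ham": 11299, "Spam": 1924}
AUGMENTED = {"Ham": 11030, "Spam": 15128}


def class_shares(counts):
    """Return each label's fraction of the total number of tweets."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def count_labels(labels):
    """Count labels from an iterable of strings, ignoring case and outer spaces.

    Assumption: labels are spelled "Ham"/"Spam" in some casing; the actual
    spelling and column name in the .xlsx files are not given in the description.
    """
    return Counter(label.strip().capitalize() for label in labels)


if __name__ == "__main__":
    for name, counts in (("original", ORIGINAL), ("augmented", AUGMENTED)):
        shares = {k: round(v, 3) for k, v in class_shares(counts).items()}
        print(name, "total:", sum(counts.values()), "shares:", shares)
```

With these counts, spam is the minority class in the original file and the majority class in the augmented file, which may matter when choosing evaluation metrics or a train/test split.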



Zayed University


Computer Science, Cybersecurity, Data Science, Natural Language Processing, Machine Learning, Arabic Language


Zayed University, Start-up Grant [Grant Number R20081]