Dataset of Arabic Spam and Ham Tweets

Published: 22 January 2024 | Version 3 | DOI: 10.17632/86x733xkb8.3
Contributors:
Sanaa Kaddoura, Safaa Henno

Description

The following paper describes the dataset; cite it when you use the data: Kaddoura, Sanaa, and Safaa Henno. "Dataset of Arabic spam and ham tweets." Data in Brief 52 (2024): 109904. Paper link: https://www.sciencedirect.com/science/article/pii/S2352340923009472

The data were analyzed in this article: Kaddoura, S., Alex, S. A., Itani, M., Henno, S., AlNashash, A., & Hemanth, D. J. (2023). Arabic spam tweets classification using deep learning. Neural Computing and Applications, 1-14.

The data were collected from Twitter using the Twitter API between January 27, 2021, and March 10, 2021. The information downloaded for each tweet is Tweet ID, DateTime, URL, Tweet Text, User Name, Location, Replied Tweet ID, Replied Tweet User ID, Replied Tweet Username, Retweet Count, Favorite Count, and Favorited.

The dataset contains two files.

The first file, "Dataset of Arabic Spam and Ham Tweets.xlsx", contains the original collected dataset. It contains 13241 records, each representing one tweet. The tweets are labeled either Ham or Spam, where Ham means a non-spam tweet. There are 1924 Spam tweets and 11299 Ham tweets. The tweets are unique, i.e., there are no repeated tweet records.

The second file, "Augmented_SpamHamTweets.xlsx", is the result of applying contextual augmentation to the dataset to increase the size of the minority class, which is the "spam" class. This file is intended to give better and more reliable results when machine learning is applied to the dataset. It contains 11030 ham tweets and 15128 spam tweets.
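To make the effect of the augmentation concrete, the class balance of both files can be worked out from the per-class counts stated above. The following is a minimal sketch, not part of the published dataset or its accompanying code. The function names `class_shares` and `count_labels` are hypothetical, and the assumption that labels appear as the strings "Ham" and "Spam" (possibly with different casing) should be checked against the downloaded files. Note that the shares are computed from the per-class counts only; for the original file these sum to 13223 rather than the stated 13241 records.

```python
from collections import Counter

# Per-class counts as stated in the dataset description.
ORIGINAL = {"Ham": 11299, "Spam": 1924}
AUGMENTED = {"Ham": 11030, "Spam": 15128}


def class_shares(counts):
    """Return each label's fraction of the total number of labeled tweets."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def count_labels(labels):
    """Count label strings, ignoring case and surrounding whitespace.

    Hypothetical helper: assumes the label column has already been read
    from a downloaded file into a list of strings.
    """
    return dict(Counter(label.strip().capitalize() for label in labels))


if __name__ == "__main__":
    for name, counts in (("original", ORIGINAL), ("augmented", AUGMENTED)):
        shares = class_shares(counts)
        print(name, {label: f"{share:.1%}" for label, share in shares.items()})
```

By these counts, spam makes up roughly 15% of the labeled tweets in the original file and roughly 58% in the augmented file, so the augmented file reverses the imbalance rather than producing an exactly even split. Passing the label column of a downloaded copy through `count_labels` and comparing the result with the constants above is one way to confirm that the file in hand matches this description.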

Files

Institutions

Zayed University

Categories

Computer Science, Cybersecurity, Data Science, Natural Language Processing, Machine Learning, Arabic Language

Funding

Zayed University

Start-up Grant [Grant Number R20081]

Licence