A Balanced Dataset for Spam and Smishing Detection using Large Language Models (LLMs)

Published: 4 July 2025 | Version 1 | DOI: 10.17632/vmg875v4xs.1
Contributors: Muhammad Islam

Description

This dataset contains 10,191 labeled SMS messages for training and testing spam and smishing detection machine learning models. A large language model (LLM) was trained to create this dataset.

Structure

This dataset contains five columns:
• LABEL: A categorical value indicating the type of message. The values are:
  o Ham: Benign (non-malicious) message
  o Spam: Unsolicited or junk message
  o Smishing: SMS phishing message to deceive recipients into giving away their sensitive personal information
• TEXT: The content of the message
• URL: Indicates whether a URL is present in the message (Yes/No)
• EMAIL: Indicates whether an email address is present in the message (Yes/No)
• PHONE: Indicates whether a phone number is present in the message (Yes/No)

Key Features

The dataset is balanced to prevent bias in classification tasks:
• ham: 3,397 messages
• spam: 3,397 messages
• smishing: 3,397 messages

Source and Citation

The following publicly available dataset is used for training of the LLM:
Mishra, Sandhya; Soni, Devpriya (2022), “SMS PHISHING DATASET FOR MACHINE LEARNING AND PATTERN RECOGNITION”, Mendeley Data, V1, doi: 10.17632/f45bkkt8pr.1

Use Cases

• Text classification research
• Phishing and fraud detection models
• LLM fine-tuning or prompt engineering for safety and content moderation
• Educational demonstrations in cybersecurity, machine learning (ML) or natural language processing (NLP)
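The balance claim can be checked directly from the LABEL column. The following is a minimal sketch, assuming the CSV has a header row with the column names listed in the description; the function names and the three sample rows are hypothetical and are not taken from the dataset. Labels are lower-cased before counting because the description writes them both as "Ham" and as "ham".

```python
import csv
import io
from collections import Counter


def label_counts(csv_file):
    """Count messages per label, ignoring case and surrounding whitespace."""
    reader = csv.DictReader(csv_file)
    return Counter(row["LABEL"].strip().lower() for row in reader)


def is_balanced(counts):
    """Return True when every label occurs the same number of times.

    An empty Counter is reported as not balanced.
    """
    return len(set(counts.values())) == 1


# Hypothetical sample rows for illustration only (not from the dataset).
SAMPLE = (
    "LABEL,TEXT,URL,EMAIL,PHONE\n"
    "Ham,See you at the station,No,No,No\n"
    "Spam,Big sale this weekend,No,No,No\n"
    "Smishing,Your account is locked,No,No,No\n"
)

if __name__ == "__main__":
    counts = label_counts(io.StringIO(SAMPLE))
    print(dict(counts), is_balanced(counts))
```

To run the same check on the published file, pass an open file handle instead of the in-memory sample, for example `open("Dataset_10191.csv", newline="", encoding="utf-8")`; the encoding is an assumption. With three classes of 3,397 messages each, the total should be 3 × 3,397 = 10,191.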

Steps to reproduce

1. Create a ham, spam, or smishing SMS message using a trained Large Language Model (LLM) or, alternatively, collect real examples.
2. Verify and assign the correct label (Ham, Spam, or Smishing) to each message in the LABEL column.
3. Save the labeled messages in a CSV file named Dataset_10191.csv with columns: LABEL and TEXT.
4. Run the Python script to detect the presence of email addresses, URLs, and phone numbers based on regular expressions for each message.
5. Review the generated file (for example, Dataset_10191_Reproduced_YYYYMMDD.csv), which will include the original data plus URL, EMAIL, and PHONE columns.
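Steps 4 and 5 can be sketched roughly as follows. The dataset's own Python script and its regular expressions are not reproduced on this page, so the three patterns below are illustrative assumptions, and the helper names (`yes_no`, `add_flags`, `annotate`) are hypothetical. The sketch reads rows with LABEL and TEXT columns and writes them back out with Yes/No values in the URL, EMAIL, and PHONE columns.

```python
import csv
import io
import re

# Assumed patterns for illustration; the original script's expressions may differ.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# At least 8 characters, starting and ending with a digit, optional leading "+".
PHONE_RE = re.compile(r"(?<!\w)\+?\d[\d\s().-]{6,}\d(?!\w)")

OUTPUT_COLUMNS = ["LABEL", "TEXT", "URL", "EMAIL", "PHONE"]


def yes_no(pattern, text):
    """Return "Yes" if the pattern occurs anywhere in the text, else "No"."""
    return "Yes" if pattern.search(text) else "No"


def add_flags(row):
    """Return a copy of the row with URL, EMAIL, and PHONE columns added."""
    text = row["TEXT"]
    return {
        "LABEL": row["LABEL"],
        "TEXT": text,
        "URL": yes_no(URL_RE, text),
        "EMAIL": yes_no(EMAIL_RE, text),
        "PHONE": yes_no(PHONE_RE, text),
    }


def annotate(in_file, out_file):
    """Read LABEL/TEXT rows from in_file and write flagged rows to out_file."""
    reader = csv.DictReader(in_file)
    writer = csv.DictWriter(out_file, fieldnames=OUTPUT_COLUMNS)
    writer.writeheader()
    for row in reader:
        writer.writerow(add_flags(row))


if __name__ == "__main__":
    # Hypothetical input rows; real use would open Dataset_10191.csv instead.
    source = io.StringIO(
        "LABEL,TEXT\n"
        "Smishing,Verify at http://example.com/login or call 555-123-4567\n"
        "Ham,See you at 5 pm\n"
    )
    target = io.StringIO()
    annotate(source, target)
    print(target.getvalue())
```

Simple patterns like these trade precision for readability: the phone pattern, for instance, will also flag other long digit runs such as order numbers, so the output is worth spot-checking during the review in step 5.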

Institutions

George Washington University

Categories

Cybersecurity, Machine Learning, Mobile Device, e-Mail, Large Language Model

Licence