SMS PHISHING DATASET FOR MACHINE LEARNING AND PATTERN RECOGNITION

Published: 20 June 2022 | Version 1 | DOI: 10.17632/f45bkkt8pr.1
Contributors:
Sandhya Mishra

Description

The dataset is a set of labelled text messages collected for SMS phishing (smishing) research. It contains 5971 text messages, each labelled as legitimate (ham), spam, or smishing: 4844 ham messages, 489 spam messages, and 638 smishing messages. The raw message content can be used as labelled data for deep learning or for extracting further attributes. The dataset also contains attributes extracted from malicious messages that can be used to classify messages as malicious or legitimate. The Python code used to extract these attributes is included. The data were collected by converting images obtained from the Internet to text using Python code. Attributes were selected based on their relevance. The dataset attributes are:

LABEL - Classification label categorizing the message as ham, spam, or smishing.
TEXT - The raw content of the message.
URL - Whether the message contains a URL.
EMAIL - Whether the message contains an email ID.
PHONE - Whether the message contains a phone number.

A snapshot of the dataset and frequency charts of the attributes are also attached.
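As a rough illustration of how the URL, EMAIL, and PHONE attributes could be derived from the TEXT attribute, here is a minimal sketch. It is not the extraction code attached to the dataset: the regular expressions, the function name extract_flags, and the use of True/False values are all assumptions made for this example, and the dataset's own rules and value encoding may differ.

```python
import re

# Illustrative patterns only (assumptions); the attached extraction code may use different rules.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# At least 7 digits, optionally separated by a space, dot, or hyphen, with an optional leading "+".
PHONE_RE = re.compile(r"\+?\d(?:[\s.-]?\d){6,}")


def extract_flags(text: str) -> dict:
    """Return presence flags for a single message (hypothetical helper)."""
    return {
        "URL": bool(URL_RE.search(text)),
        "EMAIL": bool(EMAIL_RE.search(text)),
        "PHONE": bool(PHONE_RE.search(text)),
    }


if __name__ == "__main__":
    # Made-up messages, not taken from the dataset.
    samples = [
        "Your account is locked. Verify at http://example.com/login now",
        "Call 0800 123 4567 to claim your prize",
        "Send your details to claims@example.org",
    ]
    for message in samples:
        print(extract_flags(message), message)
```

Simple patterns like these trade precision for readability; for example, any long run of digits is treated as a phone number, so a real pipeline would likely need stricter rules.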

Files

Steps to reproduce

1. Browse the Internet using Google Chrome.
2. Extract text data, images, and screenshots.
3. Convert images to text using customized Python code.
4. Label the data based on the source of extraction.
5. Clean the data using customized Python code.
6. Extract attributes such as URL, EMAIL, and PHONE using customized Python code.

We offer an experimental study of this dataset in the following papers, which present evaluation results for smishing detection.

[1] Sandhya Mishra, Devpriya Soni, Smishing Detector: A security model to detect smishing through SMS content analysis and URL behavior analysis, Future Generation Computer Systems, (2020), https://doi.org/10.1016/j.future.2020.03.021
[2] Sandhya Mishra, Devpriya Soni, DSmishSMS: A System to Detect Smishing SMS, Neural Computing and Applications, (2021), https://doi.org/10.1007/s00521-021-06305-y
[3] Sandhya Mishra, Devpriya Soni, Implementation of ‘Smishing Detector’: An Efficient Model for Smishing Detection using Neural Network, S N Computer Science, (2021), https://doi.org/10.1007/s42979-022-01078-0

This smishing dataset was created by Ms. Sandhya Mishra and Dr. Devpriya Soni. Email: sandhyashankar20@gmail.com or devpriya.soni@jiit.ac.in

We would appreciate it if you would cite “SMS PHISHING DATASET FOR MACHINE LEARNING AND PATTERN RECOGNITION” if you find this dataset useful.

We would like to thank Tiago A. Almeida and Dr. Gunikhan Sonowal for making the text messages available.

© Sandhya Mishra and Dr. Devpriya Soni, 2022.

Institutions

Jaypee Institute of Information Technology

Categories

Machine Learning Algorithm, Information Security, Mobile Deep Learning

Licence