Phishing Dataset for Machine Learning: Feature Evaluation

Published: 24 Mar 2018 | Version 1 | DOI: 10.17632/h3cgnj8hft.1
Contributor(s): Tan, Choon Lin

Description of this data

This dataset contains 48 features extracted from 5000 phishing webpages and 5000 legitimate webpages, which were downloaded from January to May 2015 and from May to June 2017. The features were extracted with an improved technique that uses a browser automation framework (Selenium WebDriver), which is more precise and robust than parsing approaches based on regular expressions. The dataset is WEKA-ready.
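To illustrate why structure-aware extraction tends to be more robust than regular expressions, the following minimal sketch collects link targets from a small HTML snippet in two ways. It is not the author's extraction code: Python's standard-library html.parser is used here only as a lightweight stand-in for a browser DOM (the dataset itself was built with Selenium WebDriver), and the sample HTML, function names and URLs are hypothetical.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page: one commented-out link and two live links,
# one of which uses upper-case markup and single quotes.
SAMPLE_HTML = """
<html><body>
<!-- <a href="http://old.example/login">old link</a> -->
<a href="http://example.com/home">Home</a>
<A HREF='http://example.com/about'>About</A>
</body></html>
"""


def links_by_regex(html):
    """Naive pattern: lower-case tag and double-quoted href only."""
    return re.findall(r'<a href="([^"]*)"', html)


class LinkCollector(HTMLParser):
    """Collects href values of <a> start tags seen by the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names and skips
        # markup that sits inside HTML comments.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def links_by_parser(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


regex_links = links_by_regex(SAMPLE_HTML)
parsed_links = links_by_parser(SAMPLE_HTML)
```

In this sketch the regular expression picks up the commented-out link and misses the upper-case one, while the parser returns only the two live links. A real browser goes further by also executing scripts and applying styles before features are read.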

Phishing webpage source: PhishTank, OpenPhish
Legitimate webpage source: Alexa, Common Crawl

Anti-phishing researchers and experts may find this dataset useful for phishing features analysis, conducting rapid proof of concept experiments or benchmarking phishing classification models.

Steps to reproduce

The complete HTML documents and their related resources (e.g., images, CSS, JavaScript) were downloaded using GNU Wget and a Python script to ensure proper offline rendering in the browser. To automate feature extraction, Selenium WebDriver and Python scripts were used to direct the browser to load each webpage, render its content, extract the feature values, and save them to text files. The text files were later processed into a single file in WEKA’s Attribute-Relation File Format (ARFF).
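The final step above, turning extracted feature values into an ARFF file, can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's script: the function name to_arff, the relation name, the attribute names (NumLinks, Label) and the sample rows are all hypothetical, and the values are assumed to be plain numbers that need no quoting or escaping.

```python
def to_arff(relation, attributes, rows):
    """Render feature rows as ARFF text.

    attributes: list of (name, arff_type) pairs, for example
    ("NumLinks", "NUMERIC") or ("Label", "{0,1}") for a nominal attribute.
    rows: iterable of value sequences, one per webpage, in attribute order.
    """
    lines = [f"@RELATION {relation}", ""]
    for name, arff_type in attributes:
        lines.append(f"@ATTRIBUTE {name} {arff_type}")
    lines += ["", "@DATA"]
    for row in rows:
        if len(row) != len(attributes):
            raise ValueError("row length does not match attribute count")
        # Assumes values contain no commas, spaces or quotes.
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical example: two webpages, one numeric feature, one class label.
example_attributes = [("NumLinks", "NUMERIC"), ("Label", "{0,1}")]
example_rows = [(12, 1), (3, 0)]
arff_text = to_arff("phishing_features", example_attributes, example_rows)
```

Writing arff_text to a file with an .arff extension gives a file in the header-plus-data layout that WEKA reads. Real feature files may need quoting for string values, which this sketch leaves out.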

This data is associated with the following publication:

A new hybrid ensemble feature selection framework for machine learning-based phishing detection system

Published in: Information Sciences

Latest version

  • Version 1

    Cite this dataset

    Tan, Choon Lin (2018), “Phishing Dataset for Machine Learning: Feature Evaluation”, Mendeley Data, v1 http://dx.doi.org/10.17632/h3cgnj8hft.1

Statistics

Views: 2952
Downloads: 580

Institutions

Universiti Malaysia Sarawak Faculty of Computer Science and Information Technology

Categories

Data Mining, Machine Learning, Feature Selection, Information Security

Licence

CC BY 4.0

The files associated with this dataset are licensed under a Creative Commons Attribution 4.0 International licence.

What does this mean?

You can share, copy and modify this dataset so long as you give appropriate credit, provide a link to the CC BY licence, and indicate if changes were made. You may not do so in a way that suggests the rights holder has endorsed you or your use of the dataset. Further permission may be required for any content within the dataset that is identified as belonging to a third party.