Web page phishing detection

Published: 28-09-2020 | Version 2 | DOI: 10.17632/c2gw7fy2j4.2
Abdelhakim Hannousse,
Salima Yahiouche


The provided dataset includes 11430 URLs with 87 extracted features. The dataset is designed to be used as a benchmark for machine-learning-based phishing detection systems. Features belong to three classes: 56 are extracted from the structure and syntax of URLs, 24 from the content of their corresponding pages, and 7 by querying external services. The dataset is balanced: it contains exactly 50% phishing and 50% legitimate URLs. Along with the dataset, we provide the Python scripts used to extract the features, for potential replication or extension.

dataset_A: contains a list of URLs together with their DOM tree objects, which can be used for replication and for experimenting with new URL- and content-based features, overcoming the short lifetime of phishing web pages.

dataset_B: contains the extracted feature values, which can be used directly as input to classifiers for examination. Note that the data in this dataset are indexed by URL, so the index needs to be removed before experimentation.

The datasets were constructed in May 2020. Due to the large size of dataset_A, only a sample of it is provided; it will be divided into sample files and uploaded one by one. If a full copy is urgently needed, please contact the author directly at: hannousse.abdelhakim@univ-guelma.dz
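The note above about removing the URL index before experimentation can be illustrated with a minimal sketch. It uses only the Python standard library and a tiny made-up CSV in place of the real file. The column names `url`, `status`, `length_url` and `nb_dots`, and the helper `load_features`, are assumptions made for illustration and may not match the actual headers of dataset_B. The sketch also assumes that every feature value is numeric.

```python
import csv
import io


def load_features(csv_file, url_column="url", label_column="status"):
    """Read a feature table and separate features from labels.

    The URL column (the index mentioned in the description) is dropped,
    so only feature values reach the classifier. The column names are
    hypothetical defaults; adjust them to the real file.
    """
    reader = csv.DictReader(csv_file)
    # Keep every column except the URL index and the class label.
    feature_names = [
        name for name in reader.fieldnames
        if name not in (url_column, label_column)
    ]
    X, y = [], []
    for row in reader:
        # Assumption: all feature values parse as numbers.
        X.append([float(row[name]) for name in feature_names])
        y.append(row[label_column])
    return feature_names, X, y


# Made-up two-row table standing in for dataset_B.
toy_csv = io.StringIO(
    "url,length_url,nb_dots,status\n"
    "http://a.example/login,22,1,phishing\n"
    "https://b.example/,18,1,legitimate\n"
)

feature_names, X, y = load_features(toy_csv)
print(feature_names)
print(X)
print(y)
```

With the real file, `toy_csv` would be replaced by an opened file handle, for example `open(path, newline="")`, where `path` points to the downloaded copy of dataset_B.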