Datasets Comparison
Version 1
StealthPhisher
Description
The StealthPhisher dataset is a large, diverse, and up-to-date resource built to address the evolving nature of phishing attacks. It contains 336,749 records, comprising 160,943 legitimate URLs and 175,806 phishing URLs, drawn from sources such as PhishTank, spam email repositories, and user submissions. The dataset reflects recent phishing tactics, which makes it well suited to training AI models to detect modern threats.
Key features include URL-based attributes (length, TLD type, IP presence), statistical metrics (Shannon Entropy, Kolmogorov Complexity, Fractal Dimension), and HTML/interaction-based data (popups, redirects, forms). These features provide comprehensive insights into phishing behaviors, enabling precise detection.
Designed to capture real-world scenarios, the dataset equips AI models with the ability to identify both traditional phishing strategies and advanced, evolving attacks. Its scale and focus on recent trends make it an essential tool for advancing AI-driven cybersecurity solutions.
Categories
Cybersecurity, Machine Learning, Deep Learning, Cyber Attack
Licence
Creative Commons Attribution 4.0 International
Version 2
StealthPhisher Phishing Attack Dataset
Description
The StealthPhisher Phishing Attack Dataset, generated at the Cybersecurity Lab, GLA University, Mathura, is a large, diverse, and recent collection developed to address the evolving nature of phishing attacks. It comprises 336,749 records, including 160,943 legitimate URLs and 175,806 phishing URLs, collected from reliable sources such as PhishTank. Because it reflects recent phishing tactics, the dataset serves as a valuable resource for training and evaluating AI-based phishing detection systems.
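As a quick sanity check on the figures above, the following sketch recomputes the total and the class shares from the two stated counts. The function name `class_shares` is hypothetical and used only for illustration; nothing here reads the dataset itself. With these counts, phishing URLs make up about 52% of the records, so the two classes are close to balanced.

```python
def class_shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each class's fraction of the total record count."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Counts as stated in the dataset description.
counts = {"legitimate": 160_943, "phishing": 175_806}

total_records = sum(counts.values())
shares = class_shares(counts)

print(f"total records: {total_records:,}")
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```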
Key features include URL-based attributes (e.g., length, TLD type, IP presence), statistical metrics (e.g., Shannon Entropy, Kolmogorov Complexity, Fractal Dimension), and HTML/interaction-based features (e.g., popups, redirects, forms). These multidimensional attributes provide comprehensive insights into phishing behavior, enabling accurate and robust threat detection. Designed to capture real-world scenarios, the dataset equips AI models to recognize both traditional and emerging phishing strategies effectively.
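To illustrate the kind of URL-based and statistical attributes listed above, here is a minimal sketch that derives URL length, TLD, IP presence, and character-level Shannon entropy from a raw URL string. It is an assumption-laden illustration, not the authors' extraction pipeline: the function names, the returned field names, and the naive "last host label" TLD rule are all hypothetical, and the dataset's actual column names and definitions may differ (see the cited article for the real methodology).

```python
import ipaddress
import math
from collections import Counter
from urllib.parse import urlsplit


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def host_is_ip(host: str) -> bool:
    """True if the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def url_features(url: str) -> dict:
    """Illustrative feature dict; field names are hypothetical."""
    host = urlsplit(url).hostname or ""
    is_ip = host_is_ip(host)
    # Naive TLD rule (assumption): the last dot-separated label of the host.
    tld = "" if is_ip or "." not in host else host.rsplit(".", 1)[1]
    return {
        "url_length": len(url),
        "tld": tld,
        "has_ip_host": is_ip,
        "shannon_entropy": shannon_entropy(url),
    }


print(url_features("https://example.com/login"))
print(url_features("http://192.168.0.1/a"))
```

The HTML and interaction-based features (popups, redirects, forms) require fetching and parsing page content, so they are outside the scope of this sketch.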
This dataset was generated as part of the research presented in the article “StealthPhisher: A Defensive Framework against Phishing Attack using Hybrid Deep Learning and GenAI,” published in Expert Systems with Applications (https://doi.org/10.1016/j.eswa.2025.130205). Researchers who use this dataset are kindly requested to cite this article.
Steps to reproduce
Please refer to the detailed methodology described in the article (https://doi.org/10.1016/j.eswa.2025.130205).
Institutions
GLA University Institute of Engineering and Technology
Categories
Cybersecurity, Machine Learning, Deep Learning, Cyber Attack
Licence
Creative Commons Attribution 4.0 International