Web2VecPhishingDataset
Description
The Web2VecPhishingPipeline Dataset is a collection of machine learning–ready datasets designed for research on phishing website detection. The data are generated using the Web2VecPhishingPipeline (https://github.com/edytafraszczak/Web2VecPhishingPipeline), which covers the entire workflow: phishing and legitimate website collection, data cleaning, Web2Vec-based feature extraction, integration of external datasets, and final artifact generation.

The dataset consists of four complementary artifacts, each intended for a different experimental setting:

- Web2VecFullSpace provides the most comprehensive website representation. It combines multiple feature groups, including URL lexical properties, DNS information, geolocation data, HTML and HTTP characteristics, Open PageRank, SSL-related attributes, and WHOIS-based features.
- Web2VecUrlSpace is designed for large-scale phishing detection based only on URL-level features.
- Web2VecUrlMergedSpace extends the URL-based representation by incorporating additional external datasets, increasing its coverage and diversity.
- Web2VecCommonSpace offers a reduced set of shared lexical features, making it suitable for cross-dataset experiments, interoperability studies, and benchmarking.

| Artifact | Rows | Columns | Features | Numeric features | Phishing samples | Legitimate samples |
| --- | --- | --- | --- | --- | --- | --- |
| Web2VecFullSpace | 496,850 | 236 | 233 | 182 | 195,699 | 301,151 |
| Web2VecUrlSpace | 5,666,129 | 100 | 98 | 97 | 4,666,129 | 1,000,000 |
| Web2VecUrlMergedSpace | 6,490,059 | 100 | 98 | 97 | 5,045,882 | 1,444,177 |
| Web2VecCommonSpace | 6,753,343 | 19 | 18 | 18 | 5,091,059 | 1,662,284 |
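The row and class counts reported for the four artifacts can be cross-checked and turned into class-balance figures, which is useful when planning sampling or class weighting. The following is a minimal sketch that uses only the numbers stated in this description; the dictionary layout and the helper name `phishing_share` are illustrative assumptions and are not part of the dataset or the pipeline.

```python
# Counts copied from the artifact summary in this description.
# The structure and names below are illustrative, not part of the dataset.
ARTIFACTS = {
    "Web2VecFullSpace": {"rows": 496_850, "phishing": 195_699, "legitimate": 301_151},
    "Web2VecUrlSpace": {"rows": 5_666_129, "phishing": 4_666_129, "legitimate": 1_000_000},
    "Web2VecUrlMergedSpace": {"rows": 6_490_059, "phishing": 5_045_882, "legitimate": 1_444_177},
    "Web2VecCommonSpace": {"rows": 6_753_343, "phishing": 5_091_059, "legitimate": 1_662_284},
}


def phishing_share(counts: dict) -> float:
    """Return the fraction of rows labeled as phishing."""
    return counts["phishing"] / counts["rows"]


for name, counts in ARTIFACTS.items():
    # The two class counts should add up to the reported row count.
    assert counts["phishing"] + counts["legitimate"] == counts["rows"], name
    print(f"{name}: {phishing_share(counts):.1%} phishing")
```

By these counts, phishing samples are the majority class in the three URL-based artifacts, while Web2VecFullSpace contains more legitimate than phishing samples, so experiments that move between artifacts may need to account for the difference in class balance.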
Institutions
- Military University of Technology in Warsaw (Mazovia, Warsaw)