url phishing
Description
The dataset employed in this study is a large-scale, clustered phishing detection dataset designed to support machine learning (ML), deep learning (DL), and hybrid AI-based approaches for identifying phishing and malicious URLs. The specific dataset under consideration, referred to as the Cluster dataset, contains 147,292 samples, each corresponding to a unique URL instance. These instances represent both malicious (phishing) and benign (legitimate) URLs collected from multiple heterogeneous sources, which provides diversity in domain structure, hosting infrastructure, and attack sophistication.

The dataset is structured for binary classification, making it suitable for supervised learning. Each sample is described by 112 numerical features derived from URL strings, domain metadata, DNS records, and network-level observations. Because all features are numeric, no categorical encoding or tokenization is required before the data is passed to common ML and DL algorithms.

Class Labels and Distribution

The target variable is denoted as label and uses a binary encoding:
- Label = 1: phishing or malicious URL
- Label = 0: legitimate or benign URL

Out of the total 147,292 samples, the dataset includes:
- 61,294 malicious URLs (positive class)
- 85,998 benign URLs (negative class)

This distribution reflects a moderate class imbalance: benign URLs account for roughly 58% of the samples and malicious URLs for roughly 42%. Such imbalance is typical of real-world cybersecurity datasets, where legitimate traffic generally exceeds malicious activity. It makes the dataset useful for evaluating classifier robustness, precision-recall trade-offs, and cost-sensitive learning strategies.
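The class distribution above can be checked with a few lines of arithmetic. The sketch below uses only the counts reported in the description; the inverse-frequency "balanced" weighting is an illustrative assumption for cost-sensitive learning, not something the dataset prescribes.

```python
# Class counts as reported in the dataset description.
counts = {0: 85_998, 1: 61_294}  # 0 = benign, 1 = phishing

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Inverse-frequency weights: total / (n_classes * class_count).
# Assumption: this weighting scheme is one common choice, used here
# purely for illustration.
weights = {label: total / (len(counts) * n) for label, n in counts.items()}

# How many benign URLs there are per phishing URL.
imbalance_ratio = counts[0] / counts[1]

print(f"total samples: {total}")
for label in sorted(counts):
    print(f"label {label}: share={shares[label]:.3f}, weight={weights[label]:.3f}")
print(f"benign-to-phishing ratio: {imbalance_ratio:.2f}")
```

Weights of this form give the minority (phishing) class a larger per-sample weight, which is one way to account for the imbalance during training.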
URL-Centric Feature Design

URLs remain one of the most widely exploited vectors for phishing attacks, serving as entry points for credential theft, malware delivery, and social engineering campaigns. Modern phishing URLs often employ lexical obfuscation, domain impersonation, excessive parameterization, and short-lived infrastructure to evade detection. To address these challenges, the dataset emphasizes URL-based characteristics that capture both surface-level patterns and deeper structural cues associated with malicious intent. The selected features aim to balance interpretability, discriminative power, and computational efficiency, making them suitable for both traditional ML models and complex DL architectures.

Feature Composition and Categorization

The 112 features in the dataset can be broadly categorized into the following groups:
- Lexical and Character-Level Features
- Structural and Length-Based Features
- Directory and Parameter Analysis Features
- Domain and Host-Based Features
- Network and Infrastructure-Level Features
- Security and Certificate-Related Features
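To make the lexical, structural, and directory/parameter categories concrete, the following minimal sketch computes a handful of such features from a raw URL string. The feature names are taken from the reproduction steps of this dataset, but the exact definitions used here (for example, which part of the URL each count covers), the helper `extract_url_features`, and the email pattern are assumptions for illustration and may differ from the authors' extraction code.

```python
import re
from urllib.parse import urlsplit

# Assumption: a simple pattern for spotting an email address inside a URL.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_url_features(url: str) -> dict:
    """Compute a few URL-derived numeric features (hypothetical definitions)."""
    parts = urlsplit(url)
    domain = parts.hostname or ""
    return {
        # Lexical / character-level counts over the whole URL.
        "qty_dot_url": url.count("."),
        "qty_hyphen_url": url.count("-"),
        # Structural / length-based features.
        "length_url": len(url),
        "domain_length": len(domain),
        # Directory and parameter features, counted on path and query.
        "qty_slash_directory": parts.path.count("/"),
        "qty_questionmark_params": parts.query.count("?"),
        # Binary flag: 1 if something that looks like an email appears.
        "email_in_url": int(bool(EMAIL_RE.search(url))),
    }


if __name__ == "__main__":
    example = "http://secure-login.example.com/account/verify/index.php?user=a@b.com&next=home"
    for name, value in extract_url_features(example).items():
        print(f"{name}: {value}")
```

Network-level and certificate-related features (such as response time or TLS certificate status) require live lookups and are not covered by this sketch.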
Files
Steps to reproduce
Steps to Reproduce the Clustered Phishing URL Dataset

- Data Collection: Collect malicious URLs from public phishing intelligence sources and benign URLs from trusted web repositories. Merge all URLs into a unified dataset and remove duplicates.
- Label Normalization: Standardize labels into binary form, where 1 represents phishing/malicious URLs and 0 represents legitimate URLs. Remove samples with missing or invalid labels.
- Feature Extraction: Extract 112 numerical URL-based features, including lexical features (e.g., qty_dot_url, qty_hyphen_url), structural features (e.g., length_url, domain_length), directory and parameter features (e.g., qty_slash_directory, qty_questionmark_params), and network/security features (e.g., time_response, asn_ip, tls_ssl_certificate).
- Data Cleaning and Formatting: Handle missing values using safe numeric imputation, convert all attributes to numeric format, and remove corrupted or low-quality samples.
- Feature Optimization: Perform variance and correlation analysis to remove redundant and low-variance features. Retain 86 highly informative features such as email_in_url, time_domain_activation, and qty_redirects.
- Clustering: Apply unsupervised clustering on the optimized feature set to group URLs by behavioral similarity, capturing distinct phishing strategies.
- Dataset Finalization: Export cluster-wise datasets (e.g., cluster_2_merged.csv) containing fully numeric features and binary labels. The final dataset includes 147,292 samples with a moderate class imbalance.
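The feature optimization step described above (variance and correlation analysis) could be sketched as follows. This is a minimal, standard-library illustration on toy values, not the authors' procedure: the thresholds, the helper names `pearson` and `select_features`, and the greedy keep-first strategy for correlated pairs are all assumptions, and the toy numbers are invented for the example.

```python
from math import sqrt
from statistics import pvariance


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def select_features(columns, var_threshold=1e-3, corr_threshold=0.95):
    """Drop low-variance columns, then drop columns highly correlated
    with an already-selected column (thresholds are assumed values)."""
    # Step 1: variance filter.
    kept = [name for name, vals in columns.items() if pvariance(vals) > var_threshold]
    # Step 2: greedy correlation filter, keeping the first column of each correlated group.
    selected = []
    for name in kept:
        if all(abs(pearson(columns[name], columns[s])) < corr_threshold for s in selected):
            selected.append(name)
    return selected


# Toy values for four of the dataset's feature names (invented for illustration).
toy = {
    "qty_dot_url": [2, 5, 1, 4, 3],
    "length_url": [20, 35, 50, 80, 120],
    "domain_length": [10, 17.5, 25, 40, 60],  # exactly half of length_url in this toy data
    "qty_redirects": [0, 0, 0, 0, 0],         # constant in this toy data
}

selected = select_features(toy)
print(selected)
```

In this toy data, the constant column is removed by the variance filter and the column that is an exact multiple of another is removed by the correlation filter. On the real dataset, the thresholds would need tuning to arrive at the 86 retained features reported above.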
Institutions
- MIT Art, Design and Technology University, Maharashtra, Pune