Benign and Malicious Domains with Lexical and Active Features for Machine Learning-Based Detection
Description
Dataset Composition:
- Examples: 200,000
- Benign: 100,000
- Malicious: 100,000
- Features: 28

The dataset is composed of stratified random samples of benign domains drawn from the Majestic Million list and malicious domains obtained from the Hagezi TIF, both collected from August 15, 2025 to September 15, 2025.

This dataset was created during research on machine learning-based detection and classification of malicious domains, combining morphological analysis with active DNS collection techniques. DNS enrichment was performed through queries to a recursive server configured in a controlled laboratory environment, enabling large-scale bulk domain resolution.

The 28 columns are organized into five feature categories plus the classification target:

Lexical Features (11 features): Extracted directly from domain name analysis:
- domain: Domain name string;
- length: Character length of domain name;
- entropy: Shannon entropy of the domain name;
- dash_count: Number of dashes;
- dot_count: Number of dots/separators;
- vowel_count: Number of vowels;
- number_count: Number of digits;
- consonant_count: Number of consonants;
- vowel_consec: Consecutive vowels count;
- number_consec: Consecutive numbers count;
- consonant_consec: Consecutive consonants count;

DNS Records Features (5 features): Information from active DNS queries:
- a_count: Address (A) records count;
- aaaa_count: Quad-A (AAAA) records count;
- ns_count: Name Server (NS) records count;
- mx_count: Mail Exchange (MX) records count;
- cname_count: Canonical Name (CNAME) records count;

SOA (Start of Authority) Features (5 features): DNS SOA record parameters:
- soa_retry: Retry parameter;
- soa_refresh: Refresh parameter;
- soa_minimum: Minimum parameter;
- soa_expire_length: Expiration time length;
- soa_serial_length: Serial number length;

TTL (Time-to-Live) Based Features (4 features): Average and standard deviation of TTL values for A, CNAME, MX and AAAA records:
- ttl_a_cname_avg: Average TTL across A and CNAME records;
- ttl_a_cname_std: Standard deviation of TTL across A and CNAME records;
- ttl_mx_avg: Average TTL for MX records;
- ttl_aaaa_avg: Average TTL for AAAA records;

Geolocation and Network Features (2 features): IP-based analysis:
- num_ips: Number of IP addresses (IPv4 and IPv6) associated with domain;
- num_countries: Geographic diversity (number of countries) associated with domain;

Classification Target (1 feature):
- malicious: Binary label (0 = benign, 1 = malicious)

The dataset is class-balanced and intended for training and evaluating machine learning models for binary domain classification (benign vs. malicious), including deep learning architectures, ensemble methods, and traditional supervised learning algorithms.
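
The lexical features listed above can be approximated directly from a domain string. The following Python sketch is illustrative only and is not the extraction code used to build the dataset. It rests on several assumptions: entropy is computed in bits over the characters of the full domain string (dots included), the *_consec features are interpreted as the length of the longest run of that character class, "y" is treated as a consonant, and the function names (shannon_entropy, longest_run, lexical_features) are hypothetical.

```python
import math
from collections import Counter

VOWELS = set("aeiou")
DIGITS = set("0123456789")


def shannon_entropy(text: str) -> float:
    """Shannon entropy (bits per character) of the character distribution."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def longest_run(text: str, predicate) -> int:
    """Length of the longest run of consecutive characters matching predicate."""
    best = current = 0
    for ch in text:
        current = current + 1 if predicate(ch) else 0
        best = max(best, current)
    return best


def is_vowel(ch: str) -> bool:
    return ch in VOWELS


def is_digit(ch: str) -> bool:
    return ch in DIGITS


def is_consonant(ch: str) -> bool:
    # Assumption: any ASCII letter that is not a, e, i, o, u (so "y" counts).
    return ch.isascii() and ch.isalpha() and ch not in VOWELS


def lexical_features(domain: str) -> dict:
    """Compute the 11 lexical columns for one domain (assumed definitions)."""
    name = domain.lower().rstrip(".")  # normalize case, drop a trailing root dot
    return {
        "domain": name,
        "length": len(name),
        "entropy": shannon_entropy(name),
        "dash_count": name.count("-"),
        "dot_count": name.count("."),
        "vowel_count": sum(is_vowel(c) for c in name),
        "number_count": sum(is_digit(c) for c in name),
        "consonant_count": sum(is_consonant(c) for c in name),
        # Assumption: "consec" means the longest consecutive run, not a count of runs.
        "vowel_consec": longest_run(name, is_vowel),
        "number_consec": longest_run(name, is_digit),
        "consonant_consec": longest_run(name, is_consonant),
    }


if __name__ == "__main__":
    for key, value in lexical_features("my-shop24.example.com").items():
        print(f"{key}: {value}")
```

If the published values differ from what this sketch produces for the same domain, the likely cause is a different definition of the consecutive-character features or of the string over which entropy is computed, so any comparison should start from those two assumptions.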
Institutions
- Universidade Estadual Paulista Julio de Mesquita Filho