Benign and Malicious Domains with Lexical and Active Features for Machine Learning-Based Detection

Name: Benign and Malicious Domains with Lexical and Active Features for Machine Learning-Based Detection
Creator: Guilherme Romanholo Bofo
Published: 2025-12-24T22:13:17.147Z
Keywords: Computer Network, Cybersecurity, Network Security

Romanholo Bofo, Guilherme; Gregório, João Rafael; Mauro Cansian, Adriano

doi:10.17632/xc45t9tp96.1

Published: 24 December 2025 | Version 1 | DOI: 10.17632/xc45t9tp96.1

Description

Dataset Composition:
- Examples: 200,000
- Benign: 100,000
- Malicious: 100,000
- Features: 28

The dataset is composed of stratified random samples of benign domains derived from the Majestic Million list and malicious domains obtained from the Hagezi TIF, both collected from 15 August 2025 to 15 September 2025. It was created during research on machine learning-based detection and classification of malicious domains, combining morphological analysis and active DNS collection techniques. DNS enrichment was performed through queries to a recursive server configured in a controlled laboratory environment, enabling large-scale bulk domain resolution.

The extracted features are organized into five feature categories, plus the classification target:

Lexical Features (11 features): extracted directly from the domain name:
- domain: Domain name string;
- length: Character length of domain name;
- entropy: Shannon entropy of the domain name;
- dash_count: Number of dashes;
- dot_count: Number of dots/separators;
- vowel_count: Number of vowels;
- number_count: Number of digits;
- consonant_count: Number of consonants;
- vowel_consec: Consecutive vowels count;
- number_consec: Consecutive numbers count;
- consonant_consec: Consecutive consonants count;

DNS Records Features (5 features): information from active DNS queries:
- a_count: Address (A) records count;
- aaaa_count: Quad-A (AAAA) records count;
- ns_count: Name Server (NS) records count;
- mx_count: Mail Exchange (MX) records count;
- cname_count: Canonical Name (CNAME) records count;

SOA (Start of Authority) Features (5 features): DNS SOA record parameters:
- soa_retry: Retry parameter;
- soa_refresh: Refresh parameter;
- soa_minimum: Minimum parameter;
- soa_expire_length: Expiration time length;
- soa_serial_length: Serial number length;

TTL (Time-to-Live) Based Features (4 features): TTL statistics for A, CNAME, MX and AAAA records:
- ttl_a_cname_avg: Average TTL across A and CNAME records;
- ttl_a_cname_std: Standard deviation of TTL across A and CNAME records;
- ttl_mx_avg: Average TTL for MX records;
- ttl_aaaa_avg: Average TTL for AAAA records;

Geolocation and Network Features (2 features): IP-based analysis:
- num_ips: Number of IP addresses (IPv4 and IPv6) associated with domain;
- num_countries: Geographic diversity (number of countries) associated with domain;

Classification Target (1 feature):
- malicious: Binary label (0 = benign, 1 = malicious)

The dataset is a resource for training and evaluating machine learning models for binary domain classification (benign vs. malicious). It is suited to deep learning architectures, ensemble methods, and traditional supervised learning algorithms.
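
As an illustration of the lexical category described above, the following Python sketch shows one way such features could be computed from a domain string. It is not the authors' extraction code. Several choices are assumptions that the description does not specify: entropy is computed in bits over every character of the full domain (dots included), the `*_consec` features are taken to be the length of the longest run, vowels are limited to `a e i o u`, and the helper names `shannon_entropy`, `longest_run`, and `lexical_features` are hypothetical.

```python
import math
import string
from collections import Counter

# Assumption: only ASCII letters and digits are classified; other characters
# (dashes, dots, non-ASCII) count toward length and entropy only.
VOWELS = set("aeiou")
CONSONANTS = set(string.ascii_lowercase) - VOWELS
DIGITS = set(string.digits)


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def longest_run(text: str, charset: set) -> int:
    """Length of the longest run of consecutive characters drawn from charset."""
    best = current = 0
    for ch in text:
        current = current + 1 if ch in charset else 0
        best = max(best, current)
    return best


def lexical_features(domain: str) -> dict:
    """Compute the lexical columns for one domain (assumed definitions)."""
    name = domain.lower()
    return {
        "domain": name,
        "length": len(name),
        "entropy": shannon_entropy(name),
        "dash_count": name.count("-"),
        "dot_count": name.count("."),
        "vowel_count": sum(ch in VOWELS for ch in name),
        "number_count": sum(ch in DIGITS for ch in name),
        "consonant_count": sum(ch in CONSONANTS for ch in name),
        # Assumption: "consecutive ... count" means the longest run length.
        "vowel_consec": longest_run(name, VOWELS),
        "number_consec": longest_run(name, DIGITS),
        "consonant_consec": longest_run(name, CONSONANTS),
    }


if __name__ == "__main__":
    for key, value in lexical_features("example-123.com").items():
        print(f"{key}: {value}")
```

If the published columns were derived with different conventions (for example, entropy over the name without its top-level domain, or run counts rather than longest runs), values computed this way will not match the dataset and the sketch would need adjusting.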

Institutions

Universidade Estadual Paulista Julio de Mesquita Filho

Funders

Fundação para o Desenvolvimento da UNESP
Universidade Estadual Paulista (Unesp)
Brazil
