Synthetic Network Traffic Dataset for Anomaly Detection Using Machine Learning in SDN Environments
Description
This dataset contains 10,005 synthetic network flow records designed to support research in network anomaly detection, particularly in Software-Defined Networking (SDN) environments. The core hypothesis behind the dataset's creation is that statistical analysis of flow-level features (connection duration, packet count, byte transmission, protocol usage, and port behavior) can effectively distinguish between normal and malicious traffic patterns. The dataset simulates realistic traffic scenarios, representing both benign network flows and various types of malicious activity, including DDoS attacks, port scanning, data exfiltration, brute-force authentication attempts, and protocol misuse. Each record includes temporal metadata, IP addresses, port numbers, protocol types, connection duration, packet counts, and byte statistics.

Integrating machine learning methods, particularly the Random Forest algorithm, into SDN makes it possible to build an efficient and adaptive system for detecting DDoS attacks. Such a system aims for high accuracy in network traffic classification and timely anomaly detection while minimizing the impact of false positives on network performance.

Characteristics of Normal Traffic:
• Primarily TCP traffic (70%) with standard HTTP/HTTPS, SSH, and DNS communications
• Connection durations typically range from 0.1 to 5.0 seconds
• Packet counts range from 5 to 100 per connection
• Balanced byte transmission patterns reflecting typical client-server interactions
• Use of standard service ports (80, 443, 22, 53, etc.)
Characteristics of Anomalous Traffic:
• Very short connection durations (0.01–0.1 s), minimal packet counts (1–5), and low byte transmissions, typical of port scanning
• Extremely high packet counts (1,000–10,000) with disproportionate ratios of sent to received bytes
• Extended connection durations (10–60 s) and high outbound byte transmission (100 KB–1 MB) during DDoS or data exfiltration attacks
• Repeated connections to authentication ports (SSH on 22, RDP on 3389) with moderate duration and packet count during brute-force attacks
• Use of uncommon protocols (GRE, ESP, AH) with non-standard port combinations, indicating protocol anomalies

Dataset Applications:
• Training supervised machine learning models for anomaly detection in network traffic
• Evaluating classification algorithm performance on imbalanced network security datasets
• Testing feature engineering techniques for traffic analysis
• Educational use in cybersecurity and network monitoring courses

Label Interpretation:
• Label 0 represents normal, benign network traffic
• Label 1 represents anomalous, potentially malicious network traffic
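The traffic characteristics above are given as ranges over flow-level features. As a rough illustration of how such features can separate the two labels, the sketch below computes two derived quantities, bytes per packet and the sent-to-received byte ratio, for three hand-written example flows. The field names (duration, packets, bytes_sent, bytes_received, label) and both formulas are assumptions made for this example; the dataset's actual column names and feature definitions may differ. The example values are invented to fall inside the ranges described above and are not taken from the dataset.

```python
from dataclasses import dataclass


@dataclass
class Flow:
    # Hypothetical field names; the real CSV columns may differ.
    duration: float       # connection duration in seconds
    packets: int          # packets in the flow
    bytes_sent: int       # outbound bytes
    bytes_received: int   # inbound bytes
    label: int            # 0 = normal, 1 = anomalous


def bytes_per_packet(flow: Flow) -> float:
    """Average bytes per packet over both directions (assumed definition)."""
    return (flow.bytes_sent + flow.bytes_received) / max(flow.packets, 1)


def sent_received_ratio(flow: Flow) -> float:
    """Outbound-to-inbound byte ratio; the +1 avoids division by zero."""
    return flow.bytes_sent / (flow.bytes_received + 1)


# Illustrative values only, chosen inside the ranges described above.
examples = {
    "normal": Flow(1.2, 40, 6_000, 8_000, label=0),
    "port_scan": Flow(0.05, 1, 60, 0, label=1),
    "exfiltration": Flow(45.0, 800, 500_000, 2_000, label=1),
}

for name, flow in examples.items():
    print(
        f"{name:13s} label={flow.label} "
        f"bytes/packet={bytes_per_packet(flow):8.1f} "
        f"sent/received={sent_received_ratio(flow):8.2f}"
    )
```

In this toy example the exfiltration flow stands out through its sent-to-received ratio and the port scan through its single packet and very short duration, which is the kind of separation the dataset's core hypothesis relies on.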
Files
Steps to reproduce
This dataset was generated using a systematic synthetic data generation approach implemented in Python, designed to model realistic network traffic patterns while allowing reproducible and controllable anomaly injection. The methodology combines statistical modeling of normal network behavior with targeted generation of specific attack patterns.

Software Environment
• Python 3.10
• Third-party libraries: pandas, numpy
• Standard-library modules: ipaddress, datetime, os, time, logging, argparse
• OS compatibility: Windows and Unix-like systems

Install dependencies:
    pip install pandas numpy

Traffic Generation Protocol

Phase 1: Normal Traffic
• IP Ranges:
  o Source: 192.168.x.x
  o Destination: 10.0.x.x
  o Converted to integers using ipaddress.IPv4Address
• Protocol Distribution:
  o TCP (70%), UDP (25%), ICMP (5%)
• Ports:
  o Source: random ephemeral (1024–65535)
  o Destination: weighted among common ports (80, 443, 22, etc.)
• Other Features:
  o Duration: 0.1–5.0 sec
  o Packets: 5–100
  o Bytes sent/received: 100–15000
  o Timestamps: generated in 10-second intervals

Phase 2: Anomalous Traffic
Five anomaly types were modeled:
1. Port Scans: short duration, single packet
2. DDoS: high packet count (1K–10K), unbalanced data flow
3. Data Exfiltration: long sessions, high outbound bytes (100 KB–1 MB)
4. Brute-force: repeated access attempts to SSH/Telnet/RDP
5. Protocol Anomalies: use of GRE, ESP, AH with uncommon ports

Phase 3: Specific Anomaly Injection
Five hardcoded scenarios simulate known threats:
• DDoS from external IP (45.12.34.56)
• Internal port scanning
• Large data exfiltration
• SSH brute-force attempt
• Abnormal GRE tunnel usage

Run Instructions
Default command:
    python Create_dataset.py --normal 8000 --anomalous 2000 --output network_traffic.csv

Arguments:
• --normal: number of normal samples (default: 8000)
• --anomalous: number of anomalous samples (default: 2000)
• --output: output file name
• --no-specific: exclude the specific anomaly scenarios (Phase 3)

Data Pipeline
1. Logging initialization (./log directory)
2. Normal traffic generation using probabilistic models
3. Anomaly creation with predefined statistical profiles
4. Injection of specific attack scenarios
5. Data shuffling to eliminate order bias
6. Feature engineering (e.g., bytes-per-packet)
7. CSV export with pandas

Validation & QA
• Timestamped logging
• Statistical checks of distributions
• IP address validation (ipaddress module)
• Automatic directory creation
• Error handling for I/O

Post-generation checks:
• Total record count
• Class balance (normal vs. anomaly)
• Sample inspection logs
• Generation completion timestamp

This setup allows consistent, customizable dataset generation for anomaly detection benchmarking and reproducible experimentation.
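The normal-traffic phase of the generation protocol can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the code of Create_dataset.py: the record keys, the destination-port weights, the fixed start time, the seed handling, and the reading of "10-second intervals" as one flow every 10 seconds are all hypothetical choices made for this example. Only the value ranges, the protocol shares, and the use of ipaddress.IPv4Address come from the description above.

```python
import ipaddress
import random
from datetime import datetime, timedelta

# Assumed weights; the source only says "weighted among common ports".
COMMON_PORTS = [80, 443, 22, 53]
PORT_WEIGHTS = [0.4, 0.4, 0.1, 0.1]


def ip_to_int(address: str) -> int:
    """Convert a dotted-quad IPv4 address to its integer form."""
    return int(ipaddress.IPv4Address(address))


def normal_flow(rng: random.Random, timestamp: datetime) -> dict:
    """Draw one benign flow using the documented Phase 1 ranges."""
    src = f"192.168.{rng.randint(0, 255)}.{rng.randint(1, 254)}"
    dst = f"10.0.{rng.randint(0, 255)}.{rng.randint(1, 254)}"
    return {
        "timestamp": timestamp.isoformat(),
        "src_ip": ip_to_int(src),
        "dst_ip": ip_to_int(dst),
        "src_port": rng.randint(1024, 65535),  # ephemeral range
        "dst_port": rng.choices(COMMON_PORTS, weights=PORT_WEIGHTS)[0],
        "protocol": rng.choices(["TCP", "UDP", "ICMP"], weights=[70, 25, 5])[0],
        "duration": round(rng.uniform(0.1, 5.0), 3),  # seconds
        "packets": rng.randint(5, 100),
        "bytes_sent": rng.randint(100, 15000),
        "bytes_received": rng.randint(100, 15000),
        "label": 0,  # 0 = normal
    }


def generate_normal(n: int, seed: int = 0) -> list[dict]:
    """Generate n benign flows, one every 10 seconds (assumed spacing)."""
    rng = random.Random(seed)
    start = datetime(2024, 1, 1)  # arbitrary start time for the sketch
    return [normal_flow(rng, start + timedelta(seconds=10 * i)) for i in range(n)]


flows = generate_normal(1000)
tcp_share = sum(f["protocol"] == "TCP" for f in flows) / len(flows)
print(f"{len(flows)} flows generated, TCP share = {tcp_share:.2f}")
```

Seeding the random generator keeps a run repeatable, which is one simple way to get the reproducibility the methodology aims for. With the documented defaults, 8,000 normal plus 2,000 anomalous samples plus the five specific scenarios add up to 10,005 records, which matches the dataset size given in the description.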
Institutions
- Harkivskij nacional'nij ekonomicnij universitet imeni Semena Kuzneca