IDS2025 (Balanced Intrusion Detection Evaluation Dataset)
Description
This dataset, titled IDS2025: Balanced Intrusion Detection Evaluation Dataset, is an enhanced and refined version of the original CICIDS2017 dataset, designed specifically for research and development in Intrusion Detection Systems (IDS). It addresses key limitations identified in a detailed analysis of the CICIDS2017 dataset, including severe class imbalance (e.g., Benign traffic dominating at 83.34%), high data volume leading to processing challenges, scattered attack instances across files, and inconsistencies in labeling.

Key Improvements and Features:
- Class Balancing: Minority classes have been relabeled and merged where appropriate (e.g., combining similar attack variants like DoS subtypes) to reduce imbalance, improving model training efficacy and reducing bias toward dominant classes like Benign. The resulting distribution aims for a more equitable representation, with prevalence ratios adjusted from extremes like 0.0009% for rare attacks to more balanced levels.
- Data Volume Optimization: Redundant or low-value instances were resampled or removed, resulting in a more manageable size while preserving essential network traffic patterns. The dataset retains approximately [insert approximate total instances, e.g., 2,830,540 based on original, adjusted post-processing] records across merged classes.
- Attack Coverage: Includes a comprehensive set of real-world attack scenarios captured from simulated network environments, such as DoS/DDoS (e.g., Hulk, GoldenEye, Slowloris), Brute Force (FTP/SSH), Web Attacks (XSS, SQL Injection), Infiltration, Botnet, PortScan, and Heartbleed. Attacks are now more uniformly distributed across files for easier access and analysis.
- Features: Comprises 80 network flow features (e.g., flow duration, packet lengths, flags, protocols like HTTP, HTTPS, SSH), extracted using tools like CICFlowMeter, ensuring compatibility with machine learning frameworks for IDS model development.
- File Structure: Organized into daily CSV files (e.g., Monday-WorkingHours.csv to Friday-WorkingHours.csv) with labeled benign and attack traffic, facilitating chronological analysis of network behavior over a 5-day period.

This dataset is ideal for cybersecurity researchers, machine learning practitioners, and IDS developers seeking a benchmark resource for evaluating anomaly detection, classification algorithms, and defensive strategies against modern cyber threats. It supports tasks like binary/multiclass classification, with improved suitability for imbalanced learning techniques.

Cite:
- Panigrahi, R., & Borah, S. (2018). A detailed analysis of CICIDS2017 dataset for designing Intrusion Detection Systems. International Journal of Engineering & Technology, 7(3.24), 479-482.
- Sharafaldin, I., Lashkari, A. H., & Ghorbani, A. A. (2018). Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization. 4th International Conference on Information Systems Security and Privacy (ICISSP), Portugal, January 2018.
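The description refers to class prevalence and to binary versus multiclass use of the labels. As a rough illustration, the sketch below computes per-class prevalence, a simple majority-to-minority imbalance ratio, and a binary view of the labels. The function names are hypothetical, the toy label list is invented for the example and is not taken from the dataset, and treating "Normal" as the benign class name is an assumption based on the label mapping in the reproduction steps.

```python
from collections import Counter


def prevalence_percent(labels):
    """Return each class's share of all records, in percent."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.most_common()}


def imbalance_ratio(labels):
    """Ratio of the largest class count to the smallest class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def to_binary(labels, benign_label="Normal"):
    """Collapse multiclass labels into Benign vs. Attack (assumed benign name)."""
    return ["Benign" if label == benign_label else "Attack" for label in labels]


if __name__ == "__main__":
    # Invented toy labels, for illustration only.
    toy = ["Normal"] * 6 + ["DoS/DDoS"] * 3 + ["Botnet"]
    print(prevalence_percent(toy))
    print(imbalance_ratio(toy))
    print(Counter(to_binary(toy)))
```

Running the same functions on the original labels and on the merged labels is one way to check how far the imbalance has actually been reduced.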
Files
Steps to reproduce
Algorithm: Creation of the Improved CICIDS2017 Dataset
Input: Original CICIDS2017 daily CSV files
Output: Cleaned, merged-label dataset with optional size optimization

Step 1: Collect and Combine
- Load all CSV files from the CICIDS2017 dataset, including all working hour and attack-specific files.
- Concatenate all files into one consolidated dataset.
- Store the merged dataset for processing.

Step 2: Clean the Data
- Identify and remove rows with missing or undefined class labels.
- Remove rows with missing feature values.
- Detect and delete duplicate or redundant rows.
- Save the cleaned dataset.

Step 3: Relabel and Merge Attack Classes
(a) Analyze the class frequency distribution to understand imbalance. Replace original attack classes with broader merged categories for improved balance using the following mapping:
- Benign → Normal
- Bot → Botnet
- FTP-Patator and SSH-Patator → Brute Force
- DoS Hulk, DoS GoldenEye, DoS slowloris, DoS Slowhttptest, and DDoS → DoS/DDoS
- Infiltration → Infiltration
- PortScan → PortScan
- Web Attack – Brute Force, Web Attack – XSS, and Web Attack – SQL Injection → Web Attack
(b) Update all instance labels based on the mapping.
(c) Verify the final distribution to ensure an improved balance across classes.

Step 4: Optimize Dataset Size (Optional)
(a) Assess whether the cleaned and relabeled dataset is manageable for intended machine learning workflows.
(b) If needed, apply stratified sampling to reduce size while maintaining the updated class proportions.
(c) Store the optimized dataset.
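The four steps above can be sketched in Python using only the standard library. This is a minimal illustration under stated assumptions, not the authors' actual processing code. All function names are hypothetical. The label column is assumed to be called "Label" once header whitespace is stripped, and a missing value is assumed to appear as an empty field or the string "NaN". Because the exact label spellings in the CSV files are not given here, matching is case-insensitive and Web Attack variants are matched by prefix. Labels absent from the mapping (for example Heartbleed) are left unchanged, which is also an assumption. Writing the intermediate and final datasets back to disk is omitted.

```python
import csv
import random
from collections import Counter, defaultdict

LABEL_COLUMN = "Label"  # assumed column name after stripping header whitespace

# Step 3 mapping, keyed in lower case so that matching is case-insensitive.
LABEL_MAP = {
    "benign": "Normal",
    "bot": "Botnet",
    "ftp-patator": "Brute Force",
    "ssh-patator": "Brute Force",
    "dos hulk": "DoS/DDoS",
    "dos goldeneye": "DoS/DDoS",
    "dos slowloris": "DoS/DDoS",
    "dos slowhttptest": "DoS/DDoS",
    "ddos": "DoS/DDoS",
    "infiltration": "Infiltration",
    "portscan": "PortScan",
}


def load_rows(paths):
    """Step 1: read every CSV file and concatenate the rows into one list."""
    rows = []
    for path in paths:
        with open(path, newline="", encoding="utf-8", errors="replace") as handle:
            reader = csv.reader(handle)
            header = [name.strip() for name in next(reader, [])]
            rows.extend(dict(zip(header, values)) for values in reader)
    return rows


def clean_rows(rows):
    """Step 2: drop rows with missing labels or features, then exact duplicates."""
    seen, cleaned = set(), []
    for row in rows:
        if LABEL_COLUMN not in row:
            continue  # no class label at all
        values = [str(value).strip() for value in row.values()]
        if any(value == "" or value.lower() == "nan" for value in values):
            continue  # missing label or feature value (assumed encoding)
        key = tuple(row.items())
        if key in seen:
            continue  # exact duplicate of an earlier row
        seen.add(key)
        cleaned.append(row)
    return cleaned


def merge_label(raw):
    """Map one original label to its merged category (Step 3 mapping)."""
    key = " ".join(raw.split()).lower()
    if key.startswith("web attack"):
        return "Web Attack"
    return LABEL_MAP.get(key, raw.strip())  # unmapped labels stay unchanged


def relabel(rows):
    """Step 3(b): return copies of the rows with merged labels."""
    return [{**row, LABEL_COLUMN: merge_label(row[LABEL_COLUMN])} for row in rows]


def class_distribution(rows):
    """Step 3(a)/(c): fraction of rows per class."""
    counts = Counter(row[LABEL_COLUMN] for row in rows)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def stratified_sample(rows, fraction, seed=0):
    """Step 4(b): keep the same fraction of every class, at least one row each."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[LABEL_COLUMN]].append(row)
    rng = random.Random(seed)
    sample = []
    for members in by_class.values():
        keep = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, keep))
    return sample
```

A run over the daily files might chain the steps as `rows = relabel(clean_rows(load_rows(paths)))`, compare `class_distribution` before and after relabeling, and then call `stratified_sample(rows, fraction)` only if the result is still too large. Keeping at least one row per class is a design choice of this sketch so that very rare classes are not lost during sampling. At the scale of the full dataset, a DataFrame library would likely be more practical than plain lists of dictionaries.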
Institutions
- Sikkim Manipal Institute of Technology - Majitar
- Amrita Vishwa Vidyapeetham