Processed_ADFA_LD dataset

Published: 30 January 2026 | Version 1 | DOI: 10.17632/853sxvpx79.1
Contributor:
Ragavan K

Description

The Processed_ADFA_LD dataset is a preprocessed version of the ADFA Linux Dataset (ADFA-LD), which was originally developed by the Australian Defence Force Academy (ADFA) for research in host-based intrusion detection systems (HIDS). The dataset focuses on detecting malicious behavior at the operating system level by analyzing system call traces generated by Linux processes.

Original ADFA-LD Dataset Overview

The original ADFA-LD dataset contains system call sequences collected from a Linux environment under two main conditions:

- Normal behavior: Legitimate system activities generated by standard user operations.
- Attack behavior: System call traces produced during various cyber-attacks, including privilege escalation, denial-of-service, and remote exploits.

Unlike older datasets (e.g., KDD Cup 99), ADFA-LD was designed to reflect modern Linux systems and more realistic attack scenarios.

Processing and Transformation

The Processed_ADFA_LD dataset refers to the cleaned and transformed version of the original dataset, prepared to make it suitable for machine learning and deep learning models. Common preprocessing steps include:

- System call encoding (e.g., integer mapping or frequency-based encoding)
- Sequence normalization or padding to handle variable-length traces
- Feature extraction, such as:
  - n-grams of system calls
  - statistical features (frequency, entropy, transition probabilities)
- Labeling, typically:
  - 0 → Normal behavior
  - 1 → Attack behavior
- Train-test splitting for supervised learning experiments

Data Characteristics

- Data type: Sequential / time-series data
- Features: Encoded system call sequences or derived statistical features
- Labels: Binary (Normal vs. Attack) or multiclass (depending on processing)
- Domain: Cybersecurity, Host-Based Intrusion Detection
- Operating System: Linux

Applications

The Processed_ADFA_LD dataset is widely used for:

- Intrusion detection system (IDS) evaluation
- Anomaly detection research
- Benchmarking machine learning and deep learning models such as:
  - LSTM / GRU
  - CNN-based sequence models
  - Autoencoders
  - Traditional classifiers (SVM, Random Forest, k-NN)

Advantages

- Reflects realistic modern attack behavior
- Avoids outdated network-focused features
- Well-suited for sequence-based learning models

Limitations

- Host-specific (Linux-only)
- Limited attack diversity compared to large-scale enterprise datasets
- Requires careful preprocessing due to variable-length sequences
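
As a rough illustration of the preprocessing steps named under Processing and Transformation (integer encoding, padding, n-gram extraction, and binary labeling), the following is a minimal sketch using only the Python standard library. It is not the contributor's actual pipeline. The example traces, the reserved indices `PAD` and `UNK`, the target length of 6, and all function names are assumptions made for illustration; the sketch also assumes each raw trace is a whitespace-separated string of system call numbers.

```python
from collections import Counter

PAD = 0  # index reserved for padding (assumption, not from the dataset)
UNK = 1  # index for system calls not seen when the vocabulary was built (assumption)


def parse_trace(text):
    """Split a whitespace-separated trace of system call numbers into ints."""
    return [int(tok) for tok in text.split()]


def build_vocab(traces):
    """Map each distinct system call number to a contiguous index starting at 2."""
    calls = sorted({call for trace in traces for call in trace})
    return {call: idx for idx, call in enumerate(calls, start=2)}


def encode(trace, vocab):
    """Integer-encode one trace; unseen calls map to UNK."""
    return [vocab.get(call, UNK) for call in trace]


def pad_or_truncate(seq, length):
    """Right-pad with PAD, or cut off the tail, so every sequence has `length` items."""
    return (seq + [PAD] * length)[:length]


def ngram_counts(seq, n):
    """Count the contiguous n-grams of an encoded sequence."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))


# Hypothetical raw traces and labels (0 = normal, 1 = attack).
raw_traces = ["6 6 63 6 42", "11 45 33"]
labels = [0, 1]

traces = [parse_trace(t) for t in raw_traces]
vocab = build_vocab(traces)
encoded = [encode(t, vocab) for t in traces]
padded = [pad_or_truncate(seq, 6) for seq in encoded]
bigrams = [ngram_counts(seq, 2) for seq in encoded]

for seq, label in zip(padded, labels):
    print(label, seq)
```

In practice the vocabulary would be built from the training split only, so that system calls appearing solely in test traces fall back to `UNK` rather than leaking information into the encoding.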

Files

Categories

Cybersecurity, Machine Learning, Intrusion Detection

Licence