DReLAB dataset - Deep REinforcement Learning Adversarial Botnet dataset

Published: 09-12-2020 | Version 1 | DOI: 10.17632/nf22d786tj.1
Andrea Venturi,
Giovanni Apruzzese,
Mauro Andreolini,
Mirco Marchetti,
Michele Colajanni


We present the first dataset designed to serve as a benchmark for validating the resilience of botnet detectors against adversarial attacks. The dataset includes realistic adversarial samples generated automatically with two widely used Deep Reinforcement Learning (DRL) techniques. These adversarial samples have been shown to evade state-of-the-art detectors based on both Machine Learning and Deep Learning algorithms. The initial corpus of malicious samples consists of network flows belonging to different botnet families, taken from three public datasets that contain real enterprise network traffic. We use these datasets to build detectors that achieve state-of-the-art performance. We then train two DRL agents, based on Double Deep Q-Network and Deep Sarsa, to generate realistic adversarial samples. The goal is to induce misclassifications by applying small modifications to the initial malicious samples; these alterations involve the features that an expert attacker can most realistically alter, and they do not compromise the underlying malicious logic of the original samples. Our dataset is an important contribution to the cybersecurity research community, as it is the first to include thousands of automatically generated adversarial samples that thwart state-of-the-art classifiers with a high evasion rate. The adversarial samples are grouped by malware variant and provided in CSV format. Researchers can validate their defensive proposals by testing their detectors against the adversarial samples in this dataset. Moreover, analyzing these samples can pave the way to a deeper understanding of adversarial attacks and to the development of new, effective defensive techniques, and it can also benefit work on the explainability of machine learning algorithms.
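As an illustration of the validation workflow described above, the following minimal Python sketch computes an evasion rate, that is, the fraction of adversarial flows that a detector labels as benign. Everything specific in it is an assumption made for the example: the column names (`duration`, `src_bytes`, `dst_bytes`), the inline CSV content, the label convention (1 = malicious, 0 = benign), and the threshold-based `toy_detector`. None of these are taken from the dataset; the actual file layout should be checked against the CSV files themselves.

```python
import csv
import io
from typing import Callable, List, Sequence


def load_flows(csv_text: str, feature_columns: Sequence[str]) -> List[List[float]]:
    """Parse CSV text into a list of numeric feature vectors."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [[float(row[name]) for name in feature_columns] for row in reader]


def evasion_rate(predict: Callable[[List[List[float]]], List[int]],
                 flows: List[List[float]]) -> float:
    """Fraction of adversarial flows that the detector labels as benign (0)."""
    if not flows:
        raise ValueError("no flows to evaluate")
    predictions = predict(flows)
    evaded = sum(1 for label in predictions if label == 0)
    return evaded / len(flows)


def toy_detector(flows: List[List[float]]) -> List[int]:
    """Hypothetical stand-in detector: flags flows shorter than 10 seconds."""
    return [1 if flow[0] < 10.0 else 0 for flow in flows]


# Hypothetical column names and values, for illustration only.
FEATURES = ["duration", "src_bytes", "dst_bytes"]
SAMPLE_CSV = """duration,src_bytes,dst_bytes
2.0,300,120
15.0,900,400
30.0,1500,800
4.5,280,100
"""

flows = load_flows(SAMPLE_CSV, FEATURES)
rate = evasion_rate(toy_detector, flows)
print(f"evasion rate: {rate:.2f}")
```

In practice, `toy_detector` would be replaced by the `predict` method of the detector under test, and `SAMPLE_CSV` by the contents of one of the per-variant CSV files.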


Steps to reproduce

We submit malicious botnet samples from three public datasets to Deep Reinforcement Learning agents (based on the Double Deep Q-Network and Deep Sarsa algorithms) trained to evade state-of-the-art botnet detectors (based on Random Forest and Wide and Deep classifiers) by applying small, feasible feature modifications. The dataset consists of the modified samples that evaded the botnet detectors. A tutorial on how to use our data is available at https://github.com/andreaventuri01/DReLAB_tutorial
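The general shape of the procedure above (modify a malicious sample step by step and keep it only if the detector is evaded) can be sketched as follows. This is not the authors' implementation: a fixed round-robin policy stands in for the trained Double Deep Q-Network and Deep Sarsa agents, and the action set, feature layout, step budget, and threshold detector are all hypothetical choices made for the example.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical action set: (feature index, increment applied to that feature).
Action = Tuple[int, float]
ACTIONS: List[Action] = [(0, 1.0), (1, 8.0), (2, 8.0)]


def perturb_until_evasion(
    sample: Sequence[float],
    is_malicious: Callable[[List[float]], bool],
    choose_action: Callable[[List[float], int], int],
    actions: Sequence[Action],
    max_steps: int = 20,
) -> Optional[Tuple[List[float], int]]:
    """Apply small feature increments until the detector is evaded.

    Returns (adversarial sample, number of modifications), or None if the
    step budget runs out first. The input sample is left unmodified.
    """
    current = list(sample)
    for step in range(max_steps):
        index, delta = actions[choose_action(current, step)]
        current[index] += delta
        if not is_malicious(current):
            return current, step + 1
    return None


def round_robin_policy(state: List[float], step: int) -> int:
    """Placeholder policy; a trained DRL agent would pick the action here."""
    return step % len(ACTIONS)


def toy_is_malicious(flow: List[float]) -> bool:
    """Hypothetical detector: flags flows shorter than 10 seconds."""
    return flow[0] < 10.0


original = [7.0, 300.0, 120.0]  # hypothetical [duration, src_bytes, dst_bytes]
result = perturb_until_evasion(original, toy_is_malicious, round_robin_policy, ACTIONS)
failed = perturb_until_evasion(original, toy_is_malicious, round_robin_policy, ACTIONS, max_steps=3)
print(result, failed)
```

Only the samples for which such a loop succeeds correspond to what the dataset contains; runs that exhaust the budget (returning `None` in this sketch) would be discarded.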