Encrypted Traffic Feature Dataset for Machine Learning and Deep Learning based Encrypted Traffic Analysis
This traffic dataset contains balanced amounts of encrypted malicious and encrypted legitimate traffic for encrypted malicious traffic detection and analysis. It is a secondary CSV feature dataset composed from six public traffic datasets. The dataset was curated according to two criteria. The first criterion is to combine public datasets that are widely used in existing work and contain enough encrypted malicious or encrypted legitimate traffic, such as the Malware Capture Facility Project datasets. The second criterion is to ensure that the final dataset is balanced between encrypted malicious and legitimate network traffic. Based on these criteria, six public datasets were selected. After data pre-processing, the details of each selected public dataset and the sizes of the different kinds of encrypted traffic are given in the “Dataset Statistic Analysis Document”. The document summarizes the amount of malicious and legitimate traffic taken from each selected public dataset, the amount of traffic for each malicious traffic type, and the total size of the composed dataset. The tables in that document show that encrypted malicious traffic and encrypted legitimate traffic each contribute approximately 50% of the final composed dataset. The dataset was prepared for encrypted malicious traffic detection. Because it is intended for training machine learning or deep learning models, a sample train set and test set are also provided, split at a ratio of 1:4. These sets can be used for model training and testing with selected features, either directly or after further data pre-processing.
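The split described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes that 1:4 means one part test to four parts train (a 20% test set), that the split is stratified by class so both sets stay balanced, and that each record carries a `label` column. The function name, the column names, and the toy records are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key="label", test_fraction=0.2, seed=0):
    """Split rows into (train, test) while preserving the label ratio.

    test_fraction=0.2 encodes the assumed 1:4 test-to-train ratio.
    """
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Toy records standing in for session-level feature rows (hypothetical).
rows = [
    {"session_id": i, "label": "malicious" if i % 2 else "legitimate"}
    for i in range(100)
]
train, test = stratified_split(rows)
print(len(train), len(test))
```

Splitting per class rather than over the whole table keeps the roughly 50/50 balance of the composed dataset in both the train set and the test set.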
Steps to reproduce
Based on the criteria given in the Description, six public datasets were selected to build the composed dataset. CTU-Malware-Capture, Benign-Capture, and Mixture Capture are three datasets from the Malware Capture Facility Project published by Stratosphere Lab. CICIDS-2017, CICIDS-2012, and CIRA-CIC-DoHBRW-2020 are three datasets published by the Canadian Institute for Cybersecurity (CIC). First, we downloaded the raw PCAP/PCAPNG traffic provided by the six datasets. During this step we screened out small PCAP/PCAPNG files, because they contain too little traffic to be useful, as well as files with little or no encrypted traffic, for the same reason. We then used Wireshark to analyze the PCAP/PCAPNG files downloaded from each public dataset and removed irrelevant packets, such as Address Resolution Protocol (ARP) and Internet Control Message Protocol (ICMP) packets, because they are not relevant to encrypted malicious traffic analysis and detection. After pre-processing, we fed the PCAP/PCAPNG files into our updated Python-based feature extraction function, which is built on the dpkt, communityid, and scapy libraries and extracts the required features while parsing the files. The function extracted 305 traffic features and wrote them to CSV files. Because of the large volume of traffic data, we separated the features into two kinds of CSV files, one at packet level and one at session level. The two files are linked by the same unique hash value. Researchers are free to combine selected features or to perform further feature engineering as their feature set requires. The next step is to prepare for dataset merging.
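As an illustration of how the packet-level and session-level files can be linked through their shared hash value, the sketch below groups packet rows under their session row. The column name `flow_hash` and the other columns are assumptions made for the example; the actual column names should be taken from the published CSV headers.

```python
import csv
import io
from collections import defaultdict

# Hypothetical miniature CSVs; the real files hold many more feature columns.
SESSION_CSV = """flow_hash,label,duration
a1,malicious,2.5
b2,legitimate,0.8
"""

PACKET_CSV = """flow_hash,packet_index,packet_length
a1,0,517
a1,1,1460
b2,0,310
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def join_on_hash(session_rows, packet_rows, key="flow_hash"):
    """Attach to each session row the packet rows sharing its hash value."""
    packets_by_hash = defaultdict(list)
    for packet in packet_rows:
        packets_by_hash[packet[key]].append(packet)

    joined = {}
    for session in session_rows:
        joined[session[key]] = {
            "session": session,
            "packets": packets_by_hash.get(session[key], []),
        }
    return joined


joined = join_on_hash(read_rows(SESSION_CSV), read_rows(PACKET_CSV))
for flow_hash, entry in joined.items():
    print(flow_hash, entry["session"]["label"], len(entry["packets"]))
```

For the real files, the same pattern applies with `open(path, newline="")` in place of `io.StringIO`. Note that `csv.DictReader` returns every value as a string, so numeric features need converting before model training.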
We ensured that the encrypted malicious and legitimate traffic sessions are balanced, which helps minimize bias during model training. After selecting a sufficient number of traffic sessions from each public dataset, we truncated the sessions. The reason is that the number of packets per session varies widely: some sessions contain tens of thousands of packets, while others contain fewer than 10. To reduce the size of the data while preserving as much of the variability of the encrypted traffic as possible, we fed the traffic feature data into a truncation function written in Python. The function limits each session to a maximum of 15 packets. Sessions with more than 15 packets are cut down to 15 packets, and sessions with 15 or fewer packets are left unchanged. After these steps, the processed traffic feature files are merged on their shared feature column names.
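The truncation and merging steps can be sketched as below. This is not the authors' function: it assumes that packet rows arrive in capture order, that truncation keeps the first 15 packets of each session, and that sessions are identified by a hypothetical `flow_hash` column. The merge simply concatenates tables after checking that their column names match.

```python
from collections import defaultdict

MAX_PACKETS = 15  # per-session cap stated in the text


def truncate_sessions(packet_rows, key="flow_hash", max_packets=MAX_PACKETS):
    """Keep at most max_packets rows per session, in their original order."""
    kept = []
    seen = defaultdict(int)
    for row in packet_rows:
        if seen[row[key]] < max_packets:
            seen[row[key]] += 1
            kept.append(row)
    return kept


def merge_tables(tables):
    """Concatenate row lists, requiring identical column names in every row."""
    merged = []
    columns = None
    for table in tables:
        for row in table:
            if columns is None:
                columns = set(row)
            elif set(row) != columns:
                raise ValueError("column names differ between tables")
            merged.append(row)
    return merged


# Toy packet rows (hypothetical): one long session and one short session.
long_session = [{"flow_hash": "a1", "packet_index": i} for i in range(40)]
short_session = [{"flow_hash": "b2", "packet_index": i} for i in range(7)]

truncated = truncate_sessions(long_session + short_session)
merged = merge_tables([truncated, [{"flow_hash": "c3", "packet_index": 0}]])
print(len(truncated), len(merged))
```

Rejecting mismatched columns outright, rather than filling gaps with empty values, makes it obvious when a source dataset was processed with a different feature set.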