Composed Encrypted Malicious Traffic Dataset for machine learning based encrypted malicious traffic analysis.
This traffic dataset contains balanced amounts of encrypted malicious and legitimate traffic for encrypted malicious traffic detection. It is a secondary dataset of CSV feature data composed from five public traffic datasets. The dataset was composed according to three criteria. The first criterion is to combine public datasets that are widely used in existing work and that contain both encrypted malicious and legitimate traffic, such as the Malware Capture Facility Project dataset and the CICIDS-2017 dataset. The second criterion is to ensure data balance, i.e., a balance between malicious and legitimate network traffic and a similar amount of traffic contributed by each individual dataset. To this end, approximately equal proportions of malicious and legitimate traffic were extracted from each selected public dataset by random sampling, and we ensured that no selected public dataset contributes much more traffic than the others. The third criterion is that the dataset includes encrypted malicious and legitimate traffic from both conventional devices and IoT devices, as these devices are increasingly deployed together in the same environments, such as offices, homes, and other smart city settings. Based on these criteria, five public datasets were selected. After data pre-processing, details of each selected public dataset and of the final composed dataset are given in the “Dataset Statistic Analysis Document”. That document summarizes the amount of malicious and legitimate traffic selected from each public dataset, the proportion of the composed dataset contributed by each public dataset (% w.r.t. the composed dataset), the proportion of encrypted traffic selected from each public dataset (% of selected public dataset), and the total traffic size of the composed dataset.
From the table in the document, we can observe that each public dataset contributes approximately 20% of the composed dataset, except for CICIDS-2012 (due to its limited amount of encrypted malicious traffic). This achieves a balance across the individual datasets and reduces bias towards traffic from any single dataset during learning. We can also observe that the amounts of malicious and legitimate traffic are almost the same, thus achieving class balance. The datasets made available here were prepared for encrypted malicious traffic detection. Since the dataset is intended for machine learning model training, sample train and test sets are also provided. The train and test sets are split at a ratio of 1:4, with stratification applied during the split. These sets can be used directly to train machine learning or deep learning models on selected features.
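The stratified split mentioned above can be illustrated with a minimal sketch using only the Python standard library. It is not the code used to produce the provided train and test sets. The column name `label`, the helper `stratified_split`, and the toy rows are hypothetical, and the sketch assumes that the 1:4 ratio means one part test to four parts train (`test_fraction=0.2`); the description does not say which side is larger, so adjust the fraction if the opposite reading applies.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key="label", test_fraction=0.2, seed=0):
    """Split rows into (train, test) while preserving the label proportions."""
    rng = random.Random(seed)

    # Group rows by class label so each class is split separately.
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])

    # Shuffle again so rows are not ordered by class.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Toy data (hypothetical): 50 rows labelled 0 and 50 rows labelled 1.
rows = [{"flow_id": i, "label": i % 2} for i in range(100)]
train, test = stratified_split(rows)
print(len(train), len(test))
```

Because each class is split on its own, both sets keep the same malicious-to-legitimate proportion as the full dataset.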
Steps to reproduce
Based on the three criteria given in the Description above, five public datasets were selected to build the composed dataset:
1. Malware Capture Facility Project Dataset
2. CICIDS-2012 Dataset
3. CIC-AndMal 2017 Dataset
4. CICIDS-2017 Dataset
5. UNSW NS 2019 Dataset
First, we downloaded the raw PCAP/PCAPNG traffic provided by the five datasets. In the process, we screened out small PCAP/PCAPNG files of less than 1 MB, because they contain too little traffic to be useful for our dataset. Wireshark was then used to analyze the PCAP/PCAPNG files downloaded from each public dataset, and irrelevant packets, such as Address Resolution Protocol (ARP) and Internet Control Message Protocol (ICMP) packets, were removed because they are not relevant to encrypted malicious traffic detection.
After data pre-processing, we fed the pre-processed PCAP/PCAPNG files into our Python-based feature extraction function, which is built on the dpkt and scapy libraries. The function parses the pre-processed PCAP/PCAPNG files and extracts the features we need. In total, 113 traffic features were extracted and written to CSV files.
The next step is to prepare for dataset merging. To reduce the biases introduced by merging datasets captured at multiple vantage points and at different times, we used random sampling to select an equal amount of data from each selected public dataset for the final composed dataset. We also ensured that the encrypted malicious and legitimate traffic from each selected public dataset is balanced, so that the encrypted malicious and legitimate traffic of the final composed dataset is balanced as well. Therefore, approximately equal numbers of malicious and legitimate traffic samples were extracted from each selected public dataset.
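The balanced random sampling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' sampling code. The keys `source` and `label`, the helper `balanced_sample`, and the toy rows are hypothetical.

```python
import random
from collections import defaultdict


def balanced_sample(rows, per_group, source_key="source", label_key="label", seed=0):
    """Randomly draw up to `per_group` rows for every (source dataset, label) pair."""
    rng = random.Random(seed)

    # Bucket rows by the dataset they came from and by their class label.
    groups = defaultdict(list)
    for row in rows:
        groups[(row[source_key], row[label_key])].append(row)

    sampled = []
    for key in sorted(groups):
        group = groups[key]
        # A group smaller than the quota contributes everything it has.
        k = min(per_group, len(group))
        sampled.extend(rng.sample(group, k))
    return sampled


# Toy data (hypothetical): dataset "A" is far larger than dataset "B".
rows = [{"source": "A", "label": i % 2} for i in range(1000)]
rows += [{"source": "B", "label": i % 2} for i in range(60)]
sample = balanced_sample(rows, per_group=25)
print(len(sample))
```

Sampling the same quota from every (dataset, label) pair balances both the classes and the per-dataset contributions at once. A source with too little traffic of one class simply contributes less, which resembles the CICIDS-2012 exception noted in the Description.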
We also ensured that no selected public dataset contributes much more traffic than the others. This arrangement minimizes bias during model training.
After selecting a sufficient number of traffic sessions from each public dataset, we applied truncation to the datasets. The number of packets varies widely between sessions: some sessions contain tens of thousands of packets, while others contain fewer than 10. Therefore, to minimize the size of the data while preserving as much of the variability of the encrypted traffic as possible, we fed the traffic feature data into a truncation function written in Python. We limited the maximum number of packets in each session to 15. Sessions with more than 15 packets are cut down to 15 packets, and sessions with 15 or fewer packets are left unchanged.
Once these steps are completed, the processed traffic feature files are merged on their shared feature column names.
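The truncation and merging steps can be sketched as below. This is an illustration only, not the truncation function used for the dataset. It assumes that truncation keeps the first 15 packets of a session (the text does not say which 15 are kept) and that merging keeps only the columns present in every file. The helpers `truncate_session` and `merge_on_shared_columns` and the column names are hypothetical.

```python
MAX_PACKETS = 15  # per-session cap stated in the text


def truncate_session(packets, max_packets=MAX_PACKETS):
    """Keep at most `max_packets` packet records (assumed: the first ones).

    Sessions at or below the cap are returned unchanged.
    """
    return packets[:max_packets]


def merge_on_shared_columns(tables):
    """Concatenate tables (lists of dict rows), keeping only shared columns.

    Assumes every row in a table has the same keys as that table's first row.
    """
    non_empty = [t for t in tables if t]
    if not non_empty:
        return []
    shared = set(non_empty[0][0])
    for table in non_empty[1:]:
        shared &= set(table[0])
    # Preserve the column order of the first table.
    columns = [c for c in non_empty[0][0] if c in shared]
    return [{c: row[c] for c in columns} for table in non_empty for row in table]


# Toy sessions (hypothetical): one very long, one short.
long_session = list(range(20000))
short_session = list(range(7))
print(len(truncate_session(long_session)), len(truncate_session(short_session)))

# Toy feature tables (hypothetical column names).
table_a = [{"pkt_len": 60, "label": 1, "extra": "x"}]
table_b = [{"pkt_len": 1500, "label": 0}]
merged = merge_on_shared_columns([table_a, table_b])
print(merged)
```

Capping every session at the same length bounds the contribution of very long sessions while leaving short sessions intact.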