A Multi-feature Dataset for Windows PE Malware Classification
This repository contains a multi-feature dataset of Windows PE malware samples. We collected the samples from MalwareBazaar and used the Python pefile library to extract four feature sets, all of which can be used for static malware analysis. We then used the VirusTotal API to label the samples and assigned each one to one of five families by majority voting. The first feature set (DLLs_Imported.csv file) contains the DLLs imported by each malware sample. Its first column holds the SHA256 hash, the second column holds the label (the family name of the malware), and the remaining columns list the names of the imported DLLs. The second feature set (API Functions.csv files) contains the API functions called by each sample, along with its SHA256 hash and label. The third feature set (Header.csv) contains the values of 52 PE header fields. The fourth feature set (Section.csv file) contains 10 field values for each of 10 different PE sections. In both Header.csv and Section.csv, every field is labelled in the CSV file.
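The column layout described for DLLs_Imported.csv (hash, label, then a variable number of DLL names) can be read with a few lines of Python. The following is only a minimal sketch under stated assumptions: it assumes the file starts with a header row and that unused DLL columns are left empty, and the column names and rows in the in-memory sample are made up for illustration. The helper name read_dll_rows is hypothetical and not part of the dataset.

```python
import csv
import io


def read_dll_rows(handle):
    """Yield (sha256, family, dlls) tuples from a DLLs_Imported-style CSV.

    Assumes: column 1 = SHA256, column 2 = family label, remaining
    columns = imported DLL names, with empty cells for unused columns.
    """
    reader = csv.reader(handle)
    next(reader, None)  # skip the header row (assumed to be present)
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        sha256, family = row[0], row[1]
        # DLL names are case-insensitive on Windows, so normalise to lower case.
        dlls = [cell.strip().lower() for cell in row[2:] if cell.strip()]
        yield sha256, family, dlls


# Made-up rows standing in for the real file (hypothetical header names).
sample = io.StringIO(
    "SHA256,Type,dll_1,dll_2,dll_3\n"
    "aaaa,FamilyA,KERNEL32.dll,USER32.dll,\n"
    "bbbb,FamilyB,kernel32.dll,,\n"
)
rows = list(read_dll_rows(sample))
for sha256, family, dlls in rows:
    print(sha256, family, dlls)
```

To read the actual file, the same function can be given an open file handle, for example `open("DLLs_Imported.csv", newline="")`. Lower-casing the DLL names is a design choice of this sketch, made so that `KERNEL32.dll` and `kernel32.dll` count as the same feature.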
Steps to reproduce
The data were collected in two steps. In the first step, we collected samples from the MalwareBazaar database using its API. The API calls targeted only Windows PE files, and more than 20,000 samples were downloaded. We used the Python pefile library to extract PE statistics (features) from those samples. Samples with incorrect or missing values in the PE header were discarded, as were samples with code obfuscation. After discarding these unwanted samples, the dataset contains a total of 18,551 samples. In the second step, we submitted the SHA256 hashes of all the samples to VirusTotal through its API to label each sample with its family.
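The description states that families were assigned by majority voting over the VirusTotal results. A minimal sketch of such a vote is given below. It is an illustration rather than the exact procedure used for this dataset: it assumes the per-engine results have already been reduced to plain family names, and the tie rule (leave the sample unlabelled) is an assumption of this sketch. The function name majority_family and the engine and family names are hypothetical.

```python
from collections import Counter


def majority_family(engine_labels):
    """Return the family named by the most engines.

    engine_labels maps an engine name to a family name, or to None when
    the engine gave no usable label. Returns None when there are no
    votes or when the top two families tie (assumed tie rule).
    """
    votes = Counter(label for label in engine_labels.values() if label)
    if not votes:
        return None
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie between the two most frequent families
    return ranked[0][0]


# Made-up engine results for a single sample.
example = {
    "engine_a": "FamilyA",
    "engine_b": "FamilyA",
    "engine_c": "FamilyB",
    "engine_d": None,
}
print(majority_family(example))
```

Returning None on a tie keeps ambiguous samples out of the labelled set instead of picking a family arbitrarily; a different tie rule would need only a change to that one branch.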
Korea Institute of Science and Technology
KIST School Partnership