A Multi-feature Dataset for Windows PE Malware Classification

Published: 28 December 2022 | Version 1 | DOI: 10.17632/vnj7sxkt53.1
Muhammad Irfan Yousuf


This repository contains a multi-feature dataset of Windows PE malware samples. We collected PE malware samples from MalwareBazaar and used Python's pefile library to extract four feature sets, which can be used for static malware analysis. We then used the VirusTotal API to label the samples and categorized them into five families based on majority voting. The first feature set (DLLs_Imported.csv) lists the DLLs imported by each malware sample: the first column contains the SHA256 hash, the second column contains the label (family name) of the malware, and the remaining columns list the names of the imported DLLs. The second feature set (API Functions.csv) contains the API functions called by each sample, along with its SHA256 hash and label. The third feature set (Header.csv) contains the values of 52 PE header fields. The fourth feature set (Section.csv) contains 10 field values for each of 10 different PE sections. All fields in Header.csv and Section.csv are labelled in the CSV files.
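As an illustration of the DLLs_Imported.csv layout described above, the following minimal sketch reads rows of the form SHA256, label, then a variable number of DLL names. The two sample rows, the hash and family values, the presence of a header row, and the function name `read_dll_rows` are assumptions made for this example and are not taken from the dataset.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt in the described layout: SHA256, family label,
# then a variable number of imported DLL names (values are made up).
SAMPLE = """SHA256,Label,DLL_1,DLL_2,DLL_3
aaaa,FamilyA,kernel32.dll,user32.dll,
bbbb,FamilyB,KERNEL32.dll,advapi32.dll,ws2_32.dll
"""

def read_dll_rows(handle):
    """Return a list of (sha256, label, sorted unique lower-cased DLL names)."""
    reader = csv.reader(handle)
    next(reader, None)  # skip the header row (assumed to be present)
    rows = []
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        sha256, label = row[0].strip(), row[1].strip()
        # Drop empty padding cells and normalize case, since DLL names
        # are case-insensitive on Windows.
        dlls = sorted({cell.strip().lower() for cell in row[2:] if cell.strip()})
        rows.append((sha256, label, dlls))
    return rows

rows = read_dll_rows(io.StringIO(SAMPLE))

# Count how many samples import each DLL.
dll_counts = Counter(dll for _, _, dlls in rows for dll in dlls)
```

To read the real file, the `io.StringIO(SAMPLE)` argument could be replaced with `open("DLLs_Imported.csv", newline="")`. The `csv` module is used here because rows have differing numbers of DLL columns.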


Steps to reproduce

The data were collected in two steps. In the first step, we collected samples from the MalwareBazaar database using its API. The API calls targeted only Windows PE files, and more than 20,000 samples were downloaded. We used Python's pefile library to extract PE statistics (features) from those samples. Samples with incorrect or missing values in the PE header were discarded, as were samples with code obfuscation. After discarding these unwanted samples, the dataset contains a total of 18,551 samples. In the second step, we submitted the SHA256 hashes of all samples to VirusTotal through its API to label each sample with its malware family.
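The description states that family labels were assigned by majority voting over VirusTotal results. The sketch below shows one way such a vote could work. The record does not state the vote threshold or how vendor labels were normalized, so the `min_share` parameter, the function name `majority_family`, and the example family names are all assumptions. Fetching and parsing VirusTotal responses is not shown.

```python
from collections import Counter

def majority_family(votes, min_share=0.5):
    """Return the family named by more than `min_share` of the votes.

    `votes` is a list of family names, one per antivirus engine, already
    normalized to a common vocabulary (an assumption of this sketch).
    Returns None when there are no votes or no family clears the threshold.
    """
    votes = [v for v in votes if v]  # drop engines that gave no label
    if not votes:
        return None
    family, count = Counter(votes).most_common(1)[0]
    return family if count / len(votes) > min_share else None

# Hypothetical per-engine labels for a single sample.
example_votes = ["trojan", "trojan", "worm", None, "trojan"]
label = majority_family(example_votes)
```

A strict majority is only one possible rule. A plurality vote (the most common label wins regardless of its share) would label more samples, at the cost of less agreement per label.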


University of Engineering and Technology


Machine Learning, Malware Mitigation


Korea Institute of Science and Technology

KIST School Partnership