Ransomware and user samples for training and validating ML models
Description
Ransomware has been a significant threat to most enterprises in recent years. In scenarios where users can access all files on a shared server, a single infected host can lock access to every shared file. In the article related to this repository, we detect ransomware infection by analyzing file-sharing traffic, even when the traffic is encrypted. We compare three machine learning models and choose the best one for validation. We train and test the detection model using more than 70 ransomware binaries from 26 different families and more than 2500 h of 'not infected' traffic from real users. The results show that the proposed tool detects all ransomware binaries, including those not used in the training phase (zero-days). The paper validates the algorithm by studying the false positive rate and the amount of information from user files that the ransomware could encrypt before being detected.

This dataset directory contains the 'infected' and 'not infected' samples and the models used for each T configuration, each in a separate folder. Each folder (for example 10s/) contains 8 files:
- tree.py -> Python script with the Tree model.
- ensemble.json -> JSON file with the information about the Ensemble model.
- NN_XhiddenLayer.json -> JSON file with the information about the NN model with X hidden layers (one file for each of X = 1, 2 and 3).
- train_data.csv -> All samples used for training each model in this folder. It is in CSV format for use in the BigML application.
- zeroDays.csv -> All zero-day samples used for testing each model in this folder. It is in CSV format for use in the BigML application.
- FPtest_data.csv -> All samples used for validating each model in this folder. It is in CSV format for use in the BigML application.

The models in each folder are the configurations obtained by training with the train_data.csv file. The validation and the measurement of the data lost were carried out with the FPtest_data.csv and zeroDays.csv files, respectively.

The files containing samples (train_data.csv, zeroDays.csv and FPtest_data.csv) are structured as follows:
- Each line is one sample.
- Each sample has 3*T features and the label (1 if it is an 'infected' sample and 0 if it is not).
- The features are separated by ',' because it is a CSV file.
- The last column is the label of the sample.
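As an illustration of the sample layout described above, the following is a minimal Python sketch of how one of the CSV files might be read into features and labels. It is not part of the dataset. The function name load_samples, the parameter names, and the has_header switch are hypothetical. The sketch also assumes that T is the number in the folder name (for example T = 10 for 10s/) and that the files have no header row, neither of which is confirmed here. The in-memory rows are made-up values used only to exercise the code.

```python
import csv
import io


def load_samples(source, t, has_header=False):
    """Read samples with 3*T feature columns followed by a 0/1 label.

    `source` is any iterable of CSV lines (an open file or a StringIO).
    `t` is the T configuration of the folder (assumed: 10 for 10s/).
    `has_header` is a hypothetical switch in case the file starts with
    a header row; the description does not say whether it does.
    """
    expected_features = 3 * t
    features, labels = [], []
    reader = csv.reader(source)
    if has_header:
        next(reader, None)  # skip the header row if present
    for row in reader:
        if not row:
            continue  # ignore blank lines
        if len(row) != expected_features + 1:
            raise ValueError(
                f"expected {expected_features + 1} columns, got {len(row)}"
            )
        # All columns except the last are features; the last is the label.
        features.append([float(value) for value in row[:-1]])
        labels.append(int(float(row[-1])))
    return features, labels


# Made-up stand-in for a sample file with T = 1 (3 features + label).
demo = io.StringIO("0.0,12.5,3.0,0\n4.0,80.0,75.5,1\n")
X, y = load_samples(demo, t=1)
print(len(X), "samples, labels:", y)
```

With a real file, the same function could be called as load_samples(open("10s/train_data.csv", newline=""), t=10), assuming the folder layout described above. The column-count check is there to catch a mismatch between the chosen T and the file actually loaded.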
Files
Institutions
- Universidad Publica de Navarra