UMUDGA - University of Murcia Domain Generation Algorithm Dataset
Description
In computer security, network botnets still represent a major cyber threat. Concealing techniques such as the dynamic addressing and the Domain Name Generation Algorithms (DGAs) require an improved and more effective detection process. To this extent, this data descriptor presents a collection of over 30 million manually-labelled algorithmically generated domain names decorated with a feature set ready-to-use for Machine Learning analysis. This proposed data set enables researchers to move forward the data collection, organization and pre-processing phases, eventually enabling them to focus on the analysis and the production of Machine-Learning powered solutions for network intrusion detection. To be as exhaustive as possible, 50 among the most important malware variants have been selected. Each family is available both as list of domains and as collection of features. To be more precise, the former is generated by executing the malware DGAs in a controlled environment with fixed parameters, while the latter is generated by extracting a combination of statistical and Natural Language Processing (NLP) metrics.
Files
Steps to reproduce
Data source and relevant code snippets are available in the corresponding folder.