Synthetic Datasets for Scenario-based Data Breaches
Description
Analyzing real-life data breaches raises privacy issues because the data contain sensitive Personally Identifiable Information (PII) linked to real individuals. To avoid this, we synthetically generated datasets that mimic real-life scenarios and are built around fictitious data breaches. There are two kinds of datasets: (1) a Master Record Table (MRT), which consists of 4 million programmatically generated records of individuals, each profiled with several PIIs; and (2) 16 scenario-based datasets depicting various fictitious data breaches, which vary in the number of records and in the PIIs they contain. We also provide the code used to generate the datasets, documented within the code itself, so that practitioners, researchers, and others can modify, test, and use it according to their requirements. This supports transparency through reusability, reproducibility, and replicability. Datasets larger than 500 MB are also provided as split files, stored alongside the original dataset; the code for splitting the data is included.
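As a rough illustration of the structure described above, the following minimal Python sketch builds a tiny master table, derives one scenario dataset from it by sampling records and keeping a subset of PII fields, and splits serialized rows into size-limited parts. It is not the code distributed with the dataset: the field names, the helper functions (`make_master_records`, `make_scenario`, `split_csv_lines`), and the sampling approach are all assumptions made for illustration, and the actual notebook may work differently.

```python
import random

# Hypothetical PII fields; the real MRT schema may differ.
PII_FIELDS = ["name", "email", "phone", "date_of_birth"]


def make_master_records(n, seed=0):
    """Generate n synthetic records with placeholder PII values."""
    rng = random.Random(seed)
    records = []
    for i in range(n):
        year = rng.randrange(1950, 2005)
        month = rng.randrange(1, 13)
        day = rng.randrange(1, 29)
        records.append({
            "record_id": i,
            "name": f"Person {i}",
            "email": f"person{i}@example.com",
            "phone": f"555-{rng.randrange(10_000):04d}",
            "date_of_birth": f"{year}-{month:02d}-{day:02d}",
        })
    return records


def make_scenario(master, n_records, pii_fields, seed=0):
    """Sample n_records from the master table, keeping only pii_fields."""
    rng = random.Random(seed)
    sampled = rng.sample(master, n_records)
    return [
        {"record_id": r["record_id"], **{f: r[f] for f in pii_fields}}
        for r in sampled
    ]


def split_csv_lines(header, lines, max_bytes):
    """Group CSV lines into parts of at most max_bytes, repeating the header.

    Sizes ignore newline characters; a single line larger than max_bytes
    still gets a part of its own.
    """
    header_size = len(header.encode("utf-8"))
    parts, current, size = [], [header], header_size
    for line in lines:
        line_size = len(line.encode("utf-8"))
        if len(current) > 1 and size + line_size > max_bytes:
            parts.append(current)
            current, size = [header], header_size
        current.append(line)
        size += line_size
    if len(current) > 1:
        parts.append(current)
    return parts


# Example usage on a small table (the published MRT has 4 million records).
master = make_master_records(1000)
scenario = make_scenario(master, 100, ["email", "phone"])
header = "record_id,email,phone"
lines = [f"{r['record_id']},{r['email']},{r['phone']}" for r in scenario]
parts = split_csv_lines(header, lines, max_bytes=500)
print(len(scenario), "records split into", len(parts), "parts")
```

In this sketch each scenario differs only in how many records are sampled and which PII fields are kept, and each split part repeats the header so it can be loaded independently. The 500-byte limit stands in for the 500 MB threshold mentioned above.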
Files
Steps to reproduce
The code is provided with the dataset and is comprehensively documented inline. The project is written in Python 3, and the code is provided in .ipynb (Jupyter notebook) format. The notebook structure should help researchers and practitioners understand, modify, and replicate the code with ease.