Paper_IJPE_Repository_1_Dataset_Purchases_Original_and_Augmented
Description
This repository contains two datasets:
Original Dataset (100 rows): a manufacturer-provided dataset of purchased items.
Augmented Dataset (10,000 rows): a synthetically generated dataset intended as input to the FP-Growth algorithm for extracting risk interdependency rules.
The augmented dataset was produced with a synthetic data generation technique based on probability distributions: each new categorical value is sampled so that its frequency follows the distribution observed in the original data. To keep rows logically consistent, the algorithm samples from conditional probability distributions, which preserves the relationships and dependencies between attributes. The approach is designed to yield realistic, coherent synthetic data that remains statistically consistent with the original 100 rows.
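The description does not state the formulas used. As a point of reference only, the standard definition of a conditional distribution estimated from counts is given below. The symbols are assumptions introduced here: n(a, b) is the number of original rows in which attribute A takes value a and attribute B takes value b, n(a) is the number of rows with A = a, and N is the total number of rows (100 for the original dataset).

```latex
\hat{P}(A = a) = \frac{n(a)}{N},
\qquad
\hat{P}(B = b \mid A = a) = \frac{\hat{P}(A = a,\, B = b)}{\hat{P}(A = a)} = \frac{n(a, b)}{n(a)}
```

Sampling A from the first estimate and then B from the second can only produce value pairs (a, b) that occur in the original data, which is one way the dependency between two attributes can be preserved.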
Steps to reproduce
Steps of the algorithm:
1. Read the original CSV file (100 rows).
2. Extract the categorical distribution of each attribute.
3. Identify pairwise dependencies (e.g., Quality vs. Price, Finances vs. Risk Flag).
4. Build a conditional probability model from joint probability tables P(A, B), so that dependent attributes are sampled together rather than independently.
5. Generate 10,000 synthetic rows that preserve the category distributions and dependencies.
6. Save the synthetic dataset to a new CSV file.
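The steps above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the authors' script: the function names (`read_rows`, `fit_tables`, `generate_rows`, `write_rows`) and the `parents` mapping, which declares which attribute each attribute depends on, are hypothetical. Only the column names Quality and Price are taken from the example in step 3.

```python
import csv
import random
from collections import Counter, defaultdict


def read_rows(path):
    """Step 1: read the original CSV into a list of dicts (one per row)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def fit_tables(rows, parents):
    """Steps 2-4: count marginal and pairwise (parent, child) frequencies.

    `parents` is an assumed input: it maps each column to the column it
    depends on, or to None if it is sampled from its own distribution.
    """
    marginals = {}
    conditionals = {}
    for col, parent in parents.items():
        marginals[col] = Counter(r[col] for r in rows)
        if parent is not None:
            table = defaultdict(Counter)  # table[parent_value][child_value] = count
            for r in rows:
                table[r[parent]][r[col]] += 1
            conditionals[col] = table
    return marginals, conditionals


def draw(counter, rng):
    """Sample one category with probability proportional to its count."""
    values = list(counter)
    weights = list(counter.values())
    return rng.choices(values, weights=weights, k=1)[0]


def generate_rows(rows, parents, n, seed=0):
    """Step 5: generate n synthetic rows.

    Columns in `parents` must be listed so that a parent comes before
    any column that depends on it.
    """
    rng = random.Random(seed)
    marginals, conditionals = fit_tables(rows, parents)
    synthetic = []
    for _ in range(n):
        new_row = {}
        for col, parent in parents.items():
            if parent is None:
                new_row[col] = draw(marginals[col], rng)
            else:
                # Condition on the parent value already sampled for this row.
                new_row[col] = draw(conditionals[col][new_row[parent]], rng)
        synthetic.append(new_row)
    return synthetic


def write_rows(path, rows):
    """Step 6: save the synthetic rows to a new CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```

With a dependency declared as `parents = {"Quality": None, "Price": "Quality"}`, a call such as `generate_rows(read_rows(path), parents, 10000)` would produce the augmented rows, where `path` stands for wherever the original CSV is stored. Because each child value is drawn from counts conditioned on an already sampled parent value, the sketch never emits a parent and child combination that is absent from the original data.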