Chapter 11: Handling Imbalanced Data Sets

Published: 30 October 2023| Version 1 | DOI: 10.17632/frndm7933v.1


The dataset consists of 1,000 entries with three columns: amount, ratio, and fraud. The amount column represents the transaction amount, the ratio column represents some internal metric associated with the transaction, and the fraud column is a binary indicator (0 for legitimate, 1 for fraudulent).


Steps to reproduce

Data Collection: Begin by collecting transaction data. This data should include transaction amounts, a metric that can be represented as a ratio, and a binary indicator for whether the transaction was fraudulent or legitimate. Data Structuring: Organize the data into a structured format with three columns: amount, ratio, and fraud. Ensure that the amount column contains the transaction amounts. The ratio column should contain the internal metric associated with each transaction. The fraud column should be a binary indicator, where 0 represents a legitimate transaction and 1 represents a fraudulent transaction. Data Cleaning: Check for any missing or null values in the dataset and handle them appropriately (either by removing or imputing them). Ensure that the data types for each column are consistent (e.g., float for amount and ratio, integer for fraud). Data Sampling: If you have a larger dataset, you might want to sample 1,000 entries to match the described dataset. Ensure that the sample is representative of the overall data distribution. Data Storage: Convert the structured data into a pandas DataFrame. Ensure that the DataFrame has a memory usage close to 23.6 KB. This can be checked using the info() method on the DataFrame. Data Export: Save the DataFrame to a CSV file named "SMOTE.csv" for easy access and sharing. Use the to_csv() method from pandas to achieve this. Verification: Load the dataset back from the CSV file using pandas' read_csv() method. Display the first 10 and last 10 rows to verify that the data matches the described dataset. By following these steps, you should be able to reproduce a dataset similar to the one described.


Credit Card Fraud