Transformed Customer Shopping Dataset with Advanced Feature Engineering and Anonymization
Description
This dataset represents a thoroughly transformed and enriched version of a publicly available customer shopping dataset. It has undergone comprehensive processing to ensure it is clean, privacy-compliant, and enriched with new features, making it highly suitable for advanced analytics, machine learning, and business research applications. The transformation process focused on creating a high-quality dataset that supports robust customer behavior analysis, segmentation, and anomaly detection, while maintaining strict privacy through anonymization and data validation. ➡ Data Cleaning and Preprocessing : Duplicates were removed. Missing numerical values (Age, Purchase Amount, Review Rating) were filled with medians; missing categorical values labeled “Unknown.” Text data were cleaned and standardized, and numeric fields were clipped to valid ranges. ➡ Feature Engineering : New informative variables were engineered to augment the dataset’s analytical power. These include: • Avg_Amount_Per_Purchase: Average purchase amount calculated by dividing total purchase value by the number of previous purchases, capturing spending behavior per transaction. • Age_Group: Categorical age segmentation into meaningful bins such as Teen, Young Adult, Adult, Senior, and Elder. • Purchase_Frequency_Score: Quantitative mapping of purchase frequency to annualized values to facilitate numerical analysis. • Discount_Impact: Monetary quantification of discount application effects on purchases. • Processing_Date: Timestamp indicating the dataset transformation date for provenance tracking. ➡ Data Filtering : Rows with ages outside 0–100 were removed. Only core categories (Clothing, Footwear, Outerwear, Accessories) and the top 25% of high-value customers by purchase amount were retained for focused analysis. ➡ Data Transformation : Key numeric features were standardized, and log transformations were applied to skewed data to improve model performance. ➡ Advanced Features : Created a category-wise average purchase and a loyalty score combining purchase frequency and volume. ➡ Segmentation & Anomaly Detection : Used KMeans to cluster customers into four groups and Isolation Forest to flag anomalies. ➡ Text Processing : Cleaned text fields and added a binary indicator for clothing items. ➡ Privacy : Hashed Customer ID and removed sensitive columns like Location to ensure privacy. ➡ Validation : Automated checks for data integrity, including negative values and valid ranges. This transformed dataset supports a wide range of research and practical applications, including customer segmentation, purchase behavior modeling, marketing strategy development, fraud detection, and machine learning education. It serves as a reliable and privacy-aware resource for academics, data scientists, and business analysts.