Child Mortality Risk Factor Dataset

Published: 19 May 2025 | Version 1 | DOI: 10.17632/cfwnrgd9jm.1
Contributor:
Neelamcadhab Padhy

Description

A dataset containing sociodemographic, maternal, and child health variables for predicting child mortality, framed as a binary classification task (0 = survived, 1 = died).

Key variables

Demographics: mother’s age, father’s age, residence (urban/rural).
Maternal/child health: birth order, birth weight (kg), antenatal visits, institutional delivery (yes/no), vaccination status (yes/no), low birth interval (yes/no).
Socioeconomic factors: mother’s education level (no/primary/secondary/higher), wealth index (poor/middle/rich), access to water (yes/no), toilet facility (yes/no).
Target variable: mortality (0 or 1).

Variable descriptions

Variable                Type         Description
mother_age              Numerical    Age of the mother in years.
father_age              Numerical    Age of the father in years.
birth_order             Numerical    Birth order of the child (e.g., 1 = firstborn).
birth_weight            Numerical    Birth weight in kilograms (kg).
mother_education        Categorical  Education level: No/Primary/Secondary/Higher.
wealth_index            Categorical  Socioeconomic status: Poor/Middle/Rich.
residence               Categorical  Urban or rural residence.
antenatal_visits        Numerical    Number of antenatal care visits during pregnancy.
institutional_delivery  Binary       Whether delivery occurred in a healthcare facility (Yes/No).
vaccination_status      Binary       Whether the child received vaccinations (Yes/No).
access_to_water         Binary       Access to clean water (Yes/No).
toilet_facility         Binary       Access to improved sanitation (Yes/No).
low_birth_interval      Binary       Short interval between pregnancies (Yes/No).
mortality               Binary       Target variable: 0 = survived, 1 = died.

Purpose: predict child mortality risk using machine learning (e.g., logistic regression, decision trees, neural networks).

Keywords: child mortality, predictive modeling, socioeconomic factors, maternal health, machine learning
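As a rough illustration of how the columns described above might be prepared for a classifier, the following sketch encodes the binary and categorical variables as integers with pandas. The three rows are invented for the example and are not taken from the dataset. The exact string labels ("Yes"/"No", "Urban"/"Rural", and so on) and the ordinal ordering of education and wealth are assumptions based on the variable descriptions; the published file may use different spellings or encodings.

```python
import pandas as pd

# Hypothetical rows following the documented schema (values are made up).
rows = [
    {"mother_age": 24, "father_age": 29, "birth_order": 1, "birth_weight": 3.1,
     "mother_education": "Secondary", "wealth_index": "Middle", "residence": "Urban",
     "antenatal_visits": 4, "institutional_delivery": "Yes", "vaccination_status": "Yes",
     "access_to_water": "Yes", "toilet_facility": "Yes", "low_birth_interval": "No",
     "mortality": 0},
    {"mother_age": 19, "father_age": 25, "birth_order": 3, "birth_weight": 2.0,
     "mother_education": "No", "wealth_index": "Poor", "residence": "Rural",
     "antenatal_visits": 0, "institutional_delivery": "No", "vaccination_status": "No",
     "access_to_water": "No", "toilet_facility": "No", "low_birth_interval": "Yes",
     "mortality": 1},
    {"mother_age": 31, "father_age": 35, "birth_order": 2, "birth_weight": 3.4,
     "mother_education": "Higher", "wealth_index": "Rich", "residence": "Urban",
     "antenatal_visits": 6, "institutional_delivery": "Yes", "vaccination_status": "Yes",
     "access_to_water": "Yes", "toilet_facility": "Yes", "low_birth_interval": "No",
     "mortality": 0},
]

# Assumed label spellings and orderings; adjust to match the actual file.
BINARY_COLS = ["institutional_delivery", "vaccination_status", "access_to_water",
               "toilet_facility", "low_birth_interval"]
EDUCATION_ORDER = ["No", "Primary", "Secondary", "Higher"]
WEALTH_ORDER = ["Poor", "Middle", "Rich"]


def encode(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with Yes/No and categorical columns mapped to integers."""
    out = df.copy()
    for col in BINARY_COLS:
        out[col] = out[col].map({"No": 0, "Yes": 1})
    out["mother_education"] = out["mother_education"].map(
        {level: i for i, level in enumerate(EDUCATION_ORDER)})
    out["wealth_index"] = out["wealth_index"].map(
        {level: i for i, level in enumerate(WEALTH_ORDER)})
    out["residence"] = out["residence"].map({"Rural": 0, "Urban": 1})
    return out


encoded = encode(pd.DataFrame(rows))
X = encoded.drop(columns="mortality")  # 13 feature columns
y = encoded["mortality"]               # binary target
```

Ordinal integer codes are only one option; one-hot encoding may be preferable for models that should not assume evenly spaced education or wealth levels.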

Files

Steps to reproduce

This synthetic dataset was generated to simulate child mortality risk factors while avoiding the privacy concerns associated with real-world health data. It mimics patterns observed in public health surveys (e.g., Demographic and Health Surveys) and is intended for benchmarking machine learning models.

Synthetic data generation methodology

A. Tools and software

Synthetic data libraries: Python: Synthetic Data Vault (SDV), CTGAN, Faker, or scikit-learn (for sampling). R: synthpop, simstudy.
Example: "The dataset was generated using the Synthetic Data Vault’s CTGAN model, trained on summary statistics from real-world child mortality studies."
Custom scripts: "Python scripts with pandas and numpy were used to manually define distributions for variables like birth_weight and mother_age."

Validation of synthetic data

Statistical comparisons: distributions (mean, variance) of the synthetic data were compared with real-world benchmarks.
Domain expert review: pediatricians or public health experts reviewed variable relationships (e.g., the link between antenatal_visits and mortality).
Utility testing: baseline models (e.g., logistic regression) were trained on the synthetic data, and their performance was compared with that of models trained on real data.
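To make the custom-script approach and the statistical comparison step more concrete, here is a minimal sketch using numpy and pandas. It is not the script used to produce this dataset. Every distribution, parameter, logistic coefficient, and benchmark value below is a placeholder chosen for illustration, and the helper `compare_to_benchmark` is a hypothetical name.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
n = 5000

# Manually defined distributions (all parameters are illustrative assumptions).
mother_age = rng.normal(27, 6, n).clip(15, 49).round()
birth_weight = rng.normal(3.0, 0.5, n).clip(1.0, 5.0)
antenatal_visits = rng.poisson(4, n)
institutional_delivery = rng.binomial(1, 0.7, n)

# Assumed logistic relationship: lower birth weight, fewer antenatal visits,
# and home delivery each raise the simulated probability of death.
logit = (-2.5
         - 0.8 * (birth_weight - 3.0)
         - 0.15 * antenatal_visits
         - 0.5 * institutional_delivery)
p_death = 1.0 / (1.0 + np.exp(-logit))
mortality = rng.binomial(1, p_death)

synthetic = pd.DataFrame({
    "mother_age": mother_age,
    "birth_weight": birth_weight,
    "antenatal_visits": antenatal_visits,
    "institutional_delivery": institutional_delivery,
    "mortality": mortality,
})

# Placeholder benchmark statistics; real values would come from survey reports.
benchmark = {
    "mother_age": {"mean": 27.0, "var": 36.0},
    "birth_weight": {"mean": 3.0, "var": 0.25},
}


def compare_to_benchmark(df: pd.DataFrame, bench: dict) -> pd.DataFrame:
    """Tabulate synthetic vs. benchmark mean and variance for each variable."""
    records = []
    for col, stats in bench.items():
        records.append({
            "variable": col,
            "synthetic_mean": df[col].mean(),
            "benchmark_mean": stats["mean"],
            "synthetic_var": df[col].var(),
            "benchmark_var": stats["var"],
        })
    return pd.DataFrame(records).set_index("variable")


report = compare_to_benchmark(synthetic, benchmark)
print(report.round(3))
```

Comparing only means and variances is a coarse check. It does not show whether the relationships between variables, such as the one between antenatal_visits and mortality, resemble those in real data, which is why the expert review and utility testing steps listed above are also needed.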

Categories

Computer Science

Licence