Synthetic Synthea patient datasets for lung cancer risk prediction machine learning
Description
These synthetic patient datasets were created for machine learning (ML) study of lung cancer risk prediction and simulation study of learning health systems. 1. In subfolder "unconverted": Five populations of 30K patients were generated by the Synthea patient generator. About 1100 lung cancer patients and 3000 control patients (without lung cancer) were selected and their electronic health records (EHR) were processed to data table files ready for machine learning using common algorithms like XGBoost. 2. In root directory: The five 30K-patient datasets were combined sequentially to form 5 different size datasets, from 30K to 150K patients. The new datasets were resampled to keep all lung cancer patients plus about 3x control patients. The ML-ready table files also had the continuous numeric values converted to categorical values. Because Synthea patients are closely resemble real patients, the Synthea patient data can be used to develop and test ML algorithms and pipelines, and train researchers. Unlike real patient data, these Synthea datasets can be shared with collaborators anywhere without privacy concerns. The first LHS simulation study titled "Simulation of a machine learning enabled learning health system for risk prediction using synthetic patient data" has been published in Nature Scientific Reports (see https://www.nature.com/articles/s41598-022-23011-4).