Predicting Diabetes From Tracking Medical Records
Description
This dataset contains 1,168 medical records designed for predicting the onset of diabetes based on routine diagnostic measurements. Each record includes eight clinical features commonly collected during standard health screenings, along with a binary outcome variable indicating whether the patient was diagnosed with diabetes.

Features:
- Pregnancies: Number of times the patient has been pregnant
- Glucose: Plasma glucose concentration from a 2-hour oral glucose tolerance test (mg/dL)
- BloodPressure: Diastolic blood pressure (mm Hg)
- SkinThickness: Triceps skinfold thickness (mm)
- Insulin: 2-hour serum insulin level (μU/mL)
- BMI: Body mass index calculated as weight in kg / (height in m)²
- DiabetesPedigreeFunction: A composite score reflecting the likelihood of diabetes based on family history
- Age: Age of the patient in years
- Outcome: Binary target variable (1 = diabetes diagnosed, 0 = no diabetes)

The dataset comprises 771 negative cases and 397 positive cases, representing a class imbalance ratio of approximately 66:34. Patient ages range from 21 to 81 years.

Some feature columns contain zero values (e.g., Glucose, BloodPressure, SkinThickness, Insulin, BMI) that likely represent missing or unrecorded measurements rather than true biological zeros; researchers should account for this during preprocessing.

This dataset is well suited for supervised binary classification tasks and can be used to benchmark machine learning models such as logistic regression, decision trees, random forests, gradient boosting, support vector machines, and neural networks. It is also appropriate for educational purposes in data science and healthcare analytics curricula, including exercises in exploratory data analysis, feature engineering, handling missing values, class imbalance techniques, and model evaluation.

The data was prepared and exported from VertexMD, a local-first electronic health records application designed for personal medical record tracking and interoperability research.
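The zero-value caveat in the description can be handled in several ways. The following is a minimal sketch of one of them, using only the Python standard library: it treats zeros (and blanks) in the five named columns as missing and leaves legitimate zeros, such as Pregnancies, untouched. The helper names and the two sample rows are made up for illustration and are not part of the dataset.

```python
import csv
import io

# Columns where a zero most likely means "not recorded", per the description.
ZERO_AS_MISSING = ("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")


def mark_zeros_missing(rows, columns=ZERO_AS_MISSING):
    """Return copies of the rows with zeros or blanks in `columns` set to None."""
    cleaned = []
    for row in rows:
        new_row = dict(row)  # copy so the original rows are not modified
        for col in columns:
            value = new_row[col]
            if value is None or value == "" or float(value) == 0.0:
                new_row[col] = None
        cleaned.append(new_row)
    return cleaned


def missing_counts(rows, columns=ZERO_AS_MISSING):
    """Count how many values are None in each of the given columns."""
    return {col: sum(1 for r in rows if r[col] is None) for col in columns}


# Hypothetical sample in the dataset's column layout (not real records).
SAMPLE_CSV = """Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
2,120,70,0,0,28.5,0.35,34,0
0,0,64,22,95,0,0.52,45,1
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
cleaned = mark_zeros_missing(rows)
counts = missing_counts(cleaned)
print(counts)
```

Marking the values as missing, rather than dropping rows or imputing immediately, keeps the choice of imputation strategy (median, model-based, and so on) open for the analysis that follows.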
Files
Steps to reproduce
- Records were filtered to include adult patients (age 21+) with complete or partial entries for the eight diagnostic features: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, and Age.
- The binary Outcome variable was assigned based on documented diabetes diagnosis status (1 = diagnosed, 0 = not diagnosed).
- Data was exported from the application as a CSV file with 1,168 rows and 9 columns.
- No additional cleaning or imputation was performed prior to publication; zero values in certain columns may indicate missing data and should be handled by the end user during analysis.
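A downloaded copy can be checked against the figures stated in the steps above. The sketch below summarizes a CSV export using only the Python standard library. It assumes the header uses the nine column names in the order listed in the description and that Age is stored as an integer; the function name, the file name in the comment, and the three sample rows are hypothetical.

```python
import csv
import io
from collections import Counter

# Assumed header: the eight features followed by the target column.
EXPECTED_COLUMNS = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin",
    "BMI", "DiabetesPedigreeFunction", "Age", "Outcome",
]


def summarize_export(handle):
    """Return (header, row count, Outcome value counts, minimum age) for a CSV."""
    reader = csv.DictReader(handle)
    outcome_counts = Counter()
    n_rows = 0
    min_age = None
    for row in reader:
        n_rows += 1
        outcome_counts[row["Outcome"]] += 1
        age = int(row["Age"])
        min_age = age if min_age is None else min(min_age, age)
    return reader.fieldnames, n_rows, dict(outcome_counts), min_age


# Hypothetical three-row sample standing in for the real export.
SAMPLE_CSV = """Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
2,120,70,0,0,28.5,0.35,34,0
0,0,64,22,95,0,0.52,45,1
1,99,72,18,80,24.1,0.20,21,0
"""

header, n_rows, outcome_counts, min_age = summarize_export(io.StringIO(SAMPLE_CSV))
print(header == EXPECTED_COLUMNS, n_rows, outcome_counts, min_age)

# For the real file (name is a placeholder), the steps above imply 1,168 rows,
# 771 rows with Outcome 0, 397 rows with Outcome 1, and a minimum age of 21:
# with open("diabetes_records.csv", newline="") as f:
#     print(summarize_export(f))
```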