CBC indices data set
Description
Dataset composition

1. n = 5,778, Non-SMOTE-NC (raw): original tabular hematology dataset (routine analyzer parameters) with four-class labels (Class 0–3; target emphasis on Class 2 = Hb E trait ± α-thalassemia). No resampling, no scaling. Intended for the non-SMOTE pipeline and baseline descriptive statistics.
2. n = 5,778, Non-SMOTE-NC, Z-score: same records as (1), with all model features standardized to Z-scores. No class rebalancing. Used for ReliefF ranking, model fitting, and internal testing in the non-SMOTE scenario.
3. n = 20,000, SMOTE-NC (raw): class-rebalanced cohort generated from the 5,778-record source using SMOTE-NC to upsample minority classes while preserving categorical structure. Values remain in the original (unscaled) measurement units. Used for training/validation in the SMOTE-NC scenario.
4. n = 20,000, SMOTE-NC, Z-score: Z-score-standardized version of (3) for model development in the SMOTE-NC pipeline (feature selection, tuning, and internal testing).
5. n = 625, external data, Z-score: independent cohort prepared for model inference with standardized (Z-score) features. Used exclusively for external validation of the final models.
6. n = 625, external data (raw): raw (unscaled) version of the same independent cohort as (5). Retained for auditing, sensitivity checks, and any site-specific recalibration.

Notes:
- All tabular sets use identical label definitions (Class 0 = normal/non-clinically significant, Class 1 = normal Hb typing ± possible α-thal, Class 2 = Hb E trait ± α-thal, Class 3 = other thalassemic patterns).
- Z-score versions provide standardized features for ReliefF selection and model input; raw versions support QC and re-scaling if needed.
- SMOTE-NC sets are for training/validation only; performance is reported on held-out internal tests and on the external cohort (n = 625).
- A minimal loading sketch follows these notes.
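For orientation, a minimal loading sketch is shown below. The CSV file name and the label column name ("class") are illustrative assumptions, not part of the file list; substitute the actual exports provided under Files.

    import pandas as pd

    # Assumed file name and label column; replace with the actual export from Files.
    df = pd.read_csv("cbc_indices_non_smote_raw.csv")

    # Class 0 = normal, 1 = normal Hb typing +/- alpha-thal,
    # 2 = Hb E trait +/- alpha-thal (target class), 3 = other thalassemic patterns.
    print(df["class"].value_counts().sort_index())   # expect 5,778 records in total
    print(df.describe())                             # baseline descriptive statistics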
Files
Steps to reproduce
1. Environment: Python ≥3.10. Install pandas, numpy, scikit-learn (≥1.3), imbalanced-learn (≥0.11; SMOTE-NC), xgboost (≥1.7), and shap (≥0.41). Optional: Orange3 (≥3.36). Fix seeds with random_state=42.
2. Data & labels: tabular hematology dataset with four classes (0–3; target emphasis on Class 2 = Hb E trait ± α-thalassemia). Keep an external cohort (n = 625) unseen during training.
3. Pipelines: run two parallel pipelines. Non-SMOTE uses the original dataset only; SMOTE-NC applies SMOTE-NC to the training portion later (see Step 7).
4. Train/test split: for each pipeline, perform a stratified 70/30 split (fixed class proportions) and save the indices. If reproducing the reported counts, reuse the saved indices; otherwise, any stratified 70/30 split with seed 42 is acceptable.
5. Scaling (no leakage): fit the Z-score scaler on the training split only (mean/SD), then transform validation/test and the external cohort using the same parameters.
6. Feature selection (ReliefF): run ReliefF on the training split; select the top 6 features per pipeline. Persist the selected feature list(s).
7. Class rebalancing (SMOTE-NC pipeline only): on the training data only, run SMOTE-NC to synthesize a balanced set with total n = 20,000, keeping categorical handling where applicable. Do not apply SMOTE-NC to validation/test or external sets. (A code sketch of Steps 4–7 follows this list.)
8. Model set & tuning: train 9 models: Decision Tree, Random Forest, CatBoost, SVM, MLP (neural network), Logistic Regression, Naïve Bayes, Gradient Boosting (scikit-learn), and XGBoost. Use stratified 5-fold CV on the training portion with predefined hyperparameter grids. Primary selection metric: macro/weighted AUC; also track CA, F1, Precision, Recall, MCC, Specificity, and LogLoss.
9. Final model: choose XGBoost as the best performer in both pipelines. Typical settings: objective='multi:softprob', num_class=4, with tuned n_estimators, learning_rate, max_depth, subsample, colsample_bytree, and reg_lambda. Retrain on the full training portion with the best parameters.
10. Internal evaluation: evaluate on the held-out 30% test set for each pipeline. Report overall AUC/CA and Class 2 metrics: Sensitivity, Specificity, PPV, NPV, PLR, NLR, F1, Precision, Recall, MCC, and LogLoss. Compute 95% CIs (e.g., Wilson intervals for proportions; bootstrap for likelihood ratios). (A tuning/evaluation sketch follows this list.)
11. Explainability: compute SHAP values for XGBoost on the test set using shap.TreeExplainer. Produce global importance (mean |SHAP|) and beeswarm plots; confirm Z-MCV, Z-MCH, and Z-RBC are among the top contributors.
12. External validation: apply the same scaler, the same 6 features, and the trained XGBoost to the external cohort (n = 625). Report the same Class 2 metrics and CIs, plus overall accuracy. Keep the decision rule fixed (argmax, or a threshold pre-specified from internal validation). (A SHAP/external-validation sketch follows this list.)
13. Artifacts & reproducibility: save the scaler parameters, ReliefF feature list(s), split indices, SMOTE-NC settings, tuned hyperparameters, final model weights, ROC/confusion-matrix figures, SHAP outputs, and a YAML/JSON config capturing versions, seeds, and paths.
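The first sketch below covers Steps 4–7 (stratified split, leakage-free Z-scoring, ReliefF ranking, SMOTE-NC rebalancing). The file name, label column, categorical column ("sex"), and the equal per-class SMOTE-NC targets (4 × 5,000 = 20,000) are illustrative assumptions; ReliefF is shown via the skrebate package, which is not in the environment list above (Orange3's ReliefF is an alternative).

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from imblearn.over_sampling import SMOTENC
    from skrebate import ReliefF              # one possible ReliefF implementation

    SEED = 42
    df = pd.read_csv("cbc_indices_non_smote_raw.csv")   # assumed file name
    y = df["class"].to_numpy()                           # assumed label column
    X = df.drop(columns=["class"])

    # Step 4: stratified 70/30 split; save the indices for reproducibility
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=SEED)
    np.save("train_idx.npy", X_tr.index.to_numpy())
    np.save("test_idx.npy", X_te.index.to_numpy())

    # Step 5: fit the Z-score scaler on the training split only (no leakage)
    scaler = StandardScaler().fit(X_tr)
    X_tr_z = pd.DataFrame(scaler.transform(X_tr), columns=X.columns)
    X_te_z = pd.DataFrame(scaler.transform(X_te), columns=X.columns)

    # Step 6: ReliefF ranking on the training split; keep the top 6 features
    relief = ReliefF(n_features_to_select=6, n_neighbors=100, n_jobs=-1)
    relief.fit(X_tr_z.to_numpy(), y_tr)
    top6 = X.columns[relief.top_features_[:6]].tolist()
    print("Selected features:", top6)

    # Step 7 (SMOTE-NC pipeline only): rebalance the TRAINING data to n = 20,000.
    # The categorical column and the equal per-class targets are assumptions.
    cat_idx = [X_tr.columns.get_loc("sex")]              # assumed categorical column
    smote = SMOTENC(categorical_features=cat_idx,
                    sampling_strategy={0: 5000, 1: 5000, 2: 5000, 3: 5000},
                    random_state=SEED)
    X_tr_bal, y_tr_bal = smote.fit_resample(X_tr, y_tr)  # raw units, cf. dataset (3) above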
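The second sketch covers the XGBoost leg of Steps 8–10 (stratified 5-fold tuning, refit on the training portion, held-out evaluation); the other eight models follow the same GridSearchCV pattern. The grid shown is a small illustrative subset of the hyperparameters listed above, not the exact grid behind the reported results; SEED, X_tr_z, X_te_z, y_tr, y_te, and top6 carry over from the previous sketch.

    from sklearn.model_selection import GridSearchCV, StratifiedKFold
    from sklearn.metrics import classification_report, roc_auc_score, log_loss, matthews_corrcoef
    from xgboost import XGBClassifier

    # Steps 8-9: stratified 5-fold tuning of XGBoost
    # (num_class=4 is inferred from the labels by the sklearn wrapper)
    xgb = XGBClassifier(objective="multi:softprob", eval_metric="mlogloss",
                        random_state=SEED)
    grid = {                                   # illustrative subset of the search space
        "n_estimators": [200, 400],
        "learning_rate": [0.05, 0.1],
        "max_depth": [4, 6],
        "subsample": [0.8, 1.0],
        "colsample_bytree": [0.8, 1.0],
        "reg_lambda": [1.0, 5.0],
    }
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
    search = GridSearchCV(xgb, grid, scoring="roc_auc_ovr_weighted", cv=cv, n_jobs=-1)
    search.fit(X_tr_z[top6], y_tr)             # non-SMOTE path; use the rebalanced set for SMOTE-NC
    best = search.best_estimator_              # refit on the full training portion by default

    # Step 10: held-out 30% internal test set
    proba = best.predict_proba(X_te_z[top6])
    pred = proba.argmax(axis=1)
    print("Weighted AUC:", roc_auc_score(y_te, proba, multi_class="ovr", average="weighted"))
    print("LogLoss:", log_loss(y_te, proba))
    print("MCC:", matthews_corrcoef(y_te, pred))
    print(classification_report(y_te, pred, digits=3))   # per-class precision/recall/F1 (Class 2 of interest)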
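The third sketch covers Steps 11–12 (SHAP on the internal test set, then external validation with the frozen scaler, features, and model). The external file name and column layout are assumptions; best, scaler, X, top6, X_te_z carry over from the previous sketches, and the Class 2 summary shown is a subset of the metrics listed in Step 12.

    import shap
    import numpy as np
    import pandas as pd

    # Step 11: SHAP explainability on the internal test set
    explainer = shap.TreeExplainer(best)
    shap_values = explainer.shap_values(X_te_z[top6])
    # Multi-class layout differs by shap release: a list of per-class arrays in
    # older versions, a 3-D array (samples x features x classes) in newer ones.
    sv_c2 = shap_values[2] if isinstance(shap_values, list) else shap_values[:, :, 2]
    shap.summary_plot(sv_c2, X_te_z[top6])          # beeswarm-style plot for Class 2

    # Step 12: external validation (n = 625), same scaler, same 6 features, fixed argmax rule
    ext = pd.read_csv("external_cohort_raw.csv")    # assumed file name
    y_ext = ext["class"].to_numpy()
    X_ext = pd.DataFrame(scaler.transform(ext.drop(columns=["class"])), columns=X.columns)
    proba_ext = best.predict_proba(X_ext[top6])
    pred_ext = proba_ext.argmax(axis=1)

    # Class 2 sensitivity/specificity from binary confusion counts
    is_c2, pred_c2 = (y_ext == 2), (pred_ext == 2)
    tp = np.sum(is_c2 & pred_c2); fn = np.sum(is_c2 & ~pred_c2)
    tn = np.sum(~is_c2 & ~pred_c2); fp = np.sum(~is_c2 & pred_c2)
    print("Class 2 sensitivity:", tp / (tp + fn))
    print("Class 2 specificity:", tn / (tn + fp))
    print("Overall accuracy:", (pred_ext == y_ext).mean())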
Institutions
Categories
Funders
- Thailand Science Research and Innovation (TSRI) Fundamental Fund, Fiscal year 2025. Grant ID: None