Balanced Prostate Cancer Clinical Dataset with Hematological and Diagnostic Indicators for Risk Classification

Published: 18 March 2026 | Version 1 | DOI: 10.17632/hrx8yms94t.1
Contributors:

Description

This dataset is a clinically structured, balanced collection of prostate cancer patient records intended for data-driven analysis and machine learning applications in oncology research. It is derived from a publicly available prostate cancer dataset and has been expanded and preprocessed to make it easier to use for classification and predictive modeling tasks.

The final dataset consists of 1,752 patient records. It includes 8 independent variables and 1 binary target variable representing patient risk classification. To improve the reliability and robustness of analytical models, the dataset has been balanced to a class distribution of 55% (class 0) and 45% (class 1). Class 0 represents lower-risk or non-critical cases, while class 1 represents higher-risk or clinically significant cases.

The features are clinical indicators widely used in oncology practice, such as tumor volume, prostate-specific antigen (PSA) level, and Gleason score. Several features are log-transformed to maintain statistical consistency and improve modeling performance. The variables are:
- lcavol: log-transformed cancer volume, indicating tumor burden
- lweight: prostate size
- age: the patient’s age
- lbph: benign prostate enlargement
- svi: seminal vesicle invasion, showing whether cancer has spread beyond the prostate
- lcp: capsular penetration, indicating tumor extension
- gleason: Gleason score, a measure of cancer aggressiveness
- pgg45: the proportion of higher-grade tumor cells
- lpsa: prostate-specific antigen level, an important cancer biomarker
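Because several features are stored on a log scale, it can help to convert them back before interpreting values clinically. The following is a minimal sketch, not part of the dataset itself. It assumes that every column whose name starts with "l" (lcavol, lweight, lbph, lcp, lpsa) is a natural logarithm, which the description states explicitly only for lcavol. The helper name and the example record are hypothetical.

```python
import math

# Assumption: columns with an "l" prefix hold natural-log values.
LOG_COLUMNS = ("lcavol", "lweight", "lbph", "lcp", "lpsa")


def to_original_scale(record):
    """Return a copy of `record` with log-scale columns exponentiated.

    Each converted column is renamed by dropping the leading "l"
    (for example, "lpsa" becomes "psa"). Other columns are unchanged.
    """
    converted = dict(record)
    for name in LOG_COLUMNS:
        if name in converted:
            converted[name[1:]] = math.exp(converted.pop(name))
    return converted


# Hypothetical record; the values are for illustration only.
example = {"lcavol": 1.0, "age": 64, "svi": 0, "lpsa": 2.0}
restored = to_original_scale(example)
print(restored)
```

Since the records were augmented with Gaussian noise, back-transformed values should be read as approximate rather than as exact measurements.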
The target variable Target is binary, where 0 indicates low risk and 1 indicates high risk.

Data preprocessing steps include:
- Handling of class imbalance through resampling techniques
- Data augmentation using controlled Gaussian noise to simulate real-world variability
- Normalization-friendly feature structure
- Random shuffling to eliminate ordering bias

This dataset is suitable for:
- Machine learning classification tasks
- Clinical risk prediction modeling
- Explainable AI (XAI) research
- Oncology decision support systems

Additionally, the dataset can support interdisciplinary research in:
- clinical data science
- cancer informatics
- healthcare analytics

All data are anonymized and contain no personally identifiable information.
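The preprocessing steps named above (resampling, Gaussian-noise augmentation, random shuffling) can be illustrated with a small sketch. This is not the procedure used to build the dataset, whose exact parameters are not given here. It assumes simple random oversampling of the minority class, a hypothetical noise standard deviation of 0.05, and full 50/50 equalization, which differs from the 55/45 split reported for the published data. The function names and toy rows are illustrative only.

```python
import random
from collections import Counter


def augment_and_balance(rows, target_key="Target", noise_sd=0.05, seed=0):
    """Oversample smaller classes with noisy copies, then shuffle.

    Assumption: only float-valued features receive Gaussian noise, so
    integer or binary columns (such as svi) keep valid values.
    """
    rng = random.Random(seed)

    # Group the rows by class label.
    by_class = {}
    for row in rows:
        by_class.setdefault(row[target_key], []).append(row)
    largest = max(len(group) for group in by_class.values())

    out = [dict(row) for row in rows]
    for group in by_class.values():
        # Add jittered copies until this class matches the largest one.
        for _ in range(largest - len(group)):
            base = rng.choice(group)
            copy = dict(base)
            for key, value in base.items():
                if key != target_key and isinstance(value, float):
                    copy[key] = value + rng.gauss(0.0, noise_sd)
            out.append(copy)

    # Shuffle so that row order carries no information.
    rng.shuffle(out)
    return out


def class_shares(rows, target_key="Target"):
    """Return the fraction of rows belonging to each class."""
    counts = Counter(row[target_key] for row in rows)
    return {label: n / len(rows) for label, n in counts.items()}


# Toy, imbalanced input: six class-0 rows and two class-1 rows.
rows = [{"lcavol": 0.5 + 0.1 * i, "svi": 0, "Target": 0} for i in range(6)]
rows += [{"lcavol": 2.0 + 0.1 * i, "svi": 1, "Target": 1} for i in range(2)]

balanced = augment_and_balance(rows)
shares = class_shares(balanced)
print(len(balanced), shares)
```

The same `class_shares` helper can be applied to the published records to confirm the reported class distribution before training a model.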

Files

Categories

Machine Learning, Clinical Oncology, Cancer Diagnosis, Healthcare Research

Licence