Classification of Heart Failure Using Machine Learning: A Comparative Study
Description
This study shows that machine learning algorithms can predict heart failure effectively and identifies the models that classify cases most accurately. The Kaggle “Heart Failure” dataset, with 918 instances and 12 key features, was preprocessed to remove outliers; it contains 508 cases with heart disease and 410 without. Five models were evaluated. Random forest achieved the highest accuracy (92%) and was the most effective at classifying cases. Logistic regression and the multilayer perceptron were also accurate (89%), while the decision tree and k-nearest neighbors performed less well, with k-nearest neighbors being the least suited to this data. F1 scores confirmed random forest as the best model after preprocessing and hyperparameter tuning. The data analysis showed that age, blood pressure, and cholesterol correlate with disease risk, suggesting that these models may help prioritize at-risk patients and improve their preventive management. The results underscore the potential of these models in clinical practice to improve diagnostic accuracy and reduce costs, supporting informed medical decisions and better health outcomes.
Steps to reproduce
The “Heart Failure” dataset available on Kaggle was used. It includes 918 patient records and 12 relevant features, such as age, cholesterol, and blood pressure, which are important factors for heart disease prediction. Outliers were removed and the data were binarized to improve distribution and consistency and to avoid bias in the models. Histograms, box plots, and density matrices were used along with a correlation matrix to identify significant relationships between the variables and heart disease risk. For classification, five machine learning algorithms were applied: Logistic Regression, Decision Tree, Random Forest, K-Nearest Neighbors, and Multilayer Perceptron, implemented in Python using the Scikit-Learn, Pandas, NumPy, Matplotlib, and Seaborn libraries. Each model was evaluated using cross-validation, hyperparameter optimization, and the accuracy, precision, recall, and F1-score metrics. Random Forest was the most effective model, with an accuracy of 92%. The study can be replicated by downloading the data from Kaggle and following the same preprocessing, model selection, and evaluation steps, which helps validate and improve prediction models for the diagnosis of heart failure.
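The steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it uses a synthetic stand-in generated with `make_classification` instead of the Kaggle file, and the IQR outlier rule, the helper name `remove_outliers_iqr`, the column names, the 5-fold split, and all model settings are assumptions chosen for the example. The binarization and hyperparameter optimization steps are not shown because the description does not specify them; a search such as Scikit-Learn's `GridSearchCV` could be wrapped around each model.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Kaggle data (assumption): 918 rows, 11 numeric
# predictors plus a binary target, with roughly the 410/508 class split.
X_arr, y_arr = make_classification(
    n_samples=918, n_features=11, n_informative=6,
    weights=[410 / 918], random_state=0,
)
feature_names = [f"feature_{i}" for i in range(11)]  # hypothetical names
df = pd.DataFrame(X_arr, columns=feature_names)
df["target"] = y_arr


def remove_outliers_iqr(frame, columns, k=1.5):
    """Keep rows whose values lie within k * IQR of the quartiles in every column."""
    q1 = frame[columns].quantile(0.25)
    q3 = frame[columns].quantile(0.75)
    iqr = q3 - q1
    within = (
        (frame[columns] >= q1 - k * iqr) & (frame[columns] <= q3 + k * iqr)
    ).all(axis=1)
    return frame[within]


clean = remove_outliers_iqr(df, feature_names)
X = clean[feature_names]
y = clean["target"]

# The five classifiers named in the study; settings are illustrative defaults.
# Scale-sensitive models are wrapped in a pipeline with StandardScaler.
models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "K-Nearest Neighbors": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Multilayer Perceptron": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0),
    ),
}

metrics = ["accuracy", "precision", "recall", "f1"]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Mean cross-validated score per model and metric.
results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=metrics)
    results[name] = {m: scores[f"test_{m}"].mean() for m in metrics}

print(pd.DataFrame(results).T.round(3))
```

Scores from this sketch come from synthetic data and will not match the figures reported above. To reproduce the study, replace the synthetic frame with the downloaded Kaggle file and encode its categorical columns before fitting.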