Heart attack risk prediction: a more productive, reduced, and cleansed ARFF dataset for machine learning

Published: 14 August 2024 | Version 1 | DOI: 10.17632/n6293b4ggc.1
Contributors:
Gennady Chuiko, Olga Yaremchuk

Description

We used the Cleveland Heart Disease Dataset from the University of California, Irvine (UCI) Machine Learning Repository (https://doi.org/10.24432/C52P4X). This dataset, dating back to 1988, combines four separate datasets. It comprises 303 instances, each with 13 attributes and one target variable, and contains no missing values. We ranked the attributes by relevance and selected a subset of seven attributes plus the target variable, which resulted in a smaller dataset (254 instances with seven attributes). Primary datasets used in clinical practice often require updating, which involves thorough data reduction and denoising. This data engineering process typically enhances the predictive power of the datasets, especially with the classification algorithm known as IBk (Aha, D.W., Kibler, D. & Albert, M.K. Instance-based learning algorithms. Mach Learn 6, 37–66 (1991). https://doi.org/10.1007/BF00153759), which we applied to the reduced dataset. For instance, accuracy rose from 88.4% to 97.2%, and other performance indicators showed statistically significant improvement.
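For readers unfamiliar with IBk, it is WEKA's k-nearest-neighbour classifier: a new instance receives the majority label of the k stored instances closest to it. The sketch below illustrates that idea in plain Python. It is not the authors' pipeline; the function name `knn_predict`, the two-feature toy rows, and the labels "absent" and "present" are assumptions made up for illustration and are not taken from the dataset.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=1):
    """Classify `query` by majority vote among its k nearest training rows."""
    # Euclidean distance from the query to every stored instance
    distances = [
        (math.dist(row, query), label)
        for row, label in zip(train_X, train_y)
    ]
    # Sort by distance only, so labels are never compared on ties
    distances.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy, made-up instances (NOT rows from the heart disease dataset)
train_X = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]
train_y = ["absent", "absent", "present", "present"]

print(knn_predict(train_X, train_y, (0.15, 0.15), k=1))
print(knn_predict(train_X, train_y, (0.85, 0.85), k=3))
```

Because the classifier stores the training instances and compares against them directly, noisy or redundant rows can affect predictions, which is one reason data reduction and denoising may matter for this kind of method.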

Files

Steps to reproduce

The dataset is in ARFF format, which is convenient for the WEKA software.
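Outside WEKA, an ARFF file is plain text: a header of `@attribute` declarations followed by comma-separated rows after `@data`. The following is a minimal sketch of reading such a file in Python. It assumes a dense ARFF file with unquoted, single-word attribute names and no quoted values. The helper `parse_arff` and the sample content (relation `toy`, attributes `age` and `target`) are hypothetical and do not reproduce the published file.

```python
def parse_arff(text):
    """Parse a simple dense ARFF document into (attribute_names, rows)."""
    attributes, rows, in_data = [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and ARFF comments
        lower = line.lower()
        if in_data:
            # Data rows: comma-separated values kept as strings
            rows.append([value.strip() for value in line.split(",")])
        elif lower.startswith("@attribute"):
            # The second whitespace-separated token is the attribute name
            attributes.append(line.split()[1])
        elif lower.startswith("@data"):
            in_data = True
    return attributes, rows


# Hypothetical sample content, for illustration only
sample = """\
% toy example
@relation toy
@attribute age numeric
@attribute target {0,1}
@data
63,1
41,0
"""

names, rows = parse_arff(sample)
print(names)
print(rows)
```

To read the actual file, pass its text to the same function, for example `parse_arff(open(path).read())`, where `path` points to the downloaded ARFF file.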

Institutions

Chornomors'kyj derzhavnyj universytet imeni Petra Mohyly

Categories

Machine Learning, Heart, Health

Licence