Brazilian datasets classified to support differential diagnosis of Severe Acute Respiratory Syndrome (SARS) caused by COVID-19 and influenza
Description
The SIVEP-Gripe database contains 3,395,398 records with 166 attributes, covering the years 2020 to 2022. These records document cases of Severe Acute Respiratory Syndrome (SARS) caused by COVID-19, influenza, other etiological agents, various respiratory viruses, and unspecified cases. Of the total records, 1,872,106 are related to SARS due to COVID-19 and 21,490 to SARS due to influenza, a ratio of roughly 87 to 1 that highlights the need for class balancing.

Four datasets were created with different balancing configurations:

* Balanced by age range (1BAR): The majority class was reduced to match the number of records in the minority class, based on age ranges. Specifically, records from the majority class were selected to match the minimum and maximum age ranges of the minority class.
* Balanced by age, sex, and same distribution (2BASD): For each record in the minority class, an equal number of records with the same sex and age were selected from the majority class.
* Balanced by age, sex, region, and same distribution (3BARD): This approach included balancing by region, in addition to age and sex.
* Balanced by age, sex, outcome, and same distribution (4BASED): This method balanced records by age, sex, and outcome (recovery or death) to maintain consistent distributions of these factors across both classes.

After preprocessing, all datasets retained 24 attributes and one target class, "classi_fin", where 1 represents SARS due to influenza and 5 represents SARS due to COVID-19. These subsets were created to evaluate the performance of machine learning models during training.
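
The matched balancing described above (2BASD, and by extension 3BARD and 4BASED) can be illustrated with a minimal pandas sketch. This is not the authors' preprocessing code: the function name `balance_by_strata` and the column names `age` and `sex` are hypothetical, and only the target column "classi_fin" with its values 1 (influenza) and 5 (COVID-19) comes from the description. The sketch assumes that balancing means randomly undersampling the majority class so that each combination of the chosen attributes contributes as many records as the minority class has in that combination.

```python
import pandas as pd


def balance_by_strata(df, keys, target="classi_fin", minority=1, majority=5, seed=0):
    """Undersample the majority class per stratum defined by `keys`.

    Illustrative only: `keys` holds hypothetical column names such as
    ["age", "sex"]; adding a region or outcome column would mimic the
    3BARD or 4BASED configurations.
    """
    minority_df = df[df[target] == minority]
    majority_df = df[df[target] == majority]

    # How many minority records exist in each stratum (e.g. each age/sex pair).
    quota = minority_df.groupby(keys).size().rename("quota").reset_index()

    # Keep only majority records whose stratum also occurs in the minority class.
    candidates = majority_df.merge(quota, on=keys, how="inner")

    # Draw at most `quota` majority records from each stratum.
    parts = [
        group.sample(n=min(len(group), int(group["quota"].iloc[0])), random_state=seed)
        for _, group in candidates.groupby(keys)
    ]
    if not parts:
        return minority_df.reset_index(drop=True)

    sampled = pd.concat(parts).drop(columns="quota")
    return pd.concat([minority_df, sampled], ignore_index=True)


# Toy data standing in for the real records.
toy = pd.DataFrame({
    "age": [30, 30, 30, 30, 60, 60, 60, 45],
    "sex": ["F", "F", "F", "F", "M", "M", "M", "F"],
    "classi_fin": [1, 5, 5, 5, 1, 5, 5, 5],
})
balanced = balance_by_strata(toy, keys=["age", "sex"])
```

In this sketch, a stratum where the majority class has fewer records than the minority class is capped at what is available, so the result is exactly balanced only when every stratum has enough majority records.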
Steps to reproduce
The preprocessing steps are shown in the image "Preprocessing steps.png".