A Brazilian dataset of symptomatic patients for screening the risk of COVID-19

Published: 27-01-2021 | Version 4 | DOI: 10.17632/b7zcgmmwx4.4
Íris Viana dos Santos Santana,
Andressa C. M. da Silveira,
Alvaro Sobrinho,
Lenardo Chaves e Silva,
Leandro Dias da Silva,
Danilo Freire de Souza Santos,
Edmar Candeia,
Angelo Perkusich


The original Brazilian COVID-19 dataset, covering the 26 Brazilian states and the Federal District, contains information about tested patients: early-stage symptoms, comorbidities, demographic information, and symptom descriptions. Patients were tested with viral or antibody tests. We preprocessed the dataset by selecting only completed tests marked as positive or negative, applying string matching algorithms to correct inconsistencies, and removing duplicated rows and asymptomatic patients. To select features, we focused on the most frequent and relevant demographic information and reported early-stage symptoms. We then balanced the positive and negative cases by undersampling with the NearMiss algorithm. Preprocessing reduced the dataset from 55,676 to 2,674 patients. The reduction is due to asymptomatic patients, duplicated data, patients who reported few symptoms, and the requirement that the dates of symptom onset and testing be available. Using this dataset, we implemented and evaluated supervised machine learning models for COVID-19 detection in Brazil based on early-stage symptoms and basic personal information. This dataset relates to the study entitled "Machine Learning Models for COVID-19 Detection in Brazil Based on Symptoms".
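The preprocessing steps described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' code. The column names (`result`, `fever`, `cough`, `sore_throat`), the helper names (`clean`, `hamming`, `near_miss_1`, `balance`), the Hamming distance over binary symptom flags, and the choice of the NearMiss-1 variant with `k=3` are all assumptions made for the example. The string matching and feature selection steps are omitted.

```python
from typing import Dict, List

# Hypothetical binary symptom columns; the real dataset's schema may differ.
SYMPTOMS = ["fever", "cough", "sore_throat"]


def clean(rows: List[Dict]) -> List[Dict]:
    """Keep completed tests from symptomatic patients, dropping exact duplicates."""
    seen = set()
    kept = []
    for row in rows:
        if row["result"] not in ("positive", "negative"):
            continue  # test not completed
        if not any(row[s] for s in SYMPTOMS):
            continue  # asymptomatic patient
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # duplicated instance
        seen.add(key)
        kept.append(row)
    return kept


def hamming(a: Dict, b: Dict) -> int:
    """Number of symptom flags on which two patients differ (assumed metric)."""
    return sum(a[s] != b[s] for s in SYMPTOMS)


def near_miss_1(majority: List[Dict], minority: List[Dict], k: int = 3) -> List[Dict]:
    """NearMiss-1 style selection: keep the majority rows whose mean distance
    to their k nearest minority rows is smallest, until both classes match in size."""
    if not minority:
        return []

    def score(row: Dict) -> float:
        nearest = sorted(hamming(row, m) for m in minority)[:k]
        return sum(nearest) / len(nearest)

    ranked = sorted(majority, key=score)  # stable sort keeps input order on ties
    return ranked[: len(minority)]


def balance(rows: List[Dict], k: int = 3) -> List[Dict]:
    """Undersample the larger class so positives and negatives are equal in number."""
    pos = [r for r in rows if r["result"] == "positive"]
    neg = [r for r in rows if r["result"] == "negative"]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    return minority + near_miss_1(majority, minority, k)
```

Calling `balance(clean(rows))` on a list of patient records returns a class-balanced subset. In practice a library implementation of NearMiss operating on numeric feature matrices would likely be preferable to this hand-written version.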