A Brazilian dataset of symptomatic patients for screening the risk of COVID-19

Published: 08-03-2021 | Version 5 | DOI: 10.17632/b7zcgmmwx4.5
Íris Viana dos Santos Santana,
Andressa C. M. da Silveira,
Alvaro Sobrinho,
Lenardo Chaves e Silva,
Leandro Dias da Silva,
Danilo Freire de Souza Santos,
Edmar Candeia,
Angelo Perkusich


The original COVID-19 dataset contained information about tested patients, including early-stage symptoms, comorbidities, demographic information, and symptom descriptions. The patients were tested with viral or rapid tests. The raw data were collected by the public health agency of the city of Campina Grande, Paraíba state, in Northeast Brazil, which is notified of all COVID-19 tests performed in the city. The agency's employees removed patient identification, and the data made available were reused for this study. We preprocessed the dataset by selecting only completed tests marked as positive or negative, applying string matching algorithms to correct some inconsistencies, and removing duplicated rows and asymptomatic patients. To select features, we focused on the most frequent and relevant demographic information and reported early-stage symptoms. We balanced the positive and negative cases by random undersampling using the NearMiss algorithm, and we also used unbalanced datasets. Using this dataset, we implemented and evaluated supervised machine learning models for COVID-19 detection in Brazil based on early-stage symptoms and basic personal information. This dataset relates to the study entitled "Machine Learning Classification Models for COVID-19 Test Prioritization in Brazil".
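
The filtering and balancing steps described above can be illustrated with a minimal sketch in plain Python. It is not the authors' pipeline: the record fields (`result`, `fever`, `cough`, `sore_throat`), the sample rows, and the helper names are hypothetical, the string matching step is omitted, and `near_miss_1` is a simplified take on the NearMiss-1 idea (keep the majority-class rows closest on average to their k nearest minority-class rows) using Hamming distance over binary symptoms.

```python
from statistics import mean

# Hypothetical records; field names and values are illustrative only,
# not the real schema of the published dataset.
SYMPTOMS = ["fever", "cough", "sore_throat"]
RAW = [
    {"result": "positive", "fever": 1, "cough": 1, "sore_throat": 0},
    {"result": "positive", "fever": 1, "cough": 0, "sore_throat": 1},
    {"result": "negative", "fever": 1, "cough": 1, "sore_throat": 0},
    {"result": "negative", "fever": 0, "cough": 1, "sore_throat": 0},
    {"result": "negative", "fever": 0, "cough": 0, "sore_throat": 1},
    {"result": "negative", "fever": 0, "cough": 1, "sore_throat": 0},  # duplicate
    {"result": "negative", "fever": 0, "cough": 0, "sore_throat": 0},  # asymptomatic
    {"result": "pending", "fever": 1, "cough": 1, "sore_throat": 1},   # not completed
]


def preprocess(rows):
    """Keep completed tests; drop asymptomatic patients and duplicate rows."""
    seen, cleaned = set(), []
    for row in rows:
        if row["result"] not in ("positive", "negative"):
            continue  # test not completed
        features = tuple(row[s] for s in SYMPTOMS)
        if not any(features):
            continue  # no reported symptom
        key = (row["result"], features)
        if key in seen:
            continue  # duplicated instance
        seen.add(key)
        cleaned.append(row)
    return cleaned


def hamming(a, b):
    """Number of symptoms on which two records disagree."""
    return sum(a[s] != b[s] for s in SYMPTOMS)


def near_miss_1(rows, k=2):
    """Undersample the larger class down to the size of the smaller one.

    Each majority-class row is scored by its mean distance to its k nearest
    minority-class rows; the rows with the smallest scores are kept.
    """
    pos = [r for r in rows if r["result"] == "positive"]
    neg = [r for r in rows if r["result"] == "negative"]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)

    def score(row):
        nearest = sorted(hamming(row, m) for m in minority)[:k]
        return mean(nearest)

    kept = sorted(majority, key=score)[: len(minority)]
    return minority + kept


cleaned = preprocess(RAW)
balanced = near_miss_1(cleaned)
```

Here `cleaned` keeps five of the eight hypothetical rows (two positive, three negative), and `balanced` keeps two of each class. For real use, an established implementation such as the `NearMiss` sampler in the imbalanced-learn package would be the more natural choice; the hand-written version only makes the selection rule visible.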