Clinical cases of Dengue and Chikungunya
This data set presents clinical and sociodemographic information on patients with confirmed Dengue or Chikungunya, as well as on patients for whom these diseases were ruled out. The data were drawn from two databases. The first is the Health Problem and Notification Information System, from the Portuguese Sistema de Informação de Agravo de Notificação (SINAN), covering cases that occurred in the state of Amazonas from 2015 to 2020. The second is Dados Recife, an open data portal of the city of Recife, in the state of Pernambuco, also covering 2015 to 2020. The data set has 17,172 records and 27 attributes. It comprises four CSV files: "data set.csv" contains the preprocessed data set itself, "attributes.csv" describes each attribute present in the data set, and "sinan-db.csv" and "recife-db.csv" contain the original data sets.
Steps to reproduce
Data on Dengue and Chikungunya notifications from the state of Amazonas and the city of Recife, Pernambuco, from 2015 to 2020 were used. For the state of Amazonas, data were retrieved from SINAN, the official system for disease reporting in Brazil. Diseases on the national list of compulsory notification, which includes Dengue and Chikungunya, must be reported. This data set contains 57,445 entries and 146 variables and is hereafter referred to as “SINAN-db”. The data for Recife were retrieved from an open data portal named Portal de Dados Abertos do Recife, maintained by the Recife Health Department. Its primary source is also SINAN, so it follows the same data dictionary and can be integrated without further issues. This data set contains 83,073 records and 124 variables and is referred to as “Recife-db” in this work.
First, the output classes were grouped into three distinct classes, stored in the CLASSI_FIN attribute: DENGUE (patients with confirmed Dengue), CHIKUNGUNYA (patients with confirmed Chikungunya), and OTHERS (patients classified as “inconclusive” or “negative” for both Dengue and Chikungunya). Only records confirmed or denied by clinical diagnosis were selected. Records that did not report any signs or symptoms were discarded, since signs and symptoms are the most important information for classification models. Variables with more than 50% of their values missing were also removed. In addition to the original variables, a new one (DIAS) was created to hold the time, in days, from the onset of symptoms to the date of notification, so that it could be used by the models. Specialists were consulted for the selection of attributes. After the variables were coded as numbers, duplicates were removed and missing values were replaced by “not informed” for each variable. Records with missing values for all variables were also removed.
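The cleaning steps described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code used to build the data set: the column names onset_date, notification_date, fever and rare are hypothetical, missing values are assumed to be represented as None, and the order of the steps (drop sparse variables, compute DIAS, drop empty and duplicate records, fill gaps) is one plausible reading of the description.

```python
from datetime import date


def drop_sparse_columns(rows, threshold=0.5):
    """Remove columns in which more than `threshold` of the values are missing."""
    if not rows:
        return []
    keep = [
        col for col in rows[0]
        if sum(r[col] is None for r in rows) / len(rows) <= threshold
    ]
    return [{col: r[col] for col in keep} for r in rows]


def add_dias(rows, onset_key="onset_date", notif_key="notification_date"):
    """Add DIAS: days from symptom onset to notification (None if a date is missing)."""
    out = []
    for r in rows:
        onset, notif = r.get(onset_key), r.get(notif_key)
        known = onset is not None and notif is not None
        out.append({**r, "DIAS": (notif - onset).days if known else None})
    return out


def clean(rows, fill="not informed"):
    """Drop all-missing records and exact duplicates, then fill remaining gaps."""
    seen, out = set(), []
    for r in rows:
        if all(v is None for v in r.values()):
            continue  # record with no information at all
        key = frozenset(r.items())
        if key in seen:
            continue  # exact duplicate of an earlier record
        seen.add(key)
        out.append({k: fill if v is None else v for k, v in r.items()})
    return out


# Tiny synthetic example; the values are made up for illustration only.
rows = [
    {"onset_date": date(2019, 3, 1), "notification_date": date(2019, 3, 4), "fever": 1, "rare": None},
    {"onset_date": date(2019, 3, 1), "notification_date": date(2019, 3, 4), "fever": 1, "rare": None},
    {"onset_date": date(2019, 5, 10), "notification_date": date(2019, 5, 12), "fever": None, "rare": None},
    {"onset_date": None, "notification_date": None, "fever": None, "rare": None},
]
cleaned = clean(add_dias(drop_sparse_columns(rows)))
print(len(cleaned))  # 2
```

In this toy input the "rare" column is entirely missing and is dropped, the second record is a duplicate, and the fourth record has no information, so two records remain.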
The cleaned data set consisted of 17,948 records in the DENGUE class, 5,724 in the CHIKUNGUNYA class, and 16,704 in the OTHERS class, totaling 40,376 records with 27 variables. In data science, a data set in which one class has many more records than another is called imbalanced. Imbalance can bias a machine learning (ML) model toward predicting the class with the largest number of records. To balance the data set, random undersampling was applied. In this technique, the class with the fewest records sets the size of every other class, so that all classes end up with the same number of records. After balancing, the data set still had 27 attributes and contained 17,172 records, with 5,724 in each of the three classes.
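A minimal sketch of random undersampling, using only the Python standard library, shows how the reported class sizes lead to 17,172 records. The function name random_undersample, the seed, and the synthetic placeholder records are assumptions made for illustration; the original work may have used a different tool or random seed.

```python
import random
from collections import Counter, defaultdict


def random_undersample(records, label_key="CLASSI_FIN", seed=0):
    """Randomly downsample every class to the size of the smallest class."""
    by_class = defaultdict(list)
    for rec in records:
        by_class[rec[label_key]].append(rec)
    target = min(len(group) for group in by_class.values())
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, target))  # sample without replacement
    rng.shuffle(balanced)
    return balanced


# Synthetic stand-in records with the class sizes reported in the text.
sizes = {"DENGUE": 17948, "CHIKUNGUNYA": 5724, "OTHERS": 16704}
records = [
    {"id": i, "CLASSI_FIN": label}
    for label, n in sizes.items()
    for i in range(n)
]

balanced = random_undersample(records)
counts = Counter(rec["CLASSI_FIN"] for rec in balanced)
print(len(records), len(balanced))  # 40376 17172
print(sorted(counts.items()))
# [('CHIKUNGUNYA', 5724), ('DENGUE', 5724), ('OTHERS', 5724)]
```

Undersampling discards 23,204 of the 40,376 cleaned records, which is the cost of obtaining equal class sizes without generating synthetic cases.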