Arbovirus clinical data, Brazil, 2013–2020
This data set presents clinical, sociodemographic, and laboratory information on patients with confirmed Dengue or Chikungunya, as well as on patients whose cases were discarded for these same diseases. The data cover notifications that occurred in Brazil from 2013 to 2020 and were recorded in the Health Problem and Notification Information System (in Portuguese, Sistema de Informação de Agravo de Notificação, SINAN). The data set has 7,632,542 records and 56 attributes. It comprises four CSV files: the "data set.csv" file contains the pre-processed data set; the "attributes.csv" file describes each attribute present in the data set; and the "dengue.csv" and "chikungunya.csv" files contain the original SINAN data, without any pre-processing applied.
Steps to reproduce
The data were collected from the Health Problem and Notification Information System (in Portuguese, Sistema de Informação de Agravo de Notificação, SINAN), which records notifications of patients diagnosed with a disease on the national list of compulsory notification of diseases, injuries, and public health events, as is the case for Dengue and Chikungunya. The data collected contain notifications of Dengue and Chikungunya cases that occurred in Brazilian territory between 2013 and 2020.
Data on Dengue patients contain clinical information (symptoms and pre-existing comorbidities), laboratory tests performed, and sociodemographic data for each patient. Data on Chikungunya cases, by contrast, contain only sociodemographic information. Although the Chikungunya data set officially has no clinical or laboratory information, we found about 100 records with this information. Possibly, these records were first treated as suspected cases of Dengue, and were therefore recorded in the Dengue data set, and only later confirmed as cases of Chikungunya. It is therefore possible to observe some clinical and laboratory data for Chikungunya patients as well. No sensitive patient information is available.
For the pre-processing, the SINAN data from all states were first unified, resulting in 13,421,230 notifications and 118 attributes. The records were grouped into three classes, recorded in the CLASSI_FIN attribute: "Dengue", "Chikungunya", and "Discarded/Inconclusive". Only notifications that were confirmed or discarded/inconclusive through laboratory tests were selected. After this step, the attribute used for the filter (CRITERIO) was removed, since it then contained only a single value. The attribute TP_NOT identifies the type of notification generated; as all notifications are of the "Individual" type, this attribute has the same value for all records.
Attributes that had more than 60% null data or that were not in the original data dictionary were also removed. In the attributes that remained, null fields were filled with each attribute's default "not informed" value, as defined in the dictionary. Categorical data were then transformed to numerical values. At the end of the process, the data set consisted of 4,307,513 records for Dengue, 325,000 records for Chikungunya and 2,100,029 records for Discarded/Inconclusive.
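The filtering and null-handling steps described above can be illustrated with a minimal pandas sketch. It is not the authors' code. The value 1 for a laboratory criterion in CRITERIO, the value 9 as a single "not informed" code, the function name `preprocess`, and the toy columns FEBRE, MOSTLY_NULL and EXTRA are all assumptions made for illustration. The real default values differ per attribute and should be taken from the data dictionary. The categorical-to-numerical transformation is not shown.

```python
import pandas as pd

LAB_CRITERION = 1      # ASSUMPTION: code meaning "laboratory" in CRITERIO
NOT_INFORMED = 9       # ASSUMPTION: one "not informed" code for every attribute
NULL_THRESHOLD = 0.60  # attributes with more than 60% null data are removed


def preprocess(df, dictionary_columns):
    """Hypothetical helper that mirrors the described steps on a DataFrame."""
    # Keep only notifications confirmed or discarded through laboratory tests.
    lab = df[df["CRITERIO"] == LAB_CRITERION]
    # The filter attribute now holds a single value, so drop it.
    lab = lab.drop(columns=["CRITERIO"])
    # Remove attributes that are absent from the data dictionary.
    known = [c for c in lab.columns if c in set(dictionary_columns)]
    lab = lab[known]
    # Remove attributes whose share of null values exceeds the threshold.
    null_share = lab.isna().mean()
    lab = lab.loc[:, null_share <= NULL_THRESHOLD]
    # Fill the remaining nulls with the "not informed" code.
    return lab.fillna(NOT_INFORMED).reset_index(drop=True)


# Toy data; the codes below are made up and are not the real SINAN encoding.
toy = pd.DataFrame({
    "CLASSI_FIN": [10, 13, 5, 10, 5],
    "CRITERIO": [1, 1, 1, 2, 1],
    "FEBRE": [1, None, 2, 1, 1],
    "MOSTLY_NULL": [None, None, None, 1, 1],
    "EXTRA": [0, 0, 0, 0, 0],
})
clean = preprocess(toy, ["CLASSI_FIN", "CRITERIO", "FEBRE", "MOSTLY_NULL"])
print(clean)
```

On the toy data, the row with a non-laboratory criterion is dropped, EXTRA is removed because it is not in the dictionary list, and MOSTLY_NULL is removed because 75% of its remaining values are null.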