Sociodemographic data on live births, Brazil, 2018-2020
This dataset is open data from the Sistema de Informação de Nascidos Vivos (SINASC), a system implemented by the Brazilian federal government in the 1990s to collect data on all live births in the national territory. The system provides birth information to all levels of the Brazilian health system and supports the development of indicators used in the strategic planning of health actions, activities, public policies and programs. The dataset covers three years of SINASC (2018, 2019 and 2020) for the state of Pernambuco only, and is composed of routine prenatal data, gestational history, sociodemographic data and newborn data, including birth weight. The pre-processed dataset has 10 attributes plus the target attribute ‘WEIGHT’ and contains 351,253 records: 29,625 low birth weight records and 321,628 adequate weight records. The dataset consists of two CSV files: “Dataset.csv” is the pre-processed dataset and “Attributes.csv” contains the description of each attribute.
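The class counts above imply a strongly imbalanced target, with roughly 8.4% of records in the low birth weight class. A minimal sketch of that arithmetic, using only the figures quoted in the description (the variable names are illustrative):

```python
# Record counts quoted in the dataset description.
total_records = 351_253
low_weight_records = 29_625
adequate_weight_records = 321_628

# The two classes should account for every record.
assert low_weight_records + adequate_weight_records == total_records

# Share of the minority (low birth weight) class and the class ratio.
low_weight_share = low_weight_records / total_records
adequate_to_low_ratio = adequate_weight_records / low_weight_records

print(f"Low birth weight share: {low_weight_share:.1%}")
print(f"Adequate-to-low ratio: {adequate_to_low_ratio:.1f} to 1")
```

Anyone training a classifier on ‘WEIGHT’ may want to account for this imbalance, for example through stratified splits or class weighting.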
Steps to reproduce
Data were extracted for the state of Pernambuco only, from 2018 to 2020, resulting in a dataset with 400,157 records and 61 attributes. To prepare the dataset, records whose target attribute 'WEIGHT' indicated macrosomia (newborns weighing 4,000 grams or more) or was empty were discarded first, removing 24,838 and 30 records respectively. Attributes with more than 70% empty values were also discarded, as were attributes related to the postpartum period, duplicated attributes, attributes representing geographic codes and attributes of type date. The remaining attributes were analyzed with the help of health specialists (stakeholders) from the Mãe Coruja Pernambucana Program (PMCP), which assists pregnant women through the Public Health System (SUS) before and after the birth of their children, up to the age of five. Records identified as outliers were excluded: (i) mother's age greater than 56 years; (ii) more than 11 living children; (iii) more than seven deceased children; (iv) more than five cesarean deliveries. The target attribute, originally recorded in grams, was converted to a binary attribute, and the mother's occupation code attribute was converted from numerical to categorical. The final dataset, after pre-processing and cleaning, is composed of 351,253 records and 10 attributes plus the target attribute, of which 29,625 records correspond to low birth weight and 321,628 to normal birth weight.
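The record-level filtering described above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' code. Apart from 'WEIGHT', the field names are hypothetical (the real names are listed in "Attributes.csv"). The 2,500 gram cut-off for low birth weight is the WHO definition and is assumed here, since the text does not state the threshold. Coding low birth weight as 1 and keeping records with missing values in the outlier fields are also assumptions. Attribute removal and the occupation code conversion are not covered.

```python
MACROSOMIA_GRAMS = 4000  # stated in the text: 4,000 g or more is discarded
LOW_WEIGHT_GRAMS = 2500  # assumption: WHO cut-off, not stated in the text

# Upper bounds stated in the text; the field names are hypothetical.
OUTLIER_LIMITS = {
    "mother_age": 56,
    "living_children": 11,
    "deceased_children": 7,
    "cesarean_deliveries": 5,
}


def keep_record(record: dict) -> bool:
    """Return True if the record survives the weight and outlier filters."""
    weight = record.get("WEIGHT")
    # Discard empty target values and macrosomia cases.
    if weight is None or weight >= MACROSOMIA_GRAMS:
        return False
    # Discard records above any outlier limit.
    # Assumption: a missing value in these fields does not exclude the record.
    for field, limit in OUTLIER_LIMITS.items():
        value = record.get(field)
        if value is not None and value > limit:
            return False
    return True


def binarize_weight(grams: float) -> int:
    """Map weight in grams to a binary label (assumed: 1 = low birth weight)."""
    return 1 if grams < LOW_WEIGHT_GRAMS else 0


def preprocess(records: list) -> list:
    """Filter records, then replace the weight in grams with the binary label."""
    cleaned = []
    for record in records:
        if keep_record(record):
            converted = dict(record)  # copy so the input is left unchanged
            converted["WEIGHT"] = binarize_weight(record["WEIGHT"])
            cleaned.append(converted)
    return cleaned


# Tiny made-up sample to exercise the filters.
sample = [
    {"WEIGHT": 3200, "mother_age": 28, "living_children": 1},
    {"WEIGHT": 2100, "mother_age": 19, "living_children": 0},
    {"WEIGHT": 4300, "mother_age": 31, "living_children": 2},   # macrosomia
    {"WEIGHT": None, "mother_age": 25, "living_children": 1},   # empty target
    {"WEIGHT": 3000, "mother_age": 60, "living_children": 3},   # age outlier
]
print(preprocess(sample))
```

As a consistency check on the reported counts, removing the 24,838 macrosomia records and the 30 empty records from 400,157 leaves 375,289 records, so the later steps account for the remaining 24,036 removed records.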