Air pollution and traffic flow dataset
Description
This dataset emerged from the hypothesis that, because road traffic is one of the main sources of atmospheric pollutant emissions, traffic flow data could improve the accuracy of pollutant forecasts. To verify the hypothesis, air pollutant and meteorological data were downloaded from the official Mexico City Government page (http://www.aire.cdmx.gob.mx/), pre-processed, and joined with traffic flow data from TomTom (https://www.tomtom.com/). The atmospheric data is public and was validated by the Mexican government through the CAME (Environmental Commission of the Megalopolis). It includes variables such as temperature, relative humidity, wind speed and direction, SO2, NOx, NO2, NO, O3, PM (2.5 and 10), and CO.
The measurements come from two stations in Mexico City, Mexico. The "MER station" is located at a latitude of 19.424610, a longitude of -99.119594, and an altitude of 2245 m a.s.l. The "UIZ station" is located at a latitude of 19.360794, a longitude of -99.073880, and an altitude of 2221 m a.s.l.
The dataset also includes traffic flow data for the street nearest to each atmospheric measurement station, queried in real time every 15 minutes for nearly three months (100 days). The traffic flow value expresses the current speed relative to free flow, that is, the difference between the speed on the street segment at that moment and its free-flow speed. It ranges from 0 to 1, where 0 indicates free flow and 1 indicates completely stopped traffic.
The information is separated into two folders. The "Raw data" folder contains the unprocessed traffic and meteorological station information, as retrieved from their respective sources, for the period between February 23, 2024, and May 31, 2024.
The "Data after pre-processing" folder contains one dataset per meteorological station, produced by a cleaning process in which missing data was removed, the data was normalized (using the Min-Max formula), and the traffic flow was averaged per hour so that it could be matched to the atmospheric and meteorological data for the same hour at the corresponding measurement station. The pre-processed data (without the timestamp) was used to train machine learning regression models to forecast the pollutants O3, CO, and NO2 for each station (the target pollutant was removed from the input variables in its respective experiment). To verify the hypothesis, the regression models were trained both with and without traffic data. Evaluated with the adjusted R2, RMSE, and MAE metrics, the models showed a slight improvement in forecast accuracy when the traffic data was included.
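
The pre-processing steps described above (hourly averaging of the 15-minute traffic values, joining by hour, removal of missing data, and Min-Max normalization) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used to build the dataset: the function names, the dictionary layout, the column names "hour", "O3" and "traffic", and all sample values are hypothetical, and the order of the steps and the choice to normalize every numeric column are assumptions.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def hourly_mean(readings):
    """Average (timestamp, value) readings into one value per clock hour."""
    buckets = defaultdict(list)
    for ts, value in readings:
        if value is None:  # skip missing 15-minute readings
            continue
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(value)
    return {hour: mean(values) for hour, values in buckets.items()}


def join_on_hour(station_rows, traffic_by_hour):
    """Attach hourly traffic to station rows; drop rows with missing data."""
    joined = []
    for row in station_rows:
        traffic = traffic_by_hour.get(row["hour"])
        if traffic is None or any(v is None for v in row.values()):
            continue
        joined.append({**row, "traffic": traffic})
    return joined


def min_max(values):
    """Scale values to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def normalize_columns(rows, columns):
    """Return copies of the rows with the given columns Min-Max scaled."""
    scaled = {col: min_max([row[col] for row in rows]) for col in columns}
    return [
        {**row, **{col: scaled[col][i] for col in columns}}
        for i, row in enumerate(rows)
    ]


# Made-up sample values, for illustration only (not taken from the dataset).
traffic = [
    (datetime(2024, 2, 23, 8, 0), 0.25),
    (datetime(2024, 2, 23, 8, 15), 0.75),
    (datetime(2024, 2, 23, 8, 30), None),
    (datetime(2024, 2, 23, 9, 0), 0.0),
    (datetime(2024, 2, 23, 9, 15), 0.5),
    (datetime(2024, 2, 23, 10, 0), 1.0),
]
station = [
    {"hour": datetime(2024, 2, 23, 8), "O3": 10.0},
    {"hour": datetime(2024, 2, 23, 9), "O3": 20.0},
    {"hour": datetime(2024, 2, 23, 10), "O3": 30.0},
    {"hour": datetime(2024, 2, 23, 11), "O3": None},  # missing pollutant value
    {"hour": datetime(2024, 2, 23, 12), "O3": 40.0},  # no traffic for this hour
]

joined = join_on_hour(station, hourly_mean(traffic))
normalized = normalize_columns(joined, ["O3", "traffic"])
for row in normalized:
    print(row["hour"].isoformat(), row["O3"], round(row["traffic"], 3))
```

In this sketch the rows for 11:00 and 12:00 are dropped because one has a missing pollutant value and the other has no matching traffic reading, which leaves three complete hourly rows to normalize. In practice the same steps would be applied separately to the data of each station.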