Hybrid Prophet-XGBoost Forecasting for Urban Air Quality with Real-Time Data Integration For Chennai

Published: 30 June 2025 | Version 1 | DOI: 10.17632/br5dyxm7c6.1

Description

This research aims to forecast short-term urban air quality by integrating meteorological variables and pollutant concentration data into a hybrid machine learning framework. The hypothesis is that using both historical pollutant data and weather features enhances predictive accuracy across different pollutant types.

Historic pollutant data (2022–2024) were collected from the WAQI (World Air Quality Index) historic dashboard, cleaned, and imputed for missing or anomalous values. The dataset includes daily values for PM₂.₅, PM₁₀, NO₂, SO₂, and O₃ from six key monitoring stations in Chennai. Supplementary weather and CAMS air quality data were sourced using the Open-Meteo APIs:

- https://api.open-meteo.com/v1/forecast for daily weather
- https://air-quality-api.open-meteo.com/v1/air-quality for CAMS-based pollutant history

These datasets were combined and processed to ensure daily granularity, using linear interpolation for imputation and z-score-based outlier capping.

The air quality forecast was generated for a 7-day horizon using a hybrid approach: Prophet (for seasonality), XGBoost (for lag-based learning), and ETS (for fallback modeling). AQI was computed from the Indian CPCB sub-index breakpoints, identifying both the AQI level and the dominant pollutant for each day.

To evaluate performance, back-testing compared actual and predicted pollutant values. Results show high accuracy for O₃ and SO₂, moderate accuracy for particulate matter, and the lowest accuracy for NO₂. Based on MAPE and sMAPE:

- O₃: 95.22% accuracy
- SO₂: 93.04% accuracy
- PM₁₀: 80.28% accuracy
- PM₂.₅: 76.78% accuracy
- NO₂: 65.74% accuracy

The model captured pollutant dynamics and generated AQI forecasts suitable for environmental monitoring and early warning applications.
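The AQI step described above (per-pollutant sub-indices, with the daily AQI taken as the largest one) can be sketched as follows. This is a minimal illustration, not the dataset's own code: the function names are hypothetical, only PM₂.₅ and PM₁₀ are covered, and the breakpoint values are assumptions that should be checked against the official CPCB table before use.

```python
# Assumed breakpoints: (conc_low, conc_high, index_low, index_high),
# for 24-hour mean concentrations in µg/m³. Illustrative values only;
# verify against the official CPCB National AQI table.
BREAKPOINTS = {
    "pm25": [(0, 30, 0, 50), (30, 60, 50, 100), (60, 90, 100, 200),
             (90, 120, 200, 300), (120, 250, 300, 400), (250, 380, 400, 500)],
    "pm10": [(0, 50, 0, 50), (50, 100, 50, 100), (100, 250, 100, 200),
             (250, 350, 200, 300), (350, 430, 300, 400), (430, 510, 400, 500)],
}

# Upper AQI bound of each category and its label.
CATEGORIES = [(50, "Good"), (100, "Satisfactory"), (200, "Moderate"),
              (300, "Poor"), (400, "Very Poor"), (500, "Severe")]


def sub_index(pollutant, conc):
    """Linearly interpolate a concentration onto the 0-500 index scale."""
    if conc < 0:
        raise ValueError("concentration must be non-negative")
    for c_lo, c_hi, i_lo, i_hi in BREAKPOINTS[pollutant]:
        if conc <= c_hi:
            return i_lo + (i_hi - i_lo) * (conc - c_lo) / (c_hi - c_lo)
    return 500.0  # above the top breakpoint: cap at the scale maximum


def status(aqi):
    """Map an AQI value to its category label."""
    for upper, label in CATEGORIES:
        if aqi <= upper:
            return label
    return "Severe"


def daily_aqi(concentrations):
    """Return (AQI, status, dominant pollutant) for one day's concentrations."""
    subs = {p: sub_index(p, c) for p, c in concentrations.items()}
    dominant = max(subs, key=subs.get)  # pollutant with the largest sub-index
    aqi = round(subs[dominant])
    return aqi, status(aqi), dominant
```

For example, `daily_aqi({"pm25": 45.0, "pm10": 175.0})` returns `(150, "Moderate", "pm10")`: PM₁₀ has the larger sub-index, so it sets the AQI and is reported as the dominant pollutant.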

Steps to reproduce

1. Collect Historic Data (2022–2024): Download station-level air quality data (PM₂.₅, PM₁₀, NO₂, SO₂, O₃) from the WAQI historic dashboard and save as CSV files.
2. Fetch External Data Using Open-Meteo: Use the Open-Meteo API to collect daily weather data and CAMS air quality data for the same date range and station coordinates.
3. Clean & Merge Data:
   - Handle missing values with linear interpolation.
   - Cap outliers using a z-score threshold (3.0).
   - Ensure daily frequency and merge the pollutant and weather datasets.
4. Train Forecast Models: Apply Prophet, XGBoost, and ETS to each pollutant using 90 days of history. Use weather and CAMS pollutants as external features.
5. Forecast AQI: For each forecasted day, compute sub-indices and determine AQI, status (e.g., Moderate), and dominant pollutant.
6. Evaluate Accuracy: Use accuracy.py to calculate MAPE, sMAPE, RMSE, and R² for each pollutant. Save results to CSV for reproducibility.
7. Visualize and Report: Generate plots and a PDF summary comparing predicted vs. observed data and highlighting forecast trends, AQI, and health status.
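The cleaning and evaluation steps above can be sketched with the Python standard library alone. This is a hedged illustration rather than the contents of accuracy.py or of the dataset's cleaning code: all function names are hypothetical, filling edge gaps with the nearest observed value is an assumption, and so is skipping zero denominators in MAPE and sMAPE. The source does not state how the percentage "accuracy" figures are derived from these metrics, so that step is not shown.

```python
import math
from statistics import mean, pstdev


def interpolate_linear(values):
    """Fill None gaps by linear interpolation between the nearest observations.
    Leading/trailing gaps take the nearest observed value (an assumption)."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    out = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            out[i] = values[right]
        elif right is None:
            out[i] = values[left]
        else:
            w = (i - left) / (right - left)
            out[i] = values[left] + w * (values[right] - values[left])
    return out


def cap_outliers(values, z=3.0):
    """Clip values to mean ± z standard deviations (3.0 as in the steps above)."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return list(values)
    lo, hi = mu - z * sd, mu + z * sd
    return [min(max(v, lo), hi) for v in values]


def mape(actual, pred):
    """Mean absolute percentage error, skipping days where actual is zero."""
    return 100 * mean(abs(a - p) / abs(a) for a, p in zip(actual, pred) if a != 0)


def smape(actual, pred):
    """Symmetric MAPE, skipping days where both values are zero."""
    return 100 * mean(2 * abs(a - p) / (abs(a) + abs(p))
                      for a, p in zip(actual, pred) if abs(a) + abs(p) != 0)


def rmse(actual, pred):
    """Root mean squared error."""
    return math.sqrt(mean((a - p) ** 2 for a, p in zip(actual, pred)))


def r2(actual, pred):
    """Coefficient of determination; undefined if actual is constant."""
    mu = mean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, pred))
    ss_tot = sum((a - mu) ** 2 for a in actual)
    return 1 - ss_res / ss_tot
```

A typical use would be `cap_outliers(interpolate_linear(series))` on each pollutant series before training, then calling the four metric functions on the back-test's observed and predicted values. Note that with a population standard deviation, a single extreme value in a series of n points has a z-score of at most √(n−1), so a threshold of 3.0 only starts to clip once the series has more than ten points.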

Categories

Artificial Intelligence, Machine Learning, Air Pollutant, Air Pollution Modeling, Urban Air Pollution
