Air quality monitoring
Description
The airquality.csv file has 5,999 rows and 5 numeric columns (PM2.5, co2, no2, so2, and o3), with no missing values and no duplicate rows. The variables appear to be pollutant concentrations, each with a distinct spread: PM2.5 has a median of 18 and an interquartile range (IQR) of 11–28 (range 3–48); co2 is the most variable, with a median of 1,183 and a long right tail (IQR 625–4,093, range 40–6,999); no2 centers at 48 (IQR 27–174, range 5–300); so2 at 59 (IQR 35–229, range 1–400); and o3 at 123 (IQR 77–167, range 10–250). In short, it is a clean, fully numeric pollution dataset with notable dispersion, especially in co2 and so2, and is well suited to quick EDA (distributions, outliers, correlations) or to modeling once a prediction target is chosen.
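The summary figures above can be re-checked with a few lines of standard-library Python. This is a minimal sketch, not the procedure used to produce the description: it assumes the CSV has a header row whose names match the columns listed above, and the helper names (load_columns, summarize, duplicate_row_count) are hypothetical. The quartile method is also an assumption, so IQR values may differ slightly from those quoted.

```python
import csv
import statistics


def load_columns(path):
    """Read a CSV with a header row into {column_name: [float, ...]}.

    Assumes every cell is numeric, as the description states.
    """
    columns = {}
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            for name, value in row.items():
                columns.setdefault(name, []).append(float(value))
    return columns


def summarize(values):
    """Return median, quartiles, and range for one column."""
    # "inclusive" treats the data as the whole population; this choice is
    # an assumption and other quartile conventions give slightly different IQRs.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "min": min(values),
        "max": max(values),
    }


def duplicate_row_count(columns):
    """Count rows that repeat an earlier row exactly."""
    rows = list(zip(*columns.values()))
    return len(rows) - len(set(rows))
```

With the file in the working directory, calling load_columns("airquality.csv") and then summarize on each column should give figures close to the medians, IQRs, and ranges quoted above, and duplicate_row_count should return 0 if the "no duplicate rows" claim holds.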
Files
Steps to reproduce
Methods / Data-Provenance outline for the air-quality study:

Study setting & deployment. We installed IoT air-quality nodes across urban locations (Dhaka), each measuring PM2.5, CO₂, NO₂, SO₂, and O₃ in real time. Sensors stream to a central server for processing.

Instruments (bill of materials).
– CO₂: MH-Z19B (NDIR) to quantify CO₂ via infrared absorption.
– NO₂ / SO₂: electrochemical gas sensors producing current proportional to concentration.
– O₃: MOS or UV-absorption sensor.
– Controller: Arduino Uno for signal conversion/ingest and on-node checks.

Acquisition protocol. Nodes continuously sample pollutants and push data; the controller digitizes/validates readings, and the power system ensures uninterrupted acquisition.

Transport & networking. Sensor data are published via MQTT to a cloud broker; depending on site, nodes use Wi-Fi, LoRa, or cellular links to reach the broker.

Storage & pipeline. Data arriving at the broker are persisted in a cloud database and fed into the analytics pipeline.

Pre-processing. We apply noise filtering, imputation for missing values, outlier handling against pollutant thresholds, and feature normalization before modeling.

Modeling & software workflow. Cleaned data drive supervised ML models (XGBoost, CatBoost, Gradient Boosting, and SVM) for prediction, hotspot detection, and trend forecasting. (Any equivalent implementations can reproduce the pipeline.)

Visualization & alerting. Results render to an interactive dashboard with pollution maps; threshold-based alerts notify users/authorities when levels exceed safe bounds.

Reproducibility checklist.
(i) Specify sensor models and calibration steps;
(ii) publish firmware/config for Arduino + MQTT topics/QoS;
(iii) document broker URL, auth, and network (Wi-Fi/LoRa/cellular) settings;
(iv) share schema for cloud DB;
(v) release pre-processing code (filters, imputation, normalization) and training scripts for the four ML models;
(vi) export dashboard definitions and alert thresholds.
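The pre-processing step (noise filtering, imputation, outlier handling against thresholds, normalization) is described only in outline, so the following is one possible sketch rather than the authors' code. Every concrete choice is an assumption: forward-fill imputation, a 3-point rolling median as the noise filter, clipping as the outlier treatment, and min-max scaling as the normalization. The bounds are placeholders taken from the observed ranges in the description, not the pollutant thresholds the study used, and all function names are hypothetical. Imputation runs first here only because the median filter needs numeric input.

```python
import statistics

# Placeholder bounds: the observed min/max per column from the dataset
# description, NOT the thresholds the authors used (those are not stated).
BOUNDS = {
    "PM2.5": (3, 48),
    "co2": (40, 6999),
    "no2": (5, 300),
    "so2": (1, 400),
    "o3": (10, 250),
}


def impute_forward(values):
    """Replace None with the last valid reading (or the first valid one at the start)."""
    first_valid = next((v for v in values if v is not None), None)
    if first_valid is None:
        raise ValueError("column has no valid readings")
    filled, last = [], first_valid
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def median_filter(values, window=3):
    """Rolling median with an odd window; the edges use a truncated window."""
    half = window // 2
    return [
        statistics.median(values[max(0, i - half): i + half + 1])
        for i in range(len(values))
    ]


def clip(values, low, high):
    """Pull readings outside [low, high] back to the nearest bound."""
    return [min(max(v, low), high) for v in values]


def min_max_normalize(values, low, high):
    """Scale readings linearly so that low maps to 0 and high maps to 1."""
    span = high - low
    return [(v - low) / span for v in values]


def preprocess(values, column):
    """Impute, smooth, clip, and normalize one pollutant column."""
    low, high = BOUNDS[column]
    cleaned = impute_forward(values)
    cleaned = median_filter(cleaned)
    cleaned = clip(cleaned, low, high)
    return min_max_normalize(cleaned, low, high)
```

For example, preprocess([10, None, 300, 130, 130], "o3") fills the gap with 10, smooths away the isolated spike at 300, and returns values scaled to the 10–250 placeholder range. Tree-based models such as XGBoost are generally insensitive to this kind of scaling, whereas SVMs usually benefit from it, so the normalization choice matters mainly for the SVM.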