Air quality dataset
Description
This dataset contains hourly time series of PM10 and PM2.5 concentrations with meteorological covariates, recorded at 21 air quality measurement stations. The measurements were collected as part of an air pollution monitoring effort and are intended to support research on time-series forecasting, air quality analysis, and environmental data modeling.

The data are provided as one CSV file per station, and all stations share the same measurement window:

- Time span (all stations): 2024-05-02 20:00:00 to 2025-12-11 07:00:00
- Records: 14,100 timestamps per station, 296,100 rows total
- Timestamp grid: hourly; the provided time index is complete (max observed inter-sample gap = 1 hour)
- Primary targets for forecasting experiments: PM10, PM2.5

Missingness is primarily value-level (NaNs in variables), not missing timestamps. PM10 missingness is low overall (~1.66%), while wind direction is frequently missing (~49%).

Each CSV corresponds to one monitoring station, identified by the file stem (e.g., station `E421` is in `E421.csv`).
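
The record counts stated above follow directly from the time span and the hourly grid. The following is a minimal sketch of that arithmetic using only the Python standard library; the helper names `expected_hourly_count` and `max_gap_hours` are illustrative and not part of the dataset.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"
START = datetime.strptime("2024-05-02 20:00:00", FMT)
END = datetime.strptime("2025-12-11 07:00:00", FMT)
N_STATIONS = 21


def expected_hourly_count(start, end):
    """Number of hourly timestamps from start to end, both endpoints included."""
    return int((end - start) / timedelta(hours=1)) + 1


def max_gap_hours(timestamps):
    """Largest gap, in hours, between consecutive timestamps after sorting."""
    ordered = sorted(timestamps)
    gaps = [(b - a) / timedelta(hours=1) for a, b in zip(ordered, ordered[1:])]
    return max(gaps) if gaps else 0.0


per_station = expected_hourly_count(START, END)
print(per_station, per_station * N_STATIONS)  # 14100 296100
```

Applying `max_gap_hours` to the parsed `datetime` column of a station file should return 1.0 if the time index is complete as described.
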
Data format

- Format: CSV (comma-separated values)
- Encoding: UTF-8 (note: `clouds` values contain Slovene characters)
- Time column: `datetime` in the format `YYYY-MM-DD HH:MM:SS`
- Missing values: empty fields are interpreted as missing; numeric missing values should be parsed as NaN

Variables (columns)

Each station file contains the same columns:

- `datetime` (string/datetime): hourly timestamp (timezone not explicitly encoded)
- `PM10` (float): PM10 concentration (unit as provided by the source; typically µg/m³)
- `PM2.5` (float): PM2.5 concentration (unit as provided by the source; typically µg/m³)
- `temperature` (float): air temperature (unit as provided by the source; typically °C)
- `rain` (float): precipitation amount/intensity (unit as provided by the source)
- `pressure` (float): surface pressure (unit as provided by the source; typically hPa)
- `precipitation` (float): percentage-valued meteorological covariate in the range 0 to 100 (semantics depend on upstream provider; often interpreted as relative humidity or a probability-like indicator)
- `wind_speed` (float): wind speed (unit as provided by the source)
- `clouds` (string): categorical sky condition (Slovene labels), one of:
  - `jasno` (clear)
  - `delno oblačno` (partly cloudy)
  - `pretežno oblačno` (mostly cloudy)
  - `oblačno` (cloudy)
- `wind_direction` (string): categorical wind direction using Slovene abbreviations (often missing). Observed categories:
  - `S` (North), `SV` (NE), `V` (East), `JV` (SE), `J` (South), `JZ` (SW), `Z` (West), `SZ` (NW)
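
The format and column descriptions above can be turned into a small loader. The following is a minimal sketch using only the Python standard library, not an official reader for this dataset. It assumes each file starts with a header row carrying the column names listed above; the function names `read_station` and `missing_fraction` are hypothetical. Empty fields become NaN for numeric columns and `None` for categorical ones, and the Slovene labels are mapped to the English equivalents given in the column list.

```python
import csv
import math
from datetime import datetime

NUMERIC_COLUMNS = ["PM10", "PM2.5", "temperature", "rain", "pressure",
                   "precipitation", "wind_speed"]

# Slovene labels from the column list, mapped to English equivalents.
CLOUDS_EN = {
    "jasno": "clear",
    "delno oblačno": "partly cloudy",
    "pretežno oblačno": "mostly cloudy",
    "oblačno": "cloudy",
}
WIND_DIRECTION_EN = {"S": "N", "SV": "NE", "V": "E", "JV": "SE",
                     "J": "S", "JZ": "SW", "Z": "W", "SZ": "NW"}


def read_station(fileobj):
    """Parse one station CSV (assumed to have a header row) into a list of dicts."""
    rows = []
    for raw in csv.DictReader(fileobj):
        row = {"datetime": datetime.strptime(raw["datetime"], "%Y-%m-%d %H:%M:%S")}
        for col in NUMERIC_COLUMNS:
            value = (raw.get(col) or "").strip()
            row[col] = float(value) if value else math.nan  # empty field -> NaN
        clouds = (raw.get("clouds") or "").strip()
        wind = (raw.get("wind_direction") or "").strip()
        # Unknown labels pass through unchanged; empty fields become None.
        row["clouds"] = CLOUDS_EN.get(clouds, clouds) or None
        row["wind_direction"] = WIND_DIRECTION_EN.get(wind, wind) or None
        rows.append(row)
    return rows


def missing_fraction(rows, column):
    """Share of rows in which `column` is NaN or None."""
    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))
    if not rows:
        return 0.0
    return sum(is_missing(r[column]) for r in rows) / len(rows)
```

A station file such as `E421.csv` could then be read with `open("E421.csv", encoding="utf-8", newline="")` and passed to `read_station`; calling `missing_fraction(rows, "wind_direction")` gives the per-station share of missing wind directions. Timestamps are kept timezone-naive here because the files do not encode a timezone.
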