Enriched Traffic Datasets for Madrid

Published: 3 June 2024 | Version 1 | DOI: 10.17632/697ht4f65b.1


-Description of the Research and Data: This work includes two datasets of traffic in Madrid, "DADAS" and "MLDAS", collected between June 2022 and February 2024. These datasets combine traffic sensor data, weather data, calendar data, road infrastructure data and sensor location data to support detailed analysis of urban mobility and prediction of traffic patterns.
-DADAS (Descriptive Analysis DAtaSet): "DADAS" is a dataset oriented towards descriptive analysis, capturing traffic intensity in 15-minute intervals from sensors on urban streets and roads. Data processing includes cleaning and filtering to remove inconsistencies and outliers. Additionally, the k-means algorithm is used to group data from multiple sensors.
-MLDAS (ML-oriented DAtaSet): "MLDAS" is designed for predictive analysis and modeling of traffic patterns. It is derived from the "DADAS" dataset and has been processed to include temporal transformations and encoding of categorical and ordinal features, preparing it for machine learning applications.
-Data usage: These datasets are intended to support infrastructure planning and sustainable traffic policies, and to serve researchers and urban planners interested in mobility studies and their environmental impact. For more details, see [Submitted to Data in Brief].
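The k-means grouping of sensors mentioned above can be illustrated with a minimal sketch. It assumes each sensor is summarized by the mean, median, and standard deviation of its traffic intensity, the features named under "Steps to reproduce". The sensor IDs, intensity values, the choice of k=2 (the datasets use 300 groups), and the helper names `sensor_features` and `kmeans` are all hypothetical, and the plain-Python Lloyd iteration is only a stand-in for whichever k-means implementation was actually used.

```python
import math
import statistics


def sensor_features(readings):
    """Summarize one sensor's intensity readings as (mean, median, std)."""
    return (
        statistics.mean(readings),
        statistics.median(readings),
        statistics.pstdev(readings),
    )


def kmeans(points, k, iterations=50):
    """Plain Lloyd's algorithm; centroids start at the first k points."""
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so centroids are too
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return labels, centroids


# Hypothetical intensity readings (vehicles per interval) for four sensors.
readings_by_sensor = {
    "S1": [100, 120, 110, 130],
    "S2": [105, 115, 125, 135],
    "S3": [900, 950, 1000, 1050],
    "S4": [920, 940, 1010, 1030],
}

sensor_ids = list(readings_by_sensor)
features = [sensor_features(readings_by_sensor[s]) for s in sensor_ids]
labels, centroids = kmeans(features, k=2)
groups = dict(zip(sensor_ids, labels))
```

Here `groups` maps each sensor ID to a cluster label, so the two low-traffic sensors and the two high-traffic sensors end up in separate groups. With real data, features on different scales would normally be standardized before clustering.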


Steps to reproduce

Data collection process and methods summary for these datasets:
-Traffic Intensity Data Collection: Traffic intensity data were obtained from the Madrid Open Data Portal. These data were initially stored in CSV files, keeping only the columns for sensor ID, date and time of the record, and traffic intensity.
-Sensor Location Data: Detailed data on the geographical position of the sensors were collected from the same Madrid Open Data Portal. The coordinates were added in Well-Known Text (WKT) format to facilitate integration with other geospatial data.
-Labor Calendar Data: Madrid's labor calendar data, which distinguish workdays, holidays, Saturdays and Sundays, were also obtained from the Open Data Portal and integrated. This step makes it possible to analyze how traffic patterns vary according to the type of day.
-Meteorological Data: Weather variables such as temperature, precipitation, and wind were incorporated and aligned with traffic records by date to analyze the influence of weather on traffic.
-Road Information Data: Road information from OpenStreetMap, processed through OSMnx, was used to enrich traffic data with information about road infrastructure. This included transforming the data into a GeoDataFrame and applying a KDTree for nearest-point search on the road network, linking each traffic record with a specific location on the road network.
-Data Optimization and Cleaning: The data were cleaned and organized before analysis. This included removing outliers, eliminating duplicate values, and reorganizing records into hourly intervals. Additionally, the k-means algorithm was applied to segment the data into 300 groups based on characteristics such as mean, median, and standard deviation of traffic intensity.
-ML-oriented Dataset Generation (MLDAS): From the descriptive dataset (DADAS), columns were refined and transformed to prepare them for machine learning analysis. This included trigonometric encoding for time features, standardization of numerical attributes, one-hot encoding for categorical features, ordinal encoding for ordinal features, and handling passthrough features.
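The MLDAS transformations listed above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the pipeline used to build the dataset. The column names (`hour`, `day_type`, `temperature`), the sample records, the `DAY_TYPES` ordering, and the helper functions are hypothetical. Ordinal encoding and passthrough features are left out for brevity.

```python
import math
import statistics


def cyclical_encode(value, period):
    """Map a periodic value (e.g. hour of day) to a (sin, cos) pair."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values) or 1.0  # avoid dividing by zero
    return [(v - mean) / std for v in values]


def one_hot(value, categories):
    """Return a 0/1 indicator list with a single 1 at the matching category."""
    return [1 if value == c else 0 for c in categories]


# Hypothetical records; real column names and values may differ.
records = [
    {"hour": 0, "day_type": "workday", "temperature": 10.0},
    {"hour": 6, "day_type": "holiday", "temperature": 20.0},
    {"hour": 23, "day_type": "saturday", "temperature": 30.0},
]
DAY_TYPES = ["workday", "saturday", "sunday", "holiday"]  # assumed categories

temps = standardize([r["temperature"] for r in records])

encoded = []
for record, temp_z in zip(records, temps):
    hour_sin, hour_cos = cyclical_encode(record["hour"], period=24)
    row = [hour_sin, hour_cos, temp_z] + one_hot(record["day_type"], DAY_TYPES)
    encoded.append(row)
```

The trigonometric encoding places hour 23 next to hour 0 on the unit circle, so a model sees late night and early morning as adjacent rather than 23 units apart.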


Universidad de Zaragoza


Data Analysis, Calendering, Spain, Weather, Data Processing, Traffic Congestion, Road Network


Agencia Estatal de Investigación


Gobierno de Aragón