Daily bus transportation demand - São Paulo/Brazil - Jan/2017 to Jun/2022

Published: 15 September 2022 | Version 1 | DOI: 10.17632/h3m49cm8mt.1


Daily bus demand data for all bus lines in São Paulo/Brazil - from Jan/2017 to Jun/2022.


Steps to reproduce

The raw demand data was obtained from SPTRANS, the public company that manages most of São Paulo's bus network, the city's largest public transportation network. SPTRANS provides bus services during business hours and night shifts, along with dedicated services for impaired and disabled citizens who need support traveling to hospitals and health centers. SPTRANS publishes daily mobility data online for all bus lines in São Paulo, a valuable resource for data mining, urban geography studies, and transportation planning. With this data, it is possible to study the impacts of pandemics on São Paulo's bus network and to gain insight into regional differences and into whether different kinds of lines were affected differently.
The data is distributed in XLSX format and contains the line name, demand by kind of user, such as users who pay with cash, users who use travel cards, and elderly users (who do not pay for travel), and total demand. The line name is a string containing a code made of letters and numbers. Demand data from January 2017 to June 2022 were downloaded and read automatically by a Python script, using the pandas library to build dataframes.
Using SPTRANS data posed significant challenges, such as irregular formatting of row and column names over the years and inconsistent abbreviations of keywords such as terminal stations and metro stations (e.g., "term.", "terminal", "metr", "m"). These issues were solved by extracting the line code and cross-referencing the data with General Transit Feed Specification (GTFS) data, which SPTRANS also provides on an almost weekly basis and which describes the routes and schedules of lines for a given time period. This step enriched the data with standardized names and route points.
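The name standardization described above could look roughly like the following sketch. It is not the authors' script. The code pattern (four alphanumeric characters, a hyphen, two digits), the column names `line` and `total`, the helper names, and the sample rows are all assumptions made for illustration. `route_id` and `route_long_name` are standard GTFS `routes.txt` fields, but whether `route_id` matches the line code in the SPTRANS feed is also assumed here.

```python
import re
import pandas as pd

# Assumed shape of a line code, e.g. "8000-10". Illustrative only,
# not a documented SPTRANS rule.
CODE_PATTERN = re.compile(r"\b([0-9A-Z]{4}-[0-9]{2})\b")


def extract_code(raw_name):
    """Return the line code embedded in a raw line name, or None if absent."""
    match = CODE_PATTERN.search(str(raw_name).upper())
    return match.group(1) if match else None


def standardize_names(demand, routes):
    """Attach a standardized GTFS route name to each demand row via the code."""
    out = demand.copy()
    out["code"] = out["line"].map(extract_code)
    lookup = routes[["route_id", "route_long_name"]].drop_duplicates("route_id")
    # A left join keeps demand rows whose code has no GTFS match.
    return out.merge(lookup, how="left", left_on="code", right_on="route_id")


# Made-up rows: the same line written with different abbreviations.
demand = pd.DataFrame({
    "line": ["8000-10 TERM. EXAMPLE", "8000-10 terminal example", "175t-10 METR EXAMPLE"],
    "total": [100, 120, 80],
})
routes = pd.DataFrame({
    "route_id": ["8000-10", "175T-10"],
    "route_long_name": ["Example Route A", "Example Route B"],
})

standardized = standardize_names(demand, routes)
print(standardized[["line", "code", "route_long_name"]])
```

Matching on the code rather than on the free-text name means that differences such as "term." versus "terminal" no longer affect the join.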
Another difficulty in obtaining the data through automatic scripts is that the download links do not follow a regular naming pattern. A significant number of the downloaded XLSX files were malformed, so that pandas read empty rows as if they contained data, triggering an error related to a prohibitively large number of rows. This was solved by opening each affected file individually and re-saving it. The authors do not understand the origin of this error. All bus lines were enriched with geospatial information extracted from GTFS, containing all points belonging to each route. After these steps, the data was organized in a matrix where each row represents a day and each column represents a line. The geospatial information was stored separately. Since bus lines are created or deactivated with some regularity in São Paulo, only the lines that remained active throughout the observation period were kept in the sample.
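The final day-by-line matrix could be assembled along the lines of the sketch below. It assumes the cleaned data is available as long-format records with hypothetical columns `date`, `code`, and `total`. It also treats a line as "active throughout the period" when it has a record on every observed day, which is one possible reading of the criterion and not necessarily the rule the authors applied. The sample values are made up.

```python
import pandas as pd


def build_demand_matrix(records):
    """Pivot long records into a matrix with one row per day and one column
    per line, keeping only lines that have a value on every observed day."""
    matrix = records.pivot_table(
        index="date", columns="code", values="total", aggfunc="sum"
    )
    # Lines missing on any day show up as NaN and are dropped here.
    return matrix.dropna(axis="columns", how="any")


# Made-up records: line "3000-10" only appears on the second day.
records = pd.DataFrame({
    "date": pd.to_datetime(
        ["2017-01-01", "2017-01-01", "2017-01-02", "2017-01-02", "2017-01-02"]
    ),
    "code": ["1000-10", "2000-10", "1000-10", "2000-10", "3000-10"],
    "total": [10, 20, 11, 21, 5],
})

matrix = build_demand_matrix(records)
print(matrix)
```

A stricter or looser activity rule (for example, tolerating a few missing days) would change only the `dropna` step.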


Universidade Nove de Julho


Transportation by Region, Brazil, Transportation Demand