Tweets with traffic-related labels for developing a Twitter-based traffic information system.

Published: 23-03-2018 | Version 1 | DOI: 10.17632/c3xvj5snvv.1
Sina Dabiri


This dataset contains tweets that have been collected through the Twitter search API. Each tweet has been classified into one of three categories:

1) Non-Traffic (NT)-Class 0: Any tweet that does not fall into the other two categories is labeled as NT.

2) Traffic Incident (TI)-Class 1: This type of tweet reports non-recurring events that generate an abnormal increase in traffic demand or reduce transportation infrastructure capacity. Examples of non-recurring events include traffic crashes, disabled vehicles, highway maintenance, work zones, road closures, vehicle fires, traffic signal problems, special events, and abandoned vehicles. Since the ultimate goal of our framework is to inform users and agencies of the occurrence of a traffic incident in real time, a tweet that reports the clearance or re-opening of roads previously affected by a non-recurring traffic event is classified as TCI, the third tweet category. Such tweets provide information on the current status of the network rather than reporting an ongoing traffic incident.

3) Traffic Conditions and Information (TCI)-Class 2: This type of tweet reports traffic flow conditions such as daily rush hours, traffic congestion, traffic delays due to high traffic volume, and jammed traffic. Any tweet that disseminates new traffic rules, traffic advisories, or other information on transport infrastructure (e.g., new facilities or a change in the direction of a street) is also classified as TCI.

Two types of datasets are available:

1) 2-class dataset, in which tweets are categorized into traffic-related tweets (i.e., TI and TCI) and non-traffic-related tweets (i.e., only NT).

2) 3-class dataset, in which tweets are categorized into three groups: NT, TI, and TCI.

Each type has its own training and test sets. Each CSV file has three columns:

First Column: Tweet class number according to the above definitions.
Second Column: Tweet id fetched from the Twitter API. Note that the 's' character should be removed.
Third Column: The tweet text, which is used for analysis.

Other attributes of each tweet (e.g., user’s screen_name, UTC time when the tweet was created, tweet’s unique ID, the geographic location of the tweet when posted) can be retrieved using the tweet id.
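
The three-column layout described above can be read with a few lines of Python. The following is a minimal sketch and not part of the published dataset. It rests on several assumptions: the sample rows are invented for illustration; the description does not say where the 's' character sits within the id, so the code simply removes every 's' from the id field; and the 2-class mapping (0 for NT, 1 for TI or TCI) is a guess at how the 2-class files are numbered. The function names are hypothetical.

```python
import csv
import io


def clean_tweet_id(raw_id: str) -> str:
    """Remove the 's' character from the id field (position assumed unknown)."""
    return raw_id.strip().replace("s", "")


def read_tweets(handle):
    """Yield (class_number, tweet_id, text) tuples from a three-column CSV stream."""
    for row in csv.reader(handle):
        # Skip short rows and any header row whose first field is not a class number.
        if len(row) < 3 or not row[0].strip().isdigit():
            continue
        yield int(row[0]), clean_tweet_id(row[1]), row[2]


def to_two_class(label: int) -> int:
    """Assumed mapping: NT (0) stays 0; TI (1) and TCI (2) become 1."""
    return 0 if label == 0 else 1


# Invented sample rows standing in for one of the CSV files.
sample = io.StringIO(
    '1,s975001234567890123,"Crash on I-81 NB, right lane closed"\n'
    '2,s975001234567890124,"Heavy rush hour delays downtown"\n'
    '0,s975001234567890125,"Great coffee this morning"\n'
)

rows = list(read_tweets(sample))
binary_labels = [to_two_class(label) for label, _, _ in rows]
```

To read an actual file, replace the `io.StringIO` sample with `open(path, newline="", encoding="utf-8")`, where `path` points to one of the downloaded CSV files. The cleaned ids are kept as strings so that long numeric ids are not altered by numeric conversion.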