A geospatial workflow for the assessment of public transit system performance using near real‐time data
1361-1682
This article presents the development of a Geographical Information Systems (GIS) workflow that harvests high‐volume and high‐frequency near real‐time data from a public General Transit Feed Specification (GTFS) and calculates metrics for the assessment of on‐time and route speed performance for a public transit system. The approach is applied to near real‐time and static GTFS data collected over a 9‐month period for the City of Calgary, Alberta, Canada. The workflow uses two Azure Virtual Machines (VMs), one to harvest the data and the other to process observations in parallel using Python and the ArcGIS API libraries. A Web GIS application is described that queries data from MongoDB to visualize the performance results in spatiotemporal form. The purpose of the workflow and Web GIS application is to provide actionable information to transit planners to improve public transportation systems. The data management and analysis workflow is transferable to similar GTFS data from other cities.
<sbt id="AN0157689751-2">INTRODUCTION</sbt>
Essential services are defined as basic necessities that are accessible to the general public, and that contribute to their well‐being and convenience. Examples of essential services include healthcare, response units (e.g., police, fire, EMS), clean water, utilities, sanitation, communications, vital goods (e.g., fuel, groceries), and transportation (Pan American Health Organization, 2016). Together, these interconnected systems form an infrastructure that sustains modern life. The COVID‐19 pandemic has revealed the highly fragile nature of such essential services and has reminded societies around the world that, in the face of adversity, system adaptation and resilience are imperative (OECD, 2021; Sneader & Lund, 2020). One important infrastructure component is public transit (Luo, Gee, Piccoli, Work, & Samaranayake, 2022).
All public transit agencies strive to provide a good quality of service to the populations they serve by enabling passengers to travel smoothly at an affordable fare (Desaulniers & Hickman, 2007). However, achieving this goal is a challenge that must maintain a delicate balance between asset and operational costs, while achieving strategic, operational, and tactical planning goals (Desaulniers & Hickman, 2007). Strategic planning involves maximizing service quality under budgetary constraints, whereas operational concerns focus on achieving service delivery targets while minimizing costs. Tactical planning, on the other hand, seeks to improve service quality by resolving the details of transit services, such as route definition, the frequency of service, and setting departure and arrival schedules (Desaulniers & Hickman, 2007). In the face of these challenges, many transit agencies are seeking solutions to make their infrastructure more resilient and adaptable. One emerging approach to this is the development and implementation of information and communication technologies (ICT) to manage and plan the replacement of aging infrastructure (Bueti & Faulkner, 2013; Pee & Pan, 2021; United Nations, 2015). The approach discussed in this article addresses both of these needs. Specifically, it develops a replicable workflow that harvests and analyzes routinely collected and publicly accessible urban bus service data and presents these data in a web GIS application that allows current bus service efficiency and reliability to be assessed. The workflow, metric calculation and visualization support an assessment of system reliability that informs both current management, and future system planning. The longer‐term goal of this research is to provide a straightforward means for transit agencies to improve their existing transit network reliability and service provision to members of the public.
The article first reviews recent research on public transit services, highlighting the need for assessment of service efficiency and reliability. A replicable workflow is then presented to collect the data required for service assessment, using the city of Calgary as a case study. The processing steps in this workflow are explained in detail, along with the resulting datastore. The data are then analyzed to highlight system performance over both time and space. A visualization dashboard is presented that allows transportation planners to query system performance and quickly identify areas where reliability and efficiency of service performance are sub‐optimal for members of the public. The article concludes with a summary of its major contributions and suggestions for future research. Documentation for the application and the associated data processing source code, in the form of a modifiable Python package, is made available on GitHub via the URL presented in the conclusion.
LITERATURE REVIEW
Public transit (PT) systems are an essential service in modern cities. They provide an affordable mobility option for urban residents, reduce negative environmental impacts (i.e., traffic congestion), and have the potential to stimulate economic growth through improved access to commercial services (Cats & Jenelius, 2014; Wang, Xue, Zhao, & Wang, 2018). These positive contributions require that a PT system is efficient, robust, adaptable, and reliable. An efficient system must be affordable, convenient, accessible, and comfortable for transit users. Robustness requires resiliency from disturbances along the transportation network including infrastructure repair and vehicle malfunctions (Cats & Jenelius, 2014). Adaptability implies that the system is subject to evaluation and change to improve both its efficiency and robustness. Collectively, the attributes of efficiency, robustness, and adaptability also contribute to the overall reliability of a PT system.
Reliability and speed are, arguably, the most significant measures of transit performance for transit agencies and passengers (Hu & Shalaby, 2017). From a transit agency's perspective, having an unreliable and/or slow PT system negatively impacts ridership and operational costs (Hu & Shalaby, 2017; Perk, Flynn, & Volinski, 2008; Polus, 1978). This can lead to a negative feedback loop for PT systems by forcing passengers to seek alternatives, usually personal vehicles, which increases the number of cars on the road and exacerbates traffic congestion, further reducing the reliability of PT services (Arriagada, Gschwender, Munizaga, & Trépanier, 2019; Figliozzi, Feng, Lafferriere, & Feng, 2012; Tirachini, Godachevich, Cats, Muñoz, & Soza‐Parra, 2021). Additionally, poor reliability often leads to bus bunching (i.e., little to no gap between consecutive buses), excessive travel time, and passenger overcrowding in vehicles (Hu & Shalaby, 2017). Levinson (1991) supports this observation in noting that headway variation (i.e., less or more bus bunching) is the primary characteristic of an unreliable PT service.
Poor reliability of PT can also be caused by a combination of internal and external factors. Internal factors include the complexity of the system, characterized by factors such as sinuosity, the length of bus routes, the number of stops, and the spacing of boardings (Strathman & Hopper, 1993). External factors include weather, disruptions from on‐street parking, road maintenance, signal timing, and traffic congestion and incidents (Strathman & Hopper, 1993). Transit managers can improve PT reliability, and address both internal and external factors through effective decision‐making based on appropriate transit metrics and analysis (Cramer, Cucarese, Tran, Lu, & Reddy, 2009; Wood, 2015).
Effective implementation of ICTs in transportation to monitor performance and diagnose problems has the potential to make an existing PT network more efficient, robust, adaptable, and reliable without the need to alter the physical network itself (Battarra et al., 2018; Cohen‐Blankshtain & Rotem‐Mindali, 2016; Snellen & Hollander, 2017; Zhao, 1997). This use of ICTs is generally known as intelligent transportation systems (ITS), which can be subdivided into two categories, namely intelligent infrastructure systems (IIS) and intelligent vehicle systems (IVS). IIS focuses on operational perspectives of the system (e.g., PT operations, road management), whereas IVS focuses on the user's perspectives of the system (e.g., safety, productivity) (Anagnostopoulos, Anagnostopoulos, Loumos, & Kayafas, 2006; Cohen‐Blankshtain & Rotem‐Mindali, 2016). Both perspectives have the potential to alter mobility patterns by optimizing a PT system's capacity and influencing residents' travel behavior through the calculation of different forms of transit metrics (Cohen‐Blankshtain & Rotem‐Mindali, 2016; Morfloulaki, Myrovali, & Kotoula, 2015). The workflow presented in the next section focuses on the IIS component of an ITS, in particular, its impacts on transit reliability.
Transit metrics can be derived from a variety of data sources including automatic vehicle location (AVL) tracking, automated passenger counts (APC), automatic fare collection (AFC) devices, real‐time passenger information (RTPI) at bus stops and on‐board, and vehicle operational data gathering and archiving systems (Politis, Papaioannou, Basbas, & Dimitriadis, 2010). AVL data are commonly used for analysis purposes given the relative ease of installing global positioning system (GPS) devices on PT vehicles. Vehicle GPS devices allow the capture of real‐time spatiotemporal data that can be used to assess reliability under different traffic conditions (Harsha, Mulangi, & Kumar, 2019; Mazloumi, Currie, & Rose, 2009). Additionally, AVL can be combined with RTPI systems to improve passenger satisfaction by minimizing actual wait times and providing passengers with easily accessible information on projected vehicle arrival and departure times (Politis et al., 2010).
Based on these factors, many research studies have used AVL data capture to explore transit reliability. For instance, Harsha et al. (2019) used AVL to analyze travel time distribution at different spatiotemporal scales. Mesbah, Lin, and Currie (2015) acquired a historical AVL dataset to examine the effect of weather conditions on the travel time reliability of Melbourne's streetcar network. Chakrabarti and Giuliano (2015) found service reliability to be a significant indicator of patronage through the combined use of AVL and APC data for the Los Angeles Metro bus transit system.
Collectively, AVL has enabled transit analysts to measure PT reliability based on one or more of the nine common metrics given in Table 1 (after Currie, Douglas, & Kearns, 2012).
TABLE 1 Common transit metrics to measure PT reliability
The dependence of the workflow described in the next section on AVL data and scheduled timetables limited the actual number of metrics that could be examined from Table 1 in this research. These include percentage of buses cancelled, percentage of service departing on‐time, percentage of service arriving on‐time, excess wait time, average lateness, and service variability indicators. Of these metrics, the percentage of vehicles arriving on‐time is particularly relevant since it is relatively easy to calculate and interpret and only requires AVL data capture and an arrival schedule, with no need to factor in passenger loads (i.e., boarding and alighting times). Additionally, on‐time performance provides a useful diagnostic for transit agencies to identify which stops on which routes need further assessment. Thus, on‐time performance is used as the measure of PT service reliability for this study.
Numerous previous studies have used GTFS to evaluate transit reliability metrics. For example, Bast, Brosi, and Storandt (2014) developed an efficient solution that displays the current state (e.g., delays, nearby vehicles, transit coverage) of 80 transit networks using GTFS data. Stewart, Diab, Bertini, and El‐Geneidy (2016) collected near real‐time GTFS data to generate performance measures based on their relevance to transit planners, including overlapping service areas, approximate location of each bus every 30 s, wait time at stops, and bus and bicycle‐share systems overlap. Wessel, Allen, and Farber (2017) describe a method that retroactively improves the accuracy of the GTFS using the NextBus API for the Toronto Transit Commission (TTC). This study was extended by Wessel and Farber (2019) to compare the actual versus expected operation of the TTC. Kunama, Worapan, Phithakkitnukoon, and Demissie (2017) developed a desktop tool that preprocesses GTFS data and visualizes animations of it on a transit network map, including the number of public transport vehicles per hour and the selection of routes, trips, and stations. Similarly, Prommaharaj, Phithakkitnukoon, Demissie, Kattan, and Ratti (2020) created a dynamic PT operation visualization tool to display mobility, speed, flow, density, headway, and analysis using static GTFS files provided by Calgary Transit.
This article extends this body of research by developing a replicable workflow tool for parallel processing of near real‐time GTFS data to improve efficiency. This article also documents the development of a web‐based visualization dashboard to interrogate PT operations dynamically, including per route and per hour of day selected over an adjustable date range. This visualization provides a relatively fine‐grained perspective on the functioning and reliability of the system, its constituent bus routes, and individual bus stops to aid transportation planners in assessing the causes of reliability problems. The workflow and visualizations are derived from raw GTFS data downloaded from Calgary Transit's open data feed, as described in the next section.
DATA
The GTFS data standard used in this article is commonly provided through public‐facing web portals by public transit agencies around the world to make their transit data accessible to researchers in an established format that comprises static and real‐time information. International examples of these GTFS feeds include the MBTA (https://www.mbta.com/developers/gtfs‐realtime) (Boston), STM (https://developpeurs.stm.info/documentation/gtfsrtv2) (Montreal), TfNSW (https://opendata.transport.nsw.gov.au/) (Sydney), and HSL (https://hsldevcom.github.io/gtfs%5frt/) (Helsinki). A comprehensive list of transit agencies that publish their GTFS data can be viewed here (https://transitfeeds.com/).
The static GTFS data streamed for public access by Calgary Transit is provided as eight delimited text files that include contact details of the transit agency, geographic location of the transit routes and stops, and schedule information (Table 2). With the exception of
TABLE 2 Static GTFS files (after Prommaharaj et al., 2020)
From the files described in Table 2,
TABLE 3 Summary of Calgary transit system per GTFS update
The near real‐time GTFS (GTFS‐RT) data (separate from the static GTFS data in Table 2) record frequent updates (i.e., every 30 s) of each transit vehicle's GPS location that includes information about the traffic the vehicle experiences. Traffic information can include travel speed, congestion level, vehicle stop status, occupancy status, location, and timestamp (Google Developers, 2021). The GTFS‐RT file is packaged in binary protocol buffer (.pb) format, which must be parsed and translated into XML, JSON, or other machine readable formats. The GTFS‐RT data provided by Calgary Transit include the positions (i.e.,
The workflow implemented to harvest and assemble these data into a form that allows reliability, as measured by on‐time and route speed performance, of the Calgary PT system is described in the next section.
DATA WORKFLOW
The most challenging aspect of working with GTFS data (real‐time and static) is establishing a reliable workflow that can harvest near real‐time and static GTFS files and estimate transit metrics, while controlling computation and technical overhead costs for the data collection, storage, and analytics. For ease of readability, the workflow shown in Figure 2 is split into three sections, namely data harvesting, data processing, and data storage.
Each component is managed in its own Microsoft Azure virtual machine (VM) with the latter chained dependently to the former based on scheduled tasks. For example, the data processing is scheduled to start at the end of the evening (e.g., 10:00 p.m.) once the data harvesting task is complete, and the data storage receives transit metric output when data processing is complete. Azure VMs were chosen for development purposes due to the flexibility of server type, range of database choices, ability to scale computational capacity (e.g., storage, RAM, and CPUs) on‐the‐fly, and the ability to set automated tasks (i.e., start up and deallocate).
Data harvesting
In the Azure portal, an automated task containing the data harvesting component is set to start up a dual‐core Ubuntu 20.04 VM (i.e., mini VM) once per day at 6:00 a.m. Mountain Time and shut down at 8:00 p.m. This time window minimizes runtime costs in the long‐term while capturing the most important travel periods (i.e., morning, afternoon, and evening) to allow analysis of spatiotemporal variations in performance of the PT system. Upon startup, a scheduled task initiates a custom Python script which harvests near real‐time GTFS data every 30 s and appends the return to a CSV file named for the date of collection (e.g., "GTFSRT_Calgary_2021‐6‐28.csv"). The most notable Python packages used in this code are
At the end of the harvesting period, the size of the CSV file on an average weekday is 72 MB or 600,000 observations of all available vehicle locations within the Calgary Transit network. Days on the weekend produced files about half that size. This aggregates to approximately 1.4 GB (i.e., 14.4 million rows) per month, or a total of 11 GB of raw data collected from the study period of January 2021 to August 2021. Before shutting down the VM, another task is scheduled to transfer the CSV file securely to another Azure VM for data processing using the
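The harvesting loop described in this subsection can be sketched as follows. This is a minimal illustration, not the authors' script: the function names, the column layout, and the injected `fetch` callable are hypothetical, and decoding the protocol buffer with the `gtfs-realtime-bindings` package is an assumption.

```python
import csv
import time
from datetime import date, datetime
from pathlib import Path

# Hypothetical column layout for one vehicle observation.
FIELDS = ["timestamp", "trip_id", "vehicle_id", "latitude", "longitude"]


def daily_csv_name(day: date, city: str = "Calgary") -> str:
    """Build a per-day file name such as GTFSRT_Calgary_2021-6-28.csv."""
    return f"GTFSRT_{city}_{day.year}-{day.month}-{day.day}.csv"


def parse_vehicle_positions(payload: bytes) -> list:
    """Decode a GTFS-RT protocol buffer into plain dictionaries.

    Assumes the third-party gtfs-realtime-bindings package is installed.
    It is imported lazily so the rest of the sketch runs without it.
    """
    from google.transit import gtfs_realtime_pb2

    feed = gtfs_realtime_pb2.FeedMessage()
    feed.ParseFromString(payload)
    rows = []
    for entity in feed.entity:
        if not entity.HasField("vehicle"):
            continue
        v = entity.vehicle
        rows.append({
            "timestamp": v.timestamp,
            "trip_id": v.trip.trip_id,
            "vehicle_id": v.vehicle.id,
            "latitude": v.position.latitude,
            "longitude": v.position.longitude,
        })
    return rows


def append_rows(path: Path, rows: list) -> None:
    """Append rows to the CSV, writing the header only for a new file."""
    is_new = not path.exists()
    with path.open("a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)


def harvest(fetch, out_dir: Path, polls: int, interval_s: float = 30.0) -> Path:
    """Call fetch() `polls` times, pausing interval_s seconds between calls."""
    path = out_dir / daily_csv_name(datetime.now().date())
    for i in range(polls):
        append_rows(path, fetch())
        if i < polls - 1:
            time.sleep(interval_s)
    return path
```

In a deployment, `fetch` would wrap an HTTP request to the agency's GTFS‐RT endpoint and pass the response bytes to `parse_vehicle_positions`; injecting it as a callable keeps the file handling independent of the network layer.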
Data processing
The data processing step in Figure 2 seeks to optimize runtime performance. It is run in parallel to reduce runtime costs on a 96‐core Azure Ubuntu 20.04 VM with ArcGIS Server installed. This VM was chosen because it had the highest number of cores and RAM (384 GB) available in Azure to maximize parallel processing performance and to achieve the level of computing capacity required. To measure the runtime difference, this component was also run on a 4‐core PC with hyperthreading, where it took about 4 h to complete. The same task with parallel processing on the 96‐core VM completes within 20 min. This substantial runtime difference surpasses Amdahl's Law (Amdahl, 1967) in practice, since approximately 95% of the processing steps run in parallel.
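For reference, Amdahl's Law gives the theoretical speedup of a task when a fraction of it runs in parallel on a number of workers. The sketch below evaluates it for the figures quoted above. The 95% parallel fraction and the runtimes come from the text; treating the baseline as a single worker is an assumption, because the degree of parallelism used on the 4‐core PC is not stated.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Theoretical speedup over a single worker under Amdahl's Law."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Observed ratio of the two runtimes reported in the text (4 h versus 20 min).
observed_ratio = (4 * 60) / 20

# Theoretical speedup on 96 workers relative to one worker (assumed baseline).
speedup_96 = amdahl_speedup(0.95, 96)

# Upper limit as the number of workers grows without bound.
upper_limit = 1.0 / (1.0 - 0.95)

print(f"observed runtime ratio: {observed_ratio:.1f}x")
print(f"Amdahl speedup, 96 workers: {speedup_96:.1f}x")
print(f"Amdahl limit, unlimited workers: {upper_limit:.1f}x")
```

A different baseline (for example, all hardware threads of the PC already in use) changes the comparison, so the numbers are illustrative only.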
The processing steps are shown in the right‐hand side of Figure 3. Specifically, these are: (1) Check static GTFS for updates; (2) Geoprocess vehicle locations; (3) Extract geographic information; (4) Quality Assurance (QA) and Quality Check (QC); (5) Enrich data; (6) Interpolate speed & arrival time; (7) Calculate metrics; and (8) Export to MongoDB. Steps 1 through 7 run in parallel.
Similar to the data harvesting component, an automated daily task is set to start up the 96‐core VM at 8:00 p.m. Mountain Time. This initiates the entire data processing workflow, starting with an update check of the static GTFS files. This first step is crucial as it ensures that the transit metrics are calculated correctly by aligning the date of the harvested GTFS data with the most recent schedule (i.e., stop_times.txt) and its corresponding routes (i.e., shapes.txt, trips.txt) and stops (i.e., stops.txt) information (Table 2). The update check is completed by extracting the most recent date of the GTFS static files from a transit feed site (https://transitfeeds.com/p/calgary‐transit/238) provided by Calgary Transit and comparing it with the date of the harvested CSV file. For example, if the CSV file was collected on March 15th and the existing GTFS static files are from February 15th with the most recent update from the site remaining the same, then there is no need to update. On the other hand, if the most recent update from the site is March 13th, then these files are to be used until the next GTFS update is available.
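The update rule in the example above can be expressed as a small date comparison. This is a sketch under stated assumptions: the function name and arguments are hypothetical, and ignoring a release published after the harvest date is an assumption not spelled out in the text.

```python
from datetime import date


def select_static_feed(current_feed: date, latest_published: date, harvest_day: date) -> date:
    """Return the release date of the static GTFS files to use for harvest_day.

    A newer release replaces the current one only if it was published
    on or before the day the real-time data were harvested (assumption).
    """
    if current_feed < latest_published <= harvest_day:
        return latest_published
    return current_feed


harvest_day = date(2021, 3, 15)

# Site still lists the February 15th release: keep the existing files.
print(select_static_feed(date(2021, 2, 15), date(2021, 2, 15), harvest_day))

# Site lists a March 13th release: switch to it.
print(select_static_feed(date(2021, 2, 15), date(2021, 3, 13), harvest_day))
```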
The GTFS update process includes the creation of transit stops per route and individual routes as Esri shapefiles. Shapefiles were chosen rather than feature classes in a geodatabase because they can be created and accessed in parallel. For the individual routes, two shapefiles are created in Step 1 for the requirements of Steps 2–4 and 6. An undissolved file contains individual polyline segments that represent a path toward each transit stop for that route and a second file dissolves the segments from the first file to optimize processing efficiency in the geoprocessing step.
The geoprocessing vehicle locations step (Step 2) identifies the exact location for each
While these steps may initially seem to be redundant, they reduce the processing time of Step 2 by up to 80% as opposed to identifying directly which undissolved line the
TABLE 4 Data structure when extracting the vehicle's geographic information
Most transit networks will have bus routes with looped paths (e.g., cul‐de‐sacs and loops around a neighborhood route) that return along the same path but travel in the opposite direction. Cases such as these are problematic in a GIS because they contain overlapping polylines: whenever a vehicle location is snapped to a loop path, it will output duplicate segments with different index values, as shown in Figure 5.
Without identifying and deleting the improper index values, the result can produce problematic outputs downstream in the workflow including miscalculated transit metrics. Step 4 (i.e., Quality Check and Quality Assurance) prevents this from happening by assessing the trend of the index order of the identified undissolved segments and omitting cases that are out of place (Table 5). In Table 5, the trending index order goes from 193 to 225.
TABLE 5 Example of duplicate segment (in orange)
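One way to picture the QA/QC idea is a filter that keeps, for each vehicle observation, only the candidate segment index that continues the forward trend along the route. The sketch below is a hypothetical heuristic, not the rule used in the workflow: the function name, the `max_jump` tolerance, and the sample indices (apart from the 193 to 225 range mentioned above) are assumptions.

```python
from typing import List, Optional


def resolve_loop_duplicates(candidates: List[List[int]], max_jump: int = 20) -> List[Optional[int]]:
    """Pick one segment index per observation by following the forward trend.

    candidates[k] holds every segment index the k-th vehicle location snapped
    to; a loop path yields more than one. An observation with no candidate
    inside [last, last + max_jump] is marked None (out of place).
    """
    resolved: List[Optional[int]] = []
    last: Optional[int] = None
    for options in candidates:
        if last is None:
            # Assumption: the first observation lies on the earliest candidate.
            choice: Optional[int] = min(options)
        else:
            forward = [i for i in options if last <= i <= last + max_jump]
            choice = min(forward) if forward else None
        resolved.append(choice)
        if choice is not None:
            last = choice
    return resolved


# Two observations snapped to overlapping loop segments (indices 310 and 305).
snapped = [[193], [201, 310], [209], [305, 217], [225]]
print(resolve_loop_duplicates(snapped))
```

The tolerance would need tuning to segment length and polling interval; a value that is too small discards valid observations after a long gap between recordings.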
The fifth step (i.e., Enrich Data and append GTFS schedule) adds more features to the clean data structure. These features include the maximum index of the undissolved segments in the route, the maximum stop sequence, the number of stops remaining from the maximum stop sequence, arrival time and departure time from the
Step 6 calculates the projected vehicle speed and estimated bus arrival time per transit stop from the start of the recorded
The idea of the first sub‐step is to expand the current data structure between consecutive
The estimated travel time is added cumulatively from the recorded first point's timestamp to each stop it passes through (if applicable) until reaching the second point (2nd of the consecutive pair). The final sub‐step (i.e., Step 7) calculates the time delta between the estimated arrival time and expected arrival time provided by
TABLE 6 Sample output of the data structure after the interpolation process
As an example of the on‐time performance for a specific trip, in the 1st row of Table 6, TripID 57010620 was first recorded by the GTFS‐RT on 6/28/2021 at 19:22:00. The approximate location of the bus is identified as en route to StopID 6025 with a remaining distance of 0.122 km. The next recording (3rd row) is used to calculate speed (Proj. Speed) based on distance traveled over time delta. Projected speed and the remaining distance are used to estimate the arrival time (Est. Arr. Time), which is 19:22:24. The expected arrival time (Exp. Arr. Time) for StopID 6025 is 19:21:00. Hence, the arrival time delta indicates the bus would arrive 84 s later than scheduled. The 2nd row shows the movement status (Mvm) set to "between" (Btwn.) because the vehicle passed by StopID 7563 with no recording. The first consecutive recording is set to a movement status (Stat) of 0. This value increases by 1 with each subsequent consecutive recording.
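The arithmetic of this example can be reproduced in a few lines. The timestamps, remaining distance, and 84 s result come from the text; the function names are hypothetical, and the projected speed of 18.3 km/h is back‐calculated from the stated 24 s travel time rather than taken from Table 6.

```python
from datetime import datetime, timedelta


def project_speed_kmh(distance_km: float, elapsed_s: float) -> float:
    """Speed from distance travelled between two consecutive recordings."""
    return distance_km / (elapsed_s / 3600.0)


def estimate_arrival(recorded_at: datetime, remaining_km: float, speed_kmh: float) -> datetime:
    """Add the travel time for the remaining distance to the recording time."""
    travel_s = remaining_km / speed_kmh * 3600.0
    return recorded_at + timedelta(seconds=round(travel_s))


def arrival_delta_s(estimated: datetime, expected: datetime) -> int:
    """Positive values mean the bus arrives later than scheduled."""
    return int((estimated - expected).total_seconds())


recorded = datetime(2021, 6, 28, 19, 22, 0)   # first GTFS-RT recording
expected = datetime(2021, 6, 28, 19, 21, 0)   # scheduled arrival at StopID 6025
speed_kmh = 18.3                              # assumed: 0.122 km covered in 24 s

estimated = estimate_arrival(recorded, 0.122, speed_kmh)
delta_s = arrival_delta_s(estimated, expected)
print(estimated.time(), delta_s)
```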
After interpolation, the penultimate step (i.e., calculate metrics) calculates the actual headway and extracts the results of on‐time performance. The output for actual headway is computed by subtracting the expected arrival time interval (i.e., expected headway) between two different consecutive
TABLE 7 Sample data structure of expected headway versus actual headway
Currie et al. (2012) noted that the Independent Transport Safety and Reliability Regulator (ITSRR) of New South Wales, the Department of Victoria, and Transport for London (TfL) each used an on‐time threshold of 2 min early and 5 min late, whereas 83 transit agencies in the United States adopted thresholds of 1 min early and 5 min late. Hence, there is no global standard for on‐time performance; rather, it tends to be defined differently by different agencies.
For the purposes of this article, on‐time performance is evaluated from the interpolation process within a 2 to 5 min time frame by classifying each observation as early (more than 2 min ahead of schedule), on‐time (from 2 min early to 5 min late), or late (more than 5 min behind schedule). Importantly, these thresholds can be easily changed to review the impact of different times on the same data. In this case, the evaluation results in two final shapefile outputs. The first is the percentage of on‐time buses per stop per hour per route, and the second is similar except aggregated per day. Formatted date, average speed and arrival time, count of trips (i.e.,
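The classification and the per stop, per hour, per route aggregation can be sketched as below. This is an illustration under assumptions: the arrival delta is taken as estimated minus scheduled arrival in seconds (positive means late, as in the Table 6 example), the thresholds follow the 2 min early and 5 min late convention, and the function names and sample identifiers are hypothetical.

```python
from collections import defaultdict

EARLY_LIMIT_S = -2 * 60   # more than 2 min ahead of schedule counts as early
LATE_LIMIT_S = 5 * 60     # more than 5 min behind schedule counts as late


def classify(delta_s: float, early_limit_s: float = EARLY_LIMIT_S,
             late_limit_s: float = LATE_LIMIT_S) -> str:
    """Label one observation from its arrival delta (positive = late)."""
    if delta_s < early_limit_s:
        return "early"
    if delta_s > late_limit_s:
        return "late"
    return "on-time"


def percent_on_time(observations) -> dict:
    """Aggregate (route_id, stop_id, hour, delta_s) tuples into percentages."""
    totals = defaultdict(int)
    on_time = defaultdict(int)
    for route_id, stop_id, hour, delta_s in observations:
        key = (route_id, stop_id, hour)
        totals[key] += 1
        if classify(delta_s) == "on-time":
            on_time[key] += 1
    return {key: 100.0 * on_time[key] / totals[key] for key in totals}


sample = [
    ("149", "6025", 19, 84),     # 84 s late: on-time
    ("149", "6025", 19, 400),    # late
    ("149", "6025", 19, -30),    # on-time
    ("149", "6025", 19, -200),   # early
]
print(percent_on_time(sample))
```

Passing different limits to `classify` is enough to compare, for example, the 1 min early threshold used by many agencies in the United States against the same observations.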
Data storage
GTFS‐RT data tend to come at high volume and high velocity and therefore the selected data store must be able to handle potentially huge amounts of data that arrive in a short period of time. Specifically, the selected data store needs to be horizontally scalable to support read/write operations in parallel as the size of the stored data and number of end users querying the data are both likely to grow over time. When a system is horizontally scalable, it supports adding more servers to the resource pool as needed. Although relational databases are vertically scalable, meaning they support increasing the processing power of a single server, most relational databases are not horizontally scalable.
Hence, when it comes to the need for a highly scalable distributed storage system, NoSQL data stores stand out. MongoDB is an open‐source NoSQL document‐based data store that is capable of handling large volumes of structured, semi‐structured, or even unstructured data efficiently. It provides a reliable solution through an implementation known as a replica set. This implementation is a collection of MongoDB instances that all store identical data, and as such it provides a highly available form of data storage that has high fault tolerance and can survive in the event of some of its components failing. Another benefit of MongoDB is its ability to store spatial data efficiently through the GeoJSON object types.
Although the data collected for the workflow described in this article are small in the context of "big" data (e.g., 100 GB+), a highly scalable NoSQL solution such as MongoDB becomes relevant with continued temporal collection of these data from Calgary Transit and if multiple users want to query the data simultaneously. For this research, MongoDB was installed in a 4‐core CPU Azure VM with Ubuntu Server 20.04 to store the GTFS data after processing. Using PyMongo (https://pymongo.readthedocs.io/en/stable/) and the ArcGIS API for Python packages, a custom tool was successfully created to query transit metrics on‐the‐fly, convert the query return to a spatial dataframe, and visualize the results in a custom PT service performance dashboard.
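As an illustration of storing a metric with a GeoJSON geometry, the sketch below builds one document per stop and hour and writes a batch with PyMongo. The field names, database and collection names, and connection URI are hypothetical; only the GeoJSON `Point` structure (longitude first) and the PyMongo calls are standard.

```python
from datetime import datetime


def stop_metric_document(route_id: str, stop_id: str, lon: float, lat: float,
                         observed_hour: datetime, pct_on_time: float,
                         avg_speed_kmh: float) -> dict:
    """Build one document; the geometry is a GeoJSON Point (lon, lat order)."""
    return {
        "route_id": route_id,
        "stop_id": stop_id,
        "observed_hour": observed_hour,
        "pct_on_time": pct_on_time,
        "avg_speed_kmh": avg_speed_kmh,
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
    }


def store_documents(documents: list, uri: str = "mongodb://localhost:27017",
                    db_name: str = "transit", collection_name: str = "stop_metrics") -> int:
    """Insert documents and ensure a 2dsphere index on the geometry field.

    Requires the third-party pymongo package and a reachable MongoDB server;
    the import is local so the document builder can be used without either.
    """
    from pymongo import GEOSPHERE, MongoClient

    client = MongoClient(uri)
    collection = client[db_name][collection_name]
    collection.create_index([("geometry", GEOSPHERE)])
    result = collection.insert_many(documents)
    return len(result.inserted_ids)


doc = stop_metric_document("149", "6025", -114.07, 51.05,
                           datetime(2021, 6, 28, 19), 85.0, 29.0)
print(doc["geometry"])
```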
DATA VISUALIZATION
In order to visualize the on‐time reliability of bus arrivals and departures at all stops on the network for route and adjustable time‐specific queries, a web GIS dashboard was developed using Esri Calcite maps with accompanying descriptive analytics using the ArcGIS API for JavaScript. The core metrics used to assess reliability, all calculated "on‐the‐fly", are the average percentage of on‐time arrivals, vehicle speed (km/h), and arrival time variance (seconds). The user interface for the dashboard is shown in Figure 7. The metrics respond dynamically to parameters such as the map's current geographic extent (as the user zooms and pans), the selection of a specific bus route, and an hourly animation that displays spatiotemporal changes in system performance. In a subsequent implementation, headway metrics will be added to the dashboard.
To display core metrics on the dashboard, the user is required first to select a date range. The date range is contingent on the available data stored in MongoDB, which in this case range from January 2021 until the most recent day of collection. Using the date range query, the time slider widget automatically populates the start and end date with the default interval set to hourly. The user can freely change the interval extent. The hourly intervals on the time slider can be animated automatically by clicking on the play/pause button or stepped through manually by the user. The example in Figure 7 shows a date range queried from March 15th until March 16th, 2021 with the time slider set to the hour of 6 a.m. Mountain Time. The results show the entire transit network with an overall average on‐time performance of 85%, vehicle speed of 29 km/h, and an arrival time at the stops of −2 s.
The date range data can be queried using the optional parameters of day of the week (e.g., Mondays, Wednesdays) and/or the selection of a specific bus route to explore more granular spatiotemporal patterns. For example, a user may want to investigate transit reliability during Wednesday rush hour between March 15th and April 15th, 2021, to investigate temporal variations caused by localized weather, which Calgary is known for due to its high altitude and chinook winds (Gough, 2008; Hasan & Barker, 1999; Zhou et al., 2017). Alternatively, the selection of a specific bus route has the capability to identify spatiotemporal patterns along its length over the prescribed time period, to identify intra‐route performance variations. These insights can help transit planners identify potential causes of performance degradation, such as traffic congestion, the number of traffic signals/intersections and left turns, and the location of known traffic incidents.
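A query of this kind maps naturally onto a MongoDB filter document. The sketch below only builds the filter; the stored field names (`observed_hour`, `weekday`, `hour`, `route_id`) and the function name are hypothetical, and it assumes the weekday and hour were written as separate fields when the metrics were stored.

```python
from datetime import datetime
from typing import List, Optional


def build_metric_filter(start: datetime, end: datetime,
                        route_id: Optional[str] = None,
                        weekdays: Optional[List[int]] = None,
                        hour: Optional[int] = None) -> dict:
    """Build a MongoDB filter for a date range with optional refinements.

    start is inclusive and end is exclusive. weekdays uses 0 = Monday,
    matching datetime.weekday().
    """
    query: dict = {"observed_hour": {"$gte": start, "$lt": end}}
    if route_id is not None:
        query["route_id"] = route_id
    if weekdays:
        query["weekday"] = {"$in": sorted(weekdays)}
    if hour is not None:
        query["hour"] = hour
    return query


# Wednesdays at 5 p.m. between March 15th and April 15th, 2021 (inclusive).
wednesday_rush = build_metric_filter(
    datetime(2021, 3, 15), datetime(2021, 4, 16), weekdays=[2], hour=17
)
print(wednesday_rush)
```

The resulting dictionary can be passed to a PyMongo `find` call on the metrics collection; keeping the filter construction separate makes it straightforward to check without a database connection.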
Figure 8 illustrates the path of bus route 149‐1490022 and its on‐time performance during the hour of 6 a.m.–7 a.m. on March 15th, 2021. Poor reliability performance (shown in red) appears to be related to traffic signal intersections and the number of left turns (two) that the bus passes through. Traffic congestion is another likely cause of the poor performance, as the bus path goes through several traffic signal intersections that connect to highway entrance and exit ramps. The workflow's data harvesting, processing, and visualization are continuous; however, using the dashboard to query the data for particular timeframes and along particular routes demonstrates its potential to help planners identify and diagnose reliability issues across the PT network.
The dashboard will be further enhanced to show crosstabulation of descriptive analytics as requested by the user. This will allow transit planners to visualize PT reliability patterns across the network at different temporal and spatial scales, in order to identify particular problem areas. These data visualizations are currently generated outside the dashboard and include how consistently reliable the PT network is over time (Figure 9), seasonal variations in reliability (Figures 10 and 11), and which routes are performing the best (Figure 12) and worst (Figure 13) over time. Examples of these data visualization outputs are presented here for data collected during the study period reported in 2021.
Figure 9 shows that for the week of March 15th through the 22nd, 2021, the bus system functioned better (i.e., more than 80% of buses arriving on‐time) in the morning extended rush hour period (6–10 a.m.) than in the mid‐afternoon (2–4 p.m.) time period. This pattern is consistent across all days, and is likely attributable to localized traffic conditions (e.g., congestion, commute behaviors).
Figure 10 shows seasonal variations in on‐time performance of the transit network from March to August 2021. Interestingly, for a city with abrupt and significant weather variations, there are no apparent patterns in performance, other than a general degradation in reliability during the weeks of June 14–18th and August 16–20th. No specific local events, such as the annual Calgary Stampede, took place during these periods, but they do broadly coincide with changes in local COVID‐19 lockdown procedures (CBC News, 2021; CTV News Calgary, 2021). Transportation and PT data during the COVID‐19 pandemic have been anomalous (Statistics Canada, 2021), which may account for the Calgary Transit service degradations seen in the summer of 2021.
Figure 11 shows the overall average of performance on an hourly basis on each day of the week from March to August 2021. Performance patterns are consistent and show that the transit network is less reliable on weekdays in the afternoon, particularly on Wednesdays and Fridays. This suggests that transit planning efforts might be productively focused on improving service during these time periods. The cause of the poor reliability performance on both of these weekdays, especially Fridays, is likely commuter behavior and the resulting congestion (Maciag, 2012). Possible interventions to lessen these types of temporal service disruptions include transit scheduling techniques (e.g., temporarily increasing the frequency of bus services), implementing signal preemption for PT vehicles, temporarily adjusting routes (e.g., no left turns, fewer stops), and/or implementing disincentives for the use of private vehicles (e.g., congestion charge, parking costs) (Boston Region Metropolitan Planning Organization, 2018; State of Florida Department of Transportation, 2008).
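The hour‐by‐weekday summary behind a chart such as Figure 11 can be approximated with a simple grouping step. The sketch below is illustrative only: the on‐time window (`EARLY_LIMIT_S`, `LATE_LIMIT_S`), the sign convention (observed minus scheduled arrival, in seconds), and the sample observations are assumptions, not the article's definitions or data.

```python
from collections import defaultdict
from datetime import datetime

# Assumed on-time window in seconds (observed minus scheduled arrival).
# These thresholds are placeholders, not the article's definition.
EARLY_LIMIT_S = -60
LATE_LIMIT_S = 300


def on_time_share(observations):
    """Map (ISO weekday, hour) -> percentage of arrivals inside the on-time window.

    observations: iterable of (arrival datetime, deviation in seconds).
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [on-time count, total count]
    for arrived_at, deviation_s in observations:
        key = (arrived_at.isoweekday(), arrived_at.hour)
        counts[key][1] += 1
        if EARLY_LIMIT_S <= deviation_s <= LATE_LIMIT_S:
            counts[key][0] += 1
    return {key: 100.0 * ok / total for key, (ok, total) in counts.items()}


# Made-up observations for Wednesday, March 17th, 2021.
sample = [
    (datetime(2021, 3, 17, 15, 5), 420),   # afternoon, late
    (datetime(2021, 3, 17, 15, 40), 30),   # afternoon, on time
    (datetime(2021, 3, 17, 7, 10), -20),   # morning, on time
]
shares = on_time_share(sample)
```

Keying on the pair (weekday, hour) yields one cell per heat‐map square, so afternoon Wednesday and Friday cells can be compared directly against the morning cells.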
Figures 9 through 11 focus on temporal variation in the on‐time performance of buses for the entire transit network in Calgary. More specifically, Figures 12 and 13 limit the focus to the 20 best and 20 worst performing transit routes, respectively, including their overall average on‐time performance as well as the performance across their daily operating hours. Both figures depict the week of March 15th–22nd, 2021 as an example.
In both figures, there is significant variance in performance across routes and over the operating hours. In some cases, even the overall best performing routes (Figure 12) show periods of low reliability. For instance, bus route 44‐440015 has perfect on‐time performance between 6 and 9 a.m., but within the hour starting at 10 a.m. the performance is closer to 0%. Similar anomalies appear among the worst overall performing routes, some of which show instances of high reliability, such as bus routes 78‐780066 and 406‐4060054. In one case, two buses that operate on the same route but in opposite directions show markedly different variability. This can be seen with bus routes 21‐210021 (inbound, worst) and 55‐550021 (outbound, better) between operating hours 7 and 10 a.m. (Figure 13). A plausible explanation is the direction of high‐demand traffic during morning hours.
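A ranking of the kind shown in Figures 12 and 13, together with a simple measure of within‐day variation, can be sketched as follows. The route identifiers and percentages are invented for illustration, and `rank_routes` and `hourly_spread` are hypothetical helper names rather than functions from the published workflow.

```python
def rank_routes(route_scores, n=20):
    """Return (best, worst) lists of (route_id, on-time %) pairs, each of length <= n."""
    if n <= 0:
        raise ValueError("n must be positive")
    best = sorted(route_scores.items(), key=lambda item: item[1], reverse=True)[:n]
    worst = sorted(route_scores.items(), key=lambda item: item[1])[:n]
    return best, worst


def hourly_spread(hourly_scores):
    """Range (max minus min) of a route's hourly on-time percentages."""
    values = list(hourly_scores.values())
    return max(values) - min(values)


# Invented overall scores for four hypothetical routes.
overall = {"route_a": 92.0, "route_b": 61.5, "route_c": 78.0, "route_d": 88.5}
best, worst = rank_routes(overall, n=2)

# Invented hourly profile: reliable early, then a sharp drop at 10 a.m.
spread = hourly_spread({6: 100.0, 7: 100.0, 8: 100.0, 10: 5.0})
```

A large spread for a route that ranks well overall flags the kind of anomaly described above, where a generally reliable route has an isolated hour of very poor performance.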
From a transit planning perspective, almost all of the bus routes in Figures 12 and 13 show consistent reliability over time. This provides planners with the opportunity to investigate why some routes are consistently more reliable than others across both time and space. Possible explanations can be derived from the outputs of the existing workflow including length and sinuosity of the routes, average distance between consecutive bus stops, actual headway, and average travel speed.
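Two of the route descriptors mentioned above, sinuosity and average distance between consecutive stops, can be computed from coordinates alone. The sketch below assumes a common definition of sinuosity (path length divided by the straight‐line distance between the endpoints) and uses great‐circle distances; the article does not state which definition or distance measure the workflow uses, and the coordinates are invented.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def path_length_m(points):
    """Sum of the segment lengths along an ordered list of points."""
    return sum(haversine_m(p, q) for p, q in zip(points, points[1:]))


def sinuosity(points):
    """Path length divided by endpoint-to-endpoint distance (1.0 = straight)."""
    straight = haversine_m(points[0], points[-1])
    if straight == 0:
        raise ValueError("endpoints coincide; sinuosity is undefined for a loop")
    return path_length_m(points) / straight


def mean_stop_spacing_m(stops):
    """Average distance between consecutive stops, measured stop to stop."""
    if len(stops) < 2:
        raise ValueError("need at least two stops")
    return path_length_m(stops) / (len(stops) - 1)


# Invented coordinates near Calgary's latitude.
straight_route = [(51.00, -114.00), (51.01, -114.00), (51.02, -114.00)]
bent_route = [(51.00, -114.00), (51.01, -114.00), (51.01, -113.99)]
```

Values near 1 indicate a nearly straight route, while larger values indicate more turning, which is one candidate explanation for lower reliability.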
Other possible explanations, such as weather, traffic incidents, number of stop signs, traffic signals, and left turns can be extracted and analyzed from publicly available data sources. An illustration of the impact of these factors can be seen in Figure 8. It displays the third least reliable route (bus route 149‐1490022) with nearly every operating hour in the red. Based on the map, it is clear that the path (west to northeast) of the bus route passes through three major traffic intersections and has to make three left turns. A predictive model, such as a random forest regression, will be implemented in a future iteration of the dashboard to validate the impact of the factors identified above. The predictive outputs can then be used to inform the route planning and implementation practice within a transit agency. In particular, the results can be used to adjust the schedule to align expectations and improve reliability. Such adjustments can be implemented in a transit app to inform potential riders of the per stop and transit route historic probability of the bus arriving late (or early) from the scheduled time, the new expected time to arrive, and the future probability in near‐real time of the bus being late/early.
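The historic per‐stop probability of a late or early arrival mentioned above can be estimated, in its simplest form, as an empirical frequency. The sketch below is not the random forest model proposed for future work; it is a plain frequency count, and the thresholds, record layout, and identifiers are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def arrival_probabilities(records, early_limit_s=-60, late_limit_s=300):
    """Empirical early / on-time / late shares per (route_id, stop_id).

    records: iterable of (route_id, stop_id, deviation_s), where deviation_s is
    assumed to be observed minus scheduled arrival in seconds. The default
    thresholds are placeholders, not the article's definition.
    """
    tallies = defaultdict(Counter)
    for route_id, stop_id, deviation_s in records:
        if deviation_s < early_limit_s:
            label = "early"
        elif deviation_s > late_limit_s:
            label = "late"
        else:
            label = "on_time"
        tallies[(route_id, stop_id)][label] += 1

    result = {}
    for key, counter in tallies.items():
        total = sum(counter.values())
        result[key] = {
            label: counter[label] / total
            for label in ("early", "on_time", "late")
        }
    return result


# Invented history for one hypothetical route and stop.
history = [
    ("route_x", "stop_1", 400),
    ("route_x", "stop_1", 650),
    ("route_x", "stop_1", 20),
    ("route_x", "stop_1", -120),
]
probabilities = arrival_probabilities(history)
```

Frequencies of this kind could serve as a baseline against which a future predictive model is compared.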
The interactive dashboard (Figures 7 and 8) and the accompanying summary graphics (Figures 9–13) discussed in this section demonstrate the potential of the developed workflow to generate meaningful insights to inform transit planners. Collectively, the use of parallel processing and the spatial and data visualizations parse large volumes of complex data to create accurate metrics about the reliability performance of the PT service from the system level, down to individual transit stops. Future work will build upon and extend the visualization capabilities and dashboard described in this article to include additional metrics, and the aforementioned predictive analytic capabilities.
CONCLUSIONS
The primary goal of the research described in this article was to develop a method for harvesting high‐volume, high‐frequency GTFS‐RT data, and processing it into reliability metrics to assist PT agencies to monitor and improve their transit systems. From a technological perspective, an important sub‐goal of the work was to develop a workflow to harvest and process these data quickly and efficiently using emerging cloud computing and parallel processing approaches.
When working with high‐frequency and high‐volume data, there is always a challenge to make sense out of the complexity, so that findings can be understood and applied to both system‐wide and route‐specific improvements. There are two primary ways to reduce the complexity of the data so that planners can more effectively use it in their decision‐making. The first is visualization of the data, using spatial and data visualizations to reveal patterns and relationships. Ideally, the visualization will include a form of interactivity such that the patterns and possible relationships are revealed through active interrogation of the visualization. This article presented the development of an interactive Web‐based dashboard for spatial visualization of the transportation system from the system level down to the individual stop level, combined with data visualizations of metrics of reliability. Development and improvement of this dashboard will continue in future research.
The second method for reducing data complexity is through statistical approaches for reducing dimensionality. While this article focuses on descriptive visualizations of reliability trends, the use of parametric techniques, including exploratory, predictive, and AI‐based modeling, presents a compelling avenue for future research, especially since the parallel processing workflow outputs described herein should facilitate their development.
One of the ancillary goals of this research was to make the developed tools openly available, so that transit agencies and other researchers can adapt and extend the workflow to meet their own needs. To that end, the workflow is documented, and all of the associated code is freely available at: https://github.com/highered‐esricanada/Parallel_GTFS_Workflow.
One clear trajectory for future research and development is to extend the workflow to include more of the reliability metrics identified by Currie et al. (2012) and described in Table 1. This would both expand the breadth of the available metrics for assessing the PT service network and, potentially, improve the accuracy of the existing on‐time performance metric. Additionally, the availability of data sources beyond AVL, such as ridership or rider survey data and associated network traffic data, could significantly improve the diagnostic capabilities of the workflow.
Another avenue for further research is to incorporate data on external factors, such as weather and events, and internal factors such as route characteristics (e.g., sinuosity, length) to correlate reliability issues with their possible causes. This would allow for the development of predictive analytics to anticipate reliability problems (including potential problems for new routes) so that they can be addressed proactively. As an extension of the predictive analytics, a transit app could be implemented in the future with the ability to display the original expected time schedule and its historic probability of a late/early arrival compared with the current (i.e., adjusted) expected time with the near real‐time probability of being late/early based on current conditions.
Overall, the parallel processing workflow presented in this article represents an important contribution of this research to the literature. It provides a significant advancement in the processing and management of high‐volume, high‐frequency GTFS‐RT data. The workflow can facilitate the future research directions identified above and provide other researchers and transit agencies with an effective basis for developing their own methods for evaluating transit systems.
ACKNOWLEDGMENTS
The authors wish to thank Dr. Steven Farber and two anonymous reviewers for strengthening this research.
CONFLICT OF INTEREST
All authors declare that they have no conflicts of interest.
DATA AVAILABILITY STATEMENT
The data that support the findings of this study are openly available in Parallel‐GTFS‐Workflow at https://github.com/highered‐esricanada/Parallel‐GTFS‐Workflow.
REFERENCES
Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the Spring Joint Computer Conference, Atlantic City, NJ (pp. 483 – 485). New York, NY : ACM.
Anagnostopoulos, C. N. E., Anagnostopoulos, I. E., Loumos, V., & Kayafas, E. (2006). A license plate‐recognition algorithm for intelligent transportation system applications. IEEE Transactions on Intelligent Transportation Systems, 7 (3), 377 – 392. https://doi.org/10.1109/TITS.2006.880641
Arriagada, J., Gschwender, A., Munizaga, M. A., & Trépanier, M. (2019). Modeling bus bunching using massive location and fare collection data. Journal of Intelligent Transportation Systems, 23 (4), 332 – 344. https://doi.org/10.1080/15472450.2018.1494596
Bast, H., Brosi, P., & Storandt, S. (2014). Real‐time movement visualization of public transit data. In Proceedings of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Dallas, TX (pp. 331 – 340). New York, NY : ACM. https://doi.org/10.1145/2666310.2666404
Battarra, R., Gargiulo, C., Tremiterra, M., & Zucaro, F. (2018). Smart mobility in Italian metropolitan cities: A comparative analysis through indicators and actions. Sustainable Cities and Society, 41, 556 – 567. https://doi.org/10.1016/j.scs.2018.06.006
Boston Region Metropolitan Planning Organization. (2018). Transit signal priority in the Boston region: A guidebook. Retrieved from https://www.ctps.org/data/calendar/pdfs/2019/TSP‐Guidebook.pdf
Bueti, C., & Faulkner, D. (2013). ICTs as a key technology to help countries adapt to the effects of climate change. Washington, DC : World Resources Institute.
Cats, O., & Jenelius, E. (2014). Dynamic vulnerability analysis of public transport networks: Mitigation effects of real‐time information. Networks and Spatial Economics, 14 (3), 435 – 463. https://doi.org/10.1007/s11067‐014‐9237‐7
CBC News. (2021). Here are the latest COVID‐19 statistics for Alberta and what they mean. Retrieved from https://www.cbc.ca/news/canada/calgary/alberta‐covid‐19‐data‐statistics‐numbers‐cases‐hospitalizations‐1.5514947
Chakrabarti, S., & Giuliano, G. (2015). Does service reliability determine transit patronage? Insights from the Los Angeles Metro bus system. Transport Policy, 42, 12 – 20. https://doi.org/10.1016/j.tranpol.2015.04.006
Cohen‐Blankshtain, G., & Rotem‐Mindali, O. (2016). Key research themes on ICT and sustainable urban mobility. International Journal of Sustainable Transportation, 10 (1), 9 – 17. https://doi.org/10.1080/15568318.2013.820994
Cramer, A., Cucarese, J., Tran, M., Lu, A., & Reddy, A. (2009). Performance measurements on mass transit: Case study of New York City transit authority. Transportation Research Record, 2111 (1), 125 – 138. https://doi.org/10.3141/2111‐15
CTV News Calgary (2021). Potentially heavy showers on the way for Calgary. Retrieved from https://calgary.ctvnews.ca/potentially‐heavy‐showers‐on‐the‐way‐for‐calgary‐1.5548638
Currie, G., Douglas, N. J., & Kearns, I. (2012). An assessment of alternative bus reliability indicators. In Australasian Transport Research Forum 2012 Proceedings, Perth, Australia (pp. 1 – 20).
Desaulniers, G., & Hickman, M. (2007). Chapter 2 Public Transit. In C. Barnhart & G. Laporte (Eds.), Handbooks in operations research and management science (Vol. 14, pp. 69 – 127). Amsterdam : Elsevier.
Figliozzi, M. A., Feng, W., Lafferriere, G., & Feng, W. (2012). A study of headway maintenance for bus routes: Causes and effects of "bus bunching" in extensive and congested service areas. Portland, OR : Transportation Research Education Center. Retrieved from https://rosap.ntl.bts.gov/view/dot/24701
Google. (2021). Google transit APIs. Retrieved from https://developers.google.com/transit/gtfs‐realtime/reference
Gough, W. A. (2008). Theoretical considerations of day‐to‐day temperature variability applied to Toronto and Calgary, Canada data. Theoretical and Applied Climatology, 94, 97 – 105. https://doi.org/10.1007/s00704‐007‐0346‐9
Harsha, M. M., Mulangi, R., & Kumar, D. (2019). Analysis of bus travel time variability using automatic vehicle location data. Transportation Research Procedia, 48, 3283 – 3298. https://doi.org/10.1016/j.trpro.2020.08.123
Hasan, Y., & Barker, D. (1999). The impact of unseasonable or extreme weather on traffic activity within Lothian region, Scotland. Journal of Transport Geography, 7, 209 – 213. https://doi.org/10.1016/S0966‐6923(98)00047‐7
Hu, W., & Shalaby, A. (2017). Use of automated vehicle location data for route‐ and segment‐level analyses of bus route reliability and speed. Transportation Research Record, 2649, 9 – 19. https://doi.org/10.3141/2649‐02
Kunama, N., Worapan, M., Phithakkitnukoon, S., & Demissie, M. (2017). GTFS‐Viz: Tool for preprocessing and visualizing GTFS data. In Proceedings of the 2017 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2017 ACM International Symposium on Wearable Computers, Maui, HI (pp. 388 – 396). New York, NY : ACM. https://doi.org/10.1145/3123024.3124415
Levinson, H. (1991). Supervision strategies for improved reliability of bus routes. Washington, DC : Transportation Research Board.
Luo, Q., Gee, M., Piccoli, B., Work, D., & Samaranayake, S. (2022). Managing public transit during a pandemic: The trade‐off between safety and mobility. Transportation Research Part C: Emerging Technologies, 138, 103592. https://doi.org/10.1016/j.trc.2022.103592
Maciag, M. (2012). The best days to commute in metro areas. Retrieved from https://www.governing.com/archive/best‐days‐to‐commute‐drive‐metro‐areas.html
Mazloumi, E., Currie, G., & Rose, G. (2009). Using GPS data to gain insight into public transport travel time variability. Journal of Transportation Engineering, 136 (7), 623 – 631. https://doi.org/10.1061/(ASCE)TE.1943‐5436.0000126
Mesbah, M., Lin, J., & Currie, G. (2015). "Weather" transit is reliable? Using AVL data to explore tram performance in Melbourne, Australia. Journal of Traffic and Transportation Engineering, 2 (3), 125 – 135. https://doi.org/10.1016/j.tte.2015.03.001
Morfoulaki, M., Myrovali, G., & Kotoula, K. (2015). Increasing the attractiveness of public transport by investing in soft ICT based measures: Going from words to actions under an austerity backdrop—Thessaloniki's case, Greece. Research in Transportation Economics, 51, 40 – 48. https://doi.org/10.1016/j.retrec.2015.07.006
OECD. (2021, February 22). COVID‐19 and a new resilient infrastructure landscape. Paris, France : Organisation for Economic Co‐operation and Development. Retrieved from https://read.oecd‐ilibrary.org/view/?ref=1060_1060483‐4roq9lf7eu&title=COVID‐19‐and‐a‐new‐resilient‐infrastructure‐landscape
Pan American Health Organization (2016). Maintenance of essential services. Retrieved from https://www.paho.org/disasters/dmdocuments/RespToolKit%5f24%5fTool%2016%5fMaintenanceofEssentialServices.pdf
Pee, L. G., & Pan, S. (2021). Climate‐intelligent cities and resilient urbanisation: Challenges and opportunities for information research. International Journal of Information Management, 63, 102446. https://doi.org/10.1016/j.ijinfomgt.2021.102446
Perk, V., Flynn, J., & Volinski, J. (2008). Transit ridership, reliability, and retention (Publication NCTR‐776‐07). Tallahassee, FL : Florida Department of Transportation.
Politis, I., Papaioannou, P., Basbas, S., & Dimitriadis, N. (2010). Evaluation of a bus passenger information system from the users' point of view in the city of Thessaloniki, Greece. Research in Transportation Economics, 29, 249 – 255. https://doi.org/10.1016/j.retrec.2010.07.031
Polus, A. (1978). Modeling and measurements of bus service reliability. Transportation Research, 12 (4), 253 – 256. https://doi.org/10.1016/0041‐1647(78)90067‐9
Prommaharaj, P., Phithakkitnukoon, S., Demissie, M., Kattan, L., & Ratti, C. (2020). Visualizing public transit system operation with GTFS data: A case study of Calgary, Canada. Heliyon, 6 (4), e03729. https://doi.org/10.1016/j.heliyon.2020.e03729
Sneader, K., & Lund, S. (2020). COVID‐19 and climate change expose dangers of unstable supply chains. Los Angeles, CA : McKinsey Global Institute.
Snellen, D., & Hollander, G. (2017). ICT's change transport and mobility: Mind the policy gap! Transportation Research Procedia, 26, 3 – 12. https://doi.org/10.1016/j.trpro.2017.07.003
State of Florida Department of Transportation. (2008). Transit ridership, reliability, and retention. Retrieved from https://www.nctr.usf.edu/pdf/77607.pdf
Statistics Canada (2021). Public transit in a post‐COVID‐19 Canada. Retrieved from https://www150.statcan.gc.ca/n1/en/pub/45‐28‐0001/2021001/article/00030‐eng.pdf?st=duthQO3l
Stewart, C., Diab, E., Bertini, R., & El‐Geneidy, A. (2016). Perspectives on transit: Potential benefits of visualizing transit data. Transportation Research Record, 2544 (1), 90 – 101. https://doi.org/10.3141/2544‐11
Strathman, J., & Hopper, J. (1993). Empirical analysis of bus transit on‐time performance. Transportation Research Part A—Policy and Practice, 27, 93 – 100. https://doi.org/10.1016/0965‐8564(93)90065‐S
Tirachini, A., Godachevich, J., Cats, O., Muñoz, J., & Soza‐Parra, J. (2021). Headway variability in public transport: A review of metrics, determinants, effects for quality of service and control strategies. Transport Reviews, 1 – 25. Epub ahead of print. https://doi.org/10.1080/01441647.2021.1977415
United Nations. (2015). Information and communication technology for urban climate action. Retrieved from https://unhabitat.org/sites/default/files/download‐manager‐files/Information%20and%20Communication%20Technology%20for%20Urban%20Climate%20Action.pdf
Wang, L., Xue, X., Zhao, Z., & Wang, Z. (2018). The impacts of transportation infrastructure on sustainable development: Emerging trends and challenges. International Journal of Environmental Research and Public Health, 15 (6), 1172. https://doi.org/10.3390/ijerph15061172
Wessel, N., Allen, J., & Farber, S. (2017). Constructing a routable retrospective transit timetable from a real‐time vehicle location feed and GTFS. Journal of Transport Geography, 62, 92 – 97. https://doi.org/10.1016/j.jtrangeo.2017.04.012
Wessel, N., & Farber, S. (2019). On the accuracy of schedule‐based GTFS for measuring accessibility. The Journal of Transport and Land Use, 12 (1), 475 – 500. https://doi.org/10.5198/jtlu.2019.1502
Wood, D. (2015). A framework for measuring passenger‐experienced transit reliability using automated data. Department of Civil and Environmental Engineering, MIT. Retrieved from https://dspace.mit.edu/bitstream/handle/1721.1/99539/924822213‐MIT.pdf;sequence=1
Zhao, Y. (1997). Vehicle location and navigation systems. Norwood, MA : Artech House.
Zhou, M., Wang, D., Li, Q., Yue, Y., Tu, W., & Cao, R. (2017). Impacts of weather on public transport ridership: Results from mining data from different sources. Transportation Research Part C: Emerging Technologies, 75, 17 – 29. https://doi.org/10.1016/j.trc.2016.12.001
By Anastassios Dardas; Brent Hall; Jon Salter and Hossein Hosseini