Result: A geospatial workflow for the assessment of public transit system performance using near real‐time data

Title:

A geospatial workflow for the assessment of public transit system performance using near real‐time data

Authors:

Anastassios Dardas, Brent Hall, Jon Salter, Hossein Hosseini

Source:

Transactions in GIS. 26:1642-1664

Publisher Information:

Wiley, 2022.

Publication Year:

2022

Subject Terms:

0502 economics and business, 05 social sciences, 11. Sustainability

Document Type:

Journal Article

Language:

English

ISSN:

1467-9671
1361-1682

DOI:

10.1111/tgis.12942

Rights:

Wiley Online Library User Agreement

Accession Number:

edsair.doi...........86b685766a3a91f9cc768de7248a0df4

Database:

OpenAIRE

Further Information

This article presents the development of a Geographical Information Systems (GIS) workflow that harvests high‐volume and high‐frequency near real‐time data from a public General Transit Feed Specification (GTFS) and calculates metrics for the assessment of on‐time and route speed performance for a public transit system. The approach is applied to near real‐time and static GTFS data collected over a 9‐month period for the City of Calgary, Alberta, Canada. The workflow uses two Azure Virtual Machines (VMs), one to harvest the data and the other to process observations in parallel using Python and the ArcGIS API libraries. A Web GIS application is described that queries data from MongoDB to visualize the performance results in spatiotemporal form. The purpose of the workflow and Web GIS application is to provide actionable information to transit planners to improve public transportation systems. The data management and analysis workflow is transferable to similar GTFS data from other cities.



INTRODUCTION

Essential services are defined as basic necessities that are accessible to the general public, and that contribute to their well‐being and convenience. Examples of essential services include healthcare, response units (e.g., police, fire, EMS), clean water, utilities, sanitation, communications, vital goods (e.g., fuel, groceries), and transportation (Pan American Health Organization, 2016). Together, these interconnected systems form an infrastructure that sustains modern life. The COVID‐19 pandemic has revealed the highly fragile nature of such essential services and has reminded societies around the world that, in the face of adversity, system adaptation and resilience are imperative (OECD, 2021; Sneader & Lund, 2020). One important infrastructure component is public transit (Luo, Gee, Piccoli, Work, & Samaranayake, 2022).

All public transit agencies strive to provide a good quality of service to the populations they serve by enabling passengers to travel smoothly at an affordable fare (Desaulniers & Hickman, 2007). However, achieving this goal requires a delicate balance between asset and operational costs while meeting strategic, operational, and tactical planning goals (Desaulniers & Hickman, 2007). Strategic planning involves maximizing service quality under budgetary constraints, whereas operational concerns focus on achieving service delivery targets while minimizing costs. Tactical planning, on the other hand, seeks to improve service quality by resolving the details of transit services, such as route definition, the frequency of service, and setting departure and arrival schedules (Desaulniers & Hickman, 2007). In the face of these challenges, many transit agencies are seeking solutions to make their infrastructure more resilient and adaptable. One emerging approach is the development and implementation of information and communication technologies (ICT) to manage and plan the replacement of aging infrastructure (Bueti & Faulkner, 2013; Pee & Pan, 2021; United Nations, 2015). The approach discussed in this article addresses both of these needs. Specifically, it develops a replicable workflow that harvests and analyzes routinely collected and publicly accessible urban bus service data and presents these data in a web GIS application that allows current bus service efficiency and reliability to be assessed. The workflow, metric calculation, and visualization support an assessment of system reliability that informs both current management and future system planning. The longer‐term goal of this research is to provide a straightforward means for transit agencies to improve their existing transit network reliability and service provision to members of the public.

The article first reviews recent research on public transit services, highlighting the need for assessment of service efficiency and reliability. A replicable workflow is then presented to collect the data required for service assessment, using the city of Calgary as a case study. The processing steps in this workflow are explained in detail, along with the resulting datastore. The data are then analyzed to highlight system performance over both time and space. A visualization dashboard is presented that allows transportation planners to query system performance and quickly identify areas where reliability and efficiency of service performance are sub‐optimal for members of the public. The article concludes with a summary of its major contributions, and suggestions for future research. Documentation for the application and associated data processing source code in the form of a modifiable Python package is made available on Github via the URL presented in the conclusion.

LITERATURE REVIEW

Public transit (PT) systems are an essential service in modern cities. They provide an affordable mobility option for urban residents, reduce negative environmental impacts (e.g., traffic congestion), and have the potential to stimulate economic growth through improved access to commercial services (Cats & Jenelius, 2014; Wang, Xue, Zhao, & Wang, 2018). These positive contributions require that a PT system is efficient, robust, adaptable, and reliable. An efficient system must be affordable, convenient, accessible, and comfortable for transit users. Robustness requires resilience to disturbances along the transportation network, including infrastructure repair and vehicle malfunctions (Cats & Jenelius, 2014). Adaptability implies that the system is subject to evaluation and change to improve both its efficiency and robustness. Collectively, the attributes of efficiency, robustness, and adaptability also contribute to the overall reliability of a PT system.

Reliability and speed are, arguably, the most significant measures of transit performance for transit agencies and passengers (Hu & Shalaby, 2017). From a transit agency's perspective, having an unreliable and/or slow PT system negatively impacts ridership and operational costs (Hu & Shalaby, 2017; Perk, Flynn, & Volinski, 2008; Polus, 1978). This can lead to a negative feedback loop for PT systems by forcing passengers to seek alternatives, usually personal vehicles, which increases the number of cars on the road and exacerbates traffic congestion, further reducing the reliability of PT services (Arriagada, Gschwender, Munizaga, & Trépanier, 2019; Figliozzi, Feng, Lafferriere, & Feng, 2012; Tirachini, Godachevich, Cats, Muñoz, & Soza‐Parra, 2021). Additionally, poor reliability often leads to bus bunching (i.e., little to no gap between consecutive buses), excessive travel time, and passenger overcrowding in vehicles (Hu & Shalaby, 2017). Levinson (1991) supports this observation in noting that headway variation (i.e., less or more bus bunching) is the primary characteristic of an unreliable PT service.

Poor reliability of PT can also be caused by a combination of internal and external factors. Internal factors include the complexity of the system, characterized by factors such as sinuosity, the length of bus routes, the number of stops, and the spacing of boardings (Strathman & Hopper, 1993). External factors include weather, disruptions from on‐street parking, road maintenance, signal timing, and traffic congestion and incidents (Strathman & Hopper, 1993). Transit managers can improve PT reliability, and address both internal and external factors through effective decision‐making based on appropriate transit metrics and analysis (Cramer, Cucarese, Tran, Lu, & Reddy, 2009; Wood, 2015).

Effective implementation of ICTs in transportation to monitor performance and diagnose problems has the potential to make an existing PT network more efficient, robust, adaptable, and reliable without the need to alter the physical network itself (Battarra et al., 2018; Cohen‐Blankshtain & Rotem‐Mindali, 2016; Snellen & Hollander, 2017; Zhao, 1997). This use of ICTs is generally known as intelligent transportation systems (ITS), which can be subdivided into two categories, namely intelligent infrastructure systems (IIS) and intelligent vehicle systems (IVS). IIS focuses on operational perspectives of the system (e.g., PT operations, road management), whereas IVS focuses on the user's perspectives of the system (e.g., safety, productivity) (Anagnostopoulos, Anagnostopoulos, Loumos, & Kayafas, 2006; Cohen‐Blankshtain & Rotem‐Mindali, 2016). Both perspectives have the potential to alter mobility patterns by optimizing a PT system's capacity and influencing residents' travel behavior through the calculation of different forms of transit metrics (Cohen‐Blankshtain & Rotem‐Mindali, 2016; Morfloulaki, Myrovali, & Kotoula, 2015). The workflow presented in the next section focuses on the IIS component of an ITS, in particular, its impacts on transit reliability.

Transit metrics can be derived from a variety of data sources including automatic vehicle location (AVL) tracking, automated passenger counts (APC), automatic fare collection (AFC) devices, real‐time passenger information (RTPI) at bus stops and on‐board, and vehicle operational data gathering and archiving systems (Politis, Papaioannou, Basbas, & Dimitriadis, 2010). AVLs are commonly used for analysis purposes given the relative ease to install global positioning system (GPS) devices on PT vehicles. Vehicle GPS allow the capture of real‐time spatiotemporal data that can be used to assess reliability under different traffic conditions (Harsha, Mulangi, & Kumar, 2019; Mazloumi, Currie, & Rose, 2009). Additionally, AVL can be combined with RTPI systems, to improve passenger satisfaction by minimizing actual wait times and providing passengers with easily accessible information on projected vehicle arrival and departure times (Politis et al., 2010).

Based on these factors, many research studies have used AVL data capture to explore transit reliability. For instance, Harsha et al. (2019) used AVL to analyze travel time distribution at different spatiotemporal scales. Mesbah, Lin, and Currie (2015) acquired a historical AVL dataset to examine the effect of weather conditions on the travel time reliability of Melbourne's streetcar network. Chakrabarti and Giuliano (2015) found service reliability to be a significant indicator of patronage through the combined use of AVL and APC data for the Los Angeles Metro bus transit system.

Collectively, AVL has enabled transit analysts to measure PT reliability based on one or more of the common metrics given in Table 1 (after Currie, Douglas, & Kearns, 2012).

TABLE 1 Common transit metrics to measure PT reliability

<table><thead valign="top"><tr><th align="left">Metric</th><th align="left">Description & purpose</th><th align="left">Data source(s)</th><th align="left">Ease of use</th><th align="left">Relevant for the paper's workflow</th></tr></thead><tbody><tr><td align="left">% of buses cancelled</td><td align="left">The percentage of buses cancelled before or during their trip</td><td align="left">AVL & Scheduled Timetable</td><td align="left">Intermediate</td><td align="left">Yes (potentially)</td></tr><tr><td align="left" colspan="5">Purpose: Relevant to extra wait time, bus bunching, and headway inconsistencies</td></tr><tr><td align="left">% of service departing on‐time</td><td align="left">The percentage of buses that depart on‐time from their respective transit stop</td><td align="left">AVL & Scheduled Timetable</td><td align="left">Intermediate</td><td align="left">Yes (potentially)</td></tr><tr><td align="left" colspan="5">Purpose: Indicates punctuality mainly for infrequent services (i.e., >10 min)</td></tr><tr><td align="left">% of service arriving on‐time</td><td align="left">The percentage of buses that arrive on‐time at the stop and/or terminus</td><td align="left">AVL & Scheduled Timetable</td><td align="left">Easy</td><td align="left">Yes (main focus)</td></tr><tr><td align="left" colspan="5">Purpose: Indicates punctuality</td></tr><tr><td align="left">Excess wait time</td><td align="left">The time interval between consecutive buses and how it compares to the scheduled timetable</td><td align="left">AVL & Scheduled Timetable</td><td align="left">Easy</td><td align="left">Yes (cruder measure)</td></tr><tr><td align="left" colspan="5">Purpose: For frequent services (i.e., <10 min), indicates how reliable and evenly spaced the PT service is</td></tr><tr><td align="left">Average lateness</td><td align="left">The degree of lateness calculated by taking the % of buses late multiplied by the number of minutes late</td><td align="left">AVL & Scheduled Timetable</td><td align="left">Easy</td><td align="left">Yes (cruder measure)</td></tr><tr><td align="left" colspan="5">Purpose: Indicates punctuality</td></tr><tr><td align="left">Service variability indicators</td><td align="left">The variability in departure or on‐board bus times and actual arrival of frequent services can be calculated using the coefficient of variation, variance, and standard deviation</td><td align="left">AVL & Scheduled Timetable</td><td align="left">Intermediate</td><td align="left">Yes (potentially)</td></tr><tr><td align="left" colspan="5">Purpose: Indicates the degree of bus bunching</td></tr><tr><td align="left">Reliability buffer index</td><td align="left">The ratio of 95th percentile commute time over the average commute time</td><td align="left">AVL & Scheduled Timetable & Bluetooth/GPS loggers</td><td align="left">Difficult</td><td align="left">No</td></tr><tr><td align="left" colspan="5">Purpose: Illustrates travel time reliability and enables passengers to build a buffer time into their trip planning</td></tr><tr><td align="left">Passenger ratings of reliability</td><td align="left">Customer satisfaction surveys</td><td align="left">Surveys</td><td align="left">Difficult</td><td align="left">No</td></tr><tr><td align="left" colspan="5">Purpose: Subjective measure on trip ratings, specifically on delays</td></tr><tr><td align="left">Customer complaints</td><td align="left">This measure collects personal PT reliability experiences (e.g., comments)</td><td align="left">Surveys, Social Media (e.g., Twitter) & AVL & Scheduled Timetable</td><td align="left">Difficult</td><td align="left">No</td></tr><tr><td align="left" colspan="5">Purpose: Subjective measure on trips and can be used to correlate with actual on‐time measurements</td></tr><tr><td align="left">Customer journey time delay</td><td align="left">Accounts for the difference between customers' expected and actual travel time from their origin to their destination</td><td align="left">AVL & AFC & Scheduled Timetable & Bluetooth/GPS loggers/Trip Questionnaire</td><td align="left">Difficult</td><td align="left">No</td></tr><tr><td align="left" colspan="5">Purpose: Illustrates the travel time reliability of the entire bus trip</td></tr></tbody></table>

The dependence of the workflow described in the next section on AVL data and scheduled timetables limited the actual number of metrics that could be examined from Table 1 in this research. These include percentage of buses cancelled, percentage of service departing on‐time, percentage of service arriving on‐time, excess wait time, average lateness, and service variability indicators. Of these metrics, the percentage of vehicles arriving on‐time is particularly relevant since it is relatively easy to calculate and interpret and only requires AVL data capture and an arrival schedule, with no need to factor in passenger loads (i.e., boarding and alighting times). Additionally, on‐time performance provides a useful diagnostic for transit agencies to identify which stops on which routes need further assessment. Thus, on‐time performance is used as the measure of PT service reliability for this study.

Numerous previous studies have used GTFS to evaluate transit reliability metrics. For example, Bast, Brosi, and Storandt (2014) developed an efficient solution that displays the current state (e.g., delays, nearby vehicles, transit coverage) of 80 transit networks using GTFS data. Stewart, Diab, Bertini, and El‐Geneidy (2016) collected near real‐time GTFS data to generate performance measures based on their relevance to transit planners, including overlapping service areas, approximate location of each bus every 30 s, wait time at stops, and bus and bicycle‐share systems overlap. Wessel, Allen, and Farber (2017) describe a method that retroactively improves the accuracy of the GTFS using the NextBus API for the Toronto Transit Commission (TTC). This study was extended by Wessel and Farber (2019) to compare the actual versus expected operation of the TTC. Kunama, Worapan, Phithakkitnukoon, and Demissie (2017) developed a desktop tool that preprocesses GTFS data and visualizes animations of it on a transit network map, including the number of public transport vehicles per hour and the selection of routes, trips, and stations. Similarly, Prommaharaj, Phithakkitnukoon, Demissie, Kattan, and Ratti (2020) created a dynamic PT operation visualization tool to display mobility, speed, flow, density, headway, and analysis using static GTFS files provided by Calgary Transit.

This article extends this body of research by developing a replicable workflow tool for parallel processing of near real‐time GTFS data to improve efficiency. This article also documents the development of a web‐based visualization dashboard to interrogate PT operations dynamically, including per route and per hour of day selected over an adjustable date range. This visualization provides a relatively fine‐grained perspective on the functioning and reliability of the system, its constituent bus routes, and individual bus stops to aid transportation planners in assessing the causes of reliability problems. The workflow and visualizations are derived from raw GTFS data downloaded from Calgary Transit's open data feed, as described in the next section.

DATA

The GTFS data standard used in this article is commonly provided through public‐facing web portals by public transit agencies around the world to make their transit data accessible to researchers in an established format that comprises static and real‐time information. International examples of these GTFS feeds include the MBTA (https://www.mbta.com/developers/gtfs-realtime) (Boston), STM (https://developpeurs.stm.info/documentation/gtfsrtv2) (Montreal), TfNSW (https://opendata.transport.nsw.gov.au/) (Sydney), and HSL (https://hsldevcom.github.io/gtfs%5frt/) (Helsinki). A comprehensive list of transit agencies that publish their GTFS data is available at TransitFeeds (https://transitfeeds.com/).

The static GTFS data streamed for public access by Calgary Transit is provided as eight delimited text files that include contact details of the transit agency, geographic location of the transit routes and stops, and schedule information (Table 2). With the exception of agency.txt, the remainder of the delimited text files are connected to one another using an entity relationship model (Figure 1). Scheduled updates of the static GTFS files are not consistent (e.g., every 2 weeks) but rather are contingent on the publishing schedule of the transit agency. For example, an update after 2 weeks may be followed by a subsequent update after 2 months.

TABLE 2 Static GTFS files (after Prommaharaj et al., 2020)

<table><thead valign="top"><tr><th align="left">Delimited file name</th><th align="left">Description</th></tr></thead><tbody><tr><td align="left">agency.txt</td><td align="left">Information about the transit agency</td></tr><tr><td align="left">routes.txt</td><td align="left">Transit route information</td></tr><tr><td align="left">trips.txt</td><td align="left">Trip ID details associated to the transit route</td></tr><tr><td align="left">stop_times.txt</td><td align="left">Scheduled arrival and departure time of the transit stop for each trip id</td></tr><tr><td align="left">stops.txt</td><td align="left">Transit stop details including x, y coordinates</td></tr><tr><td align="left">shapes.txt</td><td align="left">Transit route details including x, y coordinate paths that can form polylines</td></tr><tr><td align="left">calendar.txt</td><td align="left">Days of operation per transit route</td></tr><tr><td align="left">calendar_dates.txt</td><td align="left">Generic version of calendar.txt</td></tr></tbody></table>

From the files described in Table 2, routes.txt, stops.txt, and shapes.txt are used to create individual transit route lines (polylines) and the associated stops (points) as shapefiles to pinpoint GPS recorded locations of transit vehicles on the bus network, whereas trips.txt and stop_times.txt are used to derive the transit metrics presented in the next section. Table 3 contains summary counts for each static GTFS update, and the size of the Calgary Transit system based on the number of routes, trips, and transit stops during the data collection period (from March 15, 2021 to August 8, 2021). Like the inconsistent publishing schedule noted above, Table 3 reveals that the counts of each feature maintained by the transit agency vary considerably between updates (https://transitfeeds.com/l/186-calgary-ab-canada).

TABLE 3 Summary of Calgary transit system per GTFS update

<table><thead valign="top"><tr><th align="left">GTFS update</th><th align="left"># of routes</th><th align="left"># of trips</th><th align="left"># of stops</th></tr></thead><tbody><tr><td align="left">2021‐03‐15</td><td align="left">536</td><td align="left">62097</td><td align="left">6077</td></tr><tr><td align="left">2021‐04‐08</td><td align="left">534</td><td align="left">51859</td><td align="left">6076</td></tr><tr><td align="left">2021‐04‐23</td><td align="left">271</td><td align="left">22892</td><td align="left">6076</td></tr><tr><td align="left">2021‐05‐04</td><td align="left">539</td><td align="left">45406</td><td align="left">6077</td></tr><tr><td align="left">2021‐05‐11</td><td align="left">537</td><td align="left">46825</td><td align="left">6077</td></tr><tr><td align="left">2021‐05‐21</td><td align="left">270</td><td align="left">25352</td><td align="left">6077</td></tr><tr><td align="left">2021‐05‐26</td><td align="left">272</td><td align="left">23865</td><td align="left">6077</td></tr><tr><td align="left">2021‐06‐11</td><td align="left">415</td><td align="left">85593</td><td align="left">6100</td></tr><tr><td align="left">2021‐06‐22</td><td align="left">413</td><td align="left">86069</td><td align="left">6100</td></tr><tr><td align="left">2021‐08‐04</td><td align="left">148</td><td align="left">25305</td><td align="left">5727</td></tr><tr><td align="left">2021‐08‐06</td><td align="left">146</td><td align="left">22793</td><td align="left">5727</td></tr></tbody></table>
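The way shapes.txt yields route polylines can be sketched as follows. This is a minimal illustration using only the Python standard library: the column names follow the GTFS specification, while the sample values and the function name are invented for the example. The workflow in this article writes the results to shapefiles; the sketch stops at ordered coordinate lists.

```python
import csv
import io
from collections import defaultdict

# Tiny stand-in for shapes.txt; column names follow the GTFS specification,
# the values are invented for illustration.
SHAPES_TXT = """shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence
A,51.0450,-114.0600,2
A,51.0447,-114.0719,1
A,51.0453,-114.0500,3
B,51.0800,-114.1300,1
B,51.0810,-114.1200,2
"""


def build_polylines(shapes_file):
    """Group shape points by shape_id and order them by shape_pt_sequence."""
    points = defaultdict(list)
    for row in csv.DictReader(shapes_file):
        points[row["shape_id"]].append(
            (int(row["shape_pt_sequence"]),
             float(row["shape_pt_lon"]),   # x
             float(row["shape_pt_lat"]))   # y
        )
    # Sort each shape by sequence number, then drop the sequence number.
    return {
        shape_id: [(x, y) for _, x, y in sorted(pts)]
        for shape_id, pts in points.items()
    }


polylines = build_polylines(io.StringIO(SHAPES_TXT))
print(polylines["A"])
```

Sorting by shape_pt_sequence matters because the rows of shapes.txt are not guaranteed to appear in drawing order.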

The near real‐time GTFS (GTFS‐RT) data (separate from the static GTFS data in Table 2) record frequent updates (i.e., every 30 s) of each transit vehicle's GPS location, together with information about the traffic the vehicle experiences. Traffic information can include travel speed, congestion level, vehicle stop status, occupancy status, location, and timestamp (Google Developers, 2021). The GTFS‐RT file is packaged in binary protocol buffer (.pb) format, which must be parsed and translated into XML, JSON, or another machine‐readable format. The GTFS‐RT data provided by Calgary Transit include the positions (i.e., x, y coordinates) of every vehicle along with their trip_id number and timestamp. Each trip_id is associated with a specific transit route (i.e., route_id). Based on the published documentation, GTFS‐RT updates for Calgary Transit (https://data.calgary.ca/Transportation-Transit/Calgary-Transit-Realtime-Trip-Updates-GTFS-RT/gs4m-mdc2/data?no%5fmobile=true) occur every 30 s; however, the update intervals vary between vehicles. Some vehicle locations were updated more or less frequently than every 30 s, with an approximate maximum mean of 61 s and a minimum mean of 14 s for individual vehicles. The coefficient of variation is 7.1, indicating a relatively high dispersion around the mean for all vehicles, which is close to 30 s.
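One plausible way to examine the update intervals is sketched below. It computes the mean gap between consecutive position updates for each trip, then the coefficient of variation of those per-trip means. This is an assumption about how such statistics could be derived, not a description of the calculation used in this article, and the sample observations are invented.

```python
from collections import defaultdict
from statistics import mean, pstdev

# (trip_id, POSIX timestamp) pairs; values invented for illustration.
observations = [
    ("trip_1", 0), ("trip_1", 30), ("trip_1", 60), ("trip_1", 90),
    ("trip_2", 0), ("trip_2", 20), ("trip_2", 40),
    ("trip_3", 0), ("trip_3", 40), ("trip_3", 80),
]


def mean_update_intervals(obs):
    """Return {trip_id: mean seconds between consecutive position updates}."""
    stamps = defaultdict(list)
    for trip_id, ts in obs:
        stamps[trip_id].append(ts)
    result = {}
    for trip_id, ts_list in stamps.items():
        ts_list.sort()
        gaps = [b - a for a, b in zip(ts_list, ts_list[1:])]
        if gaps:  # a trip seen only once has no interval
            result[trip_id] = mean(gaps)
    return result


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean, in percent."""
    return 100.0 * pstdev(values) / mean(values)


per_trip = mean_update_intervals(observations)
cv = coefficient_of_variation(list(per_trip.values()))
print(per_trip, round(cv, 1))
```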

The workflow implemented to harvest and assemble these data into a form that allows the reliability of the Calgary PT system, as measured by on‐time and route speed performance, to be assessed is described in the next section.

DATA WORKFLOW

The most challenging aspect of working with GTFS data (real‐time and static) is establishing a reliable workflow that can harvest near real‐time and static GTFS files and estimate transit metrics, while controlling computation and technical overhead costs for the data collection, storage, and analytics. For ease of readability, the workflow shown in Figure 2 is split into three sections, namely data harvesting, data processing, and data storage.

Each component is managed in its own Microsoft Azure virtual machine (VM) with the latter chained dependently to the former based on scheduled tasks. For example, the data processing is scheduled to start at the end of the evening (e.g., 10:00 p.m.) once the data harvesting task is complete, and the data storage receives transit metric output when data processing is complete. Azure VMs were chosen for development purposes due to the flexibility of server type, range of database choices, ability to scale computational capacity (e.g., storage, RAM, and CPUs) on‐the‐fly, and the ability to set automated tasks (i.e., start up and deallocate).

Data harvesting

In the Azure portal, an automated task containing the data harvesting component is set to start up a dual‐core Ubuntu 20.04 VM (i.e., mini VM) once per day at 6:00 a.m. Mountain Time and shut down at 8:00 p.m. This time window minimizes runtime costs in the long‐term while capturing the most important travel periods (i.e., morning, afternoon, and evening) to allow analysis of spatiotemporal variations in performance of the PT system. Upon startup, a scheduled task initiates a custom Python script which harvests near real‐time GTFS data every 30 s and appends the result to a CSV file named for the date of collection (e.g., "GTFSRT_Calgary_2021-6-28.csv"). The most notable Python packages used in this code are Requests (https://docs.python-requests.org/en/latest/) to download the PB file, google.transit (https://developers.google.com/transit/gtfs-realtime/examples/python-sample) to parse it, and Pandas (https://pandas.pydata.org/) to structure and append data to the CSV file.
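The harvesting loop can be sketched roughly as follows. This is a minimal illustration, not the authors' script: the function names and the CSV column list are hypothetical, the feed URL is left as a parameter, and the standard-library csv module is swapped in for Pandas so that the file handling is easy to follow. The parsing calls (`FeedMessage`, `ParseFromString`, `HasField`) come from the gtfs-realtime-bindings package that provides `google.transit`.

```python
import csv
import os
import time
from datetime import date

# Hypothetical column list for the daily CSV.
FIELDS = ["trip_id", "latitude", "longitude", "timestamp"]


def fetch_feed(url):
    """Download and parse one GTFS-RT protocol buffer snapshot.

    Third-party imports are kept local so the other helpers can be
    exercised without these packages installed.
    """
    import requests
    from google.transit import gtfs_realtime_pb2

    response = requests.get(url, timeout=20)
    response.raise_for_status()
    feed = gtfs_realtime_pb2.FeedMessage()
    feed.ParseFromString(response.content)
    return feed


def feed_to_rows(feed):
    """Flatten the vehicle positions in a parsed feed into plain dicts."""
    rows = []
    for entity in feed.entity:
        if not entity.HasField("vehicle"):
            continue  # skip entities that carry no vehicle position
        vehicle = entity.vehicle
        rows.append({
            "trip_id": vehicle.trip.trip_id,
            "latitude": vehicle.position.latitude,
            "longitude": vehicle.position.longitude,
            "timestamp": vehicle.timestamp,
        })
    return rows


def append_rows(rows, folder=".", day=None):
    """Append rows to the CSV named for the collection date."""
    day = day or date.today()
    # Mirrors the file name pattern quoted in the text (no zero padding).
    name = f"GTFSRT_Calgary_{day.year}-{day.month}-{day.day}.csv"
    path = os.path.join(folder, name)
    is_new = not os.path.exists(path)
    with open(path, "a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()  # header only once per daily file
        writer.writerows(rows)
    return path


def harvest(url, interval_s=30, cycles=1):
    """Poll the feed `cycles` times, pausing `interval_s` seconds between polls."""
    for i in range(cycles):
        append_rows(feed_to_rows(fetch_feed(url)))
        if i < cycles - 1:
            time.sleep(interval_s)
```

Appending to a single file per day keeps each harvest cycle cheap and leaves one self-contained file to hand over to the processing step.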

At the end of the harvesting period, the size of the CSV file on an average weekday is 72 MB or 600,000 observations of all available vehicle locations within the Calgary Transit network. Weekend days produced files about half that size. This aggregates to approximately 1.4 GB (i.e., 14.4 million rows) per month, or a total of 11 GB of raw data collected from the study period of January 2021 to August 2021. Before the VM shuts down, a further scheduled task uses the Python subprocess module (https://docs.python.org/3/library/subprocess.html) to transfer the CSV file securely to another Azure VM for data processing. For security purposes, a copy of the raw data is also kept on the mini VM as a redundant backup.

Data processing

The data processing step in Figure 2 seeks to optimize runtime performance. It is run in parallel to reduce runtime costs on a 96‐core Azure Ubuntu 20.04 VM with ArcGIS Server installed. This VM was chosen because it had the highest number of cores and RAM (384 GB) available in Azure to maximize parallel processing performance and to achieve the level of computing capacity required. To gauge the runtime difference, this component was also run on a 4‐core PC with hyperthreading, which took about 4 h to complete. The same task with parallel processing on the 96‐CPU VM completes within 20 min. This substantial runtime difference exceeds the speedup that Amdahl's Law (Amdahl, 1967) would predict, given that approximately 95% of the processing steps run in parallel.
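For reference, Amdahl's Law gives the theoretical speedup on n workers as 1 / ((1 - p) + p / n), where p is the parallel share of the work. The short calculation below applies it to the figures quoted above (p = 0.95, 4 cores versus 96 cores, about 4 h versus about 20 min). It is a back-of-envelope sketch that ignores hyperthreading and any differences in clock speed, memory, or storage between the two machines, all of which may contribute to the gap between predicted and observed values.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Theoretical speedup over a single worker under Amdahl's Law."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


P = 0.95  # parallel share quoted in the text

speedup_4 = amdahl_speedup(P, 4)     # 4-core PC
speedup_96 = amdahl_speedup(P, 96)   # 96-core VM
predicted_ratio = speedup_96 / speedup_4

observed_ratio = (4 * 60) / 20  # about 4 h versus about 20 min

print(round(speedup_4, 2), round(speedup_96, 2),
      round(predicted_ratio, 2), observed_ratio)
```

Under these assumptions the law predicts that moving from 4 to 96 cores shortens the runtime by a factor of about 4.8, compared with the factor of about 12 reported.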

The processing steps are shown in the right‐hand side of Figure 3. Specifically, these are: (1) Check static GTFS for updates; (2) Geoprocess vehicle locations; (3) Extract geographic information; (4) Quality Assurance (QA) and Quality Check (QC); (5) Enrich data; (6) Interpolate speed & arrival time; (7) Calculate metrics; and (8) Export to MongoDB. Steps 1 through 7 run in parallel.

Similar to the data harvesting component, an automated daily task is set to start up the 96‐CPU VM at 8:00 p.m. Mountain Time. This initiates the entire data processing workflow, starting with an update check of the static GTFS files. This first step is crucial because it ensures that the transit metrics are calculated correctly by aligning the date of the harvested data with the most recent schedule (i.e., stop_times.txt) and its corresponding route (i.e., shapes.txt, trips.txt) and stop (i.e., stops.txt) information (Table 2). The update check is completed by extracting the date of the most recent GTFS static files from a transit feed site (https://transitfeeds.com/p/calgary-transit/238) provided by Calgary Transit and comparing it with the date of the harvested CSV file. For example, if the CSV file was collected on March 15th, the existing GTFS static files are from February 15th, and the most recent update on the site is still February 15th, then there is no need to update. On the other hand, if the most recent update on the site is March 13th, then those files replace the existing ones and are used until the next GTFS update is available.
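The update check reduces to a date comparison, which can be sketched as below. The function name and arguments are hypothetical, and the step that actually reads the release dates from the transit feed site is omitted; the sketch assumes the list of published release dates is already available.

```python
from datetime import date


def select_static_version(harvest_day, current_version, published_versions):
    """Return the static GTFS release date to use for `harvest_day`.

    The newest release published on or before the harvest date is chosen.
    If it is not newer than the copy already held, the current copy is kept.
    """
    eligible = [v for v in published_versions if v <= harvest_day]
    newest = max(eligible, default=current_version)
    return newest if newest > current_version else current_version


harvest_day = date(2021, 3, 15)
current = date(2021, 2, 15)

# Case 1: the site still lists the February release, so nothing changes.
unchanged = select_static_version(harvest_day, current, [date(2021, 2, 15)])

# Case 2: a March 13th release has appeared, so it replaces the current copy.
updated = select_static_version(
    harvest_day, current, [date(2021, 2, 15), date(2021, 3, 13)]
)
print(unchanged, updated)
```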

The GTFS update process includes the creation of transit stops per route and individual routes as Esri shapefiles. Shapefiles were chosen rather than feature classes in a geodatabase because they can be created and accessed in parallel. For the individual routes, two shapefiles are created in Step 1 for the requirements of Steps 2–4 and 6. An undissolved file contains individual polyline segments that represent a path toward each transit stop for that route and a second file dissolves the segments from the first file to optimize processing efficiency in the geoprocessing step.

The geoprocessing vehicle locations step (Step 2) identifies the exact location for each trip_id that is associated with its transit route using the dissolved and undissolved shapefiles. This is done by: (1) snapping the recorded trip_id to the nearest dissolved line and taking its x, y coordinates; (2) iteratively identifying which dissolved line the snapped point is within and acquiring the stop sequence (e.g., 4th stop of 72); (3) querying the undissolved segments by stop sequence to have a subset list of candidates, and (4) iterating each candidate segment until the correct snapped point is found.

While these steps may initially seem to be redundant, they reduce the processing time of Step 2 by up to 80% as opposed to identifying directly which undissolved line the trip_id is within. Figure 4 illustrates an example of the geoprocessing steps. The red triangle in the yellow circle indicates the target transit stop that the vehicle is approaching. All of the described processes in Step 2 are accomplished using the ArcGIS API for Python (https://developers.arcgis.com/python/api-reference/) and ArcPy (https://www.esri.com/en-us/arcgis/products/arcgis-python-libraries/libraries/arcpy) packages in ArcGIS Pro. Step 3 extracts the geographic information (i.e., the index of each undissolved segment identified) of all recorded trip_id vehicle movements and creates a Pandas DataFrame (Table 4) that requires validation in Step 4 (Table 5).
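The core of Step 2, snapping a recorded position to the route and then finding the undissolved segment that contains the snapped point, can be illustrated with plain planar geometry. The sketch below is a simplified stand-in written with the standard library only; it does not use ArcPy or the ArcGIS API for Python, it omits the stop-sequence filtering of candidate segments described above, and the coordinates and function names are invented.

```python
import math


def project_onto_segment(p, a, b):
    """Return the point on segment a-b closest to p (planar coordinates)."""
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return a  # degenerate segment
    t = ((p[0] - ax) * dx + (p[1] - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))  # clamp to the segment ends
    return (ax + t * dx, ay + t * dy)


def snap_to_polyline(p, polyline):
    """Snap p to the nearest location on a polyline given as a vertex list."""
    best, best_dist = None, math.inf
    for a, b in zip(polyline, polyline[1:]):
        q = project_onto_segment(p, a, b)
        d = math.dist(p, q)
        if d < best_dist:
            best, best_dist = q, d
    return best


def find_segment(snapped, segments, tolerance=1e-9):
    """Return the index of the first undissolved segment containing `snapped`."""
    for index, polyline in enumerate(segments):
        q = snap_to_polyline(snapped, polyline)
        if math.dist(snapped, q) <= tolerance:
            return index
    return None


# Undissolved segments: one short polyline per stop-to-stop path.
undissolved = [
    [(0.0, 0.0), (10.0, 0.0)],
    [(10.0, 0.0), (10.0, 10.0)],
    [(10.0, 10.0), (20.0, 10.0)],
]
# Dissolved route: the same path merged into a single polyline.
dissolved = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (20.0, 10.0)]

gps_fix = (11.0, 4.0)  # raw vehicle position, slightly off the line
snapped = snap_to_polyline(gps_fix, dissolved)
segment_index = find_segment(snapped, undissolved)
print(snapped, segment_index)
```

Snapping once against the dissolved line and only then testing individual segments avoids projecting the raw position onto every undissolved segment of the route.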

TABLE 4 Data structure when extracting the vehicle's geographic information

<table><thead valign="top"><tr><th align="left">Field name</th><th align="left">Description</th></tr></thead><tbody><tr><td align="left">TripIDs</td><td align="left">The trip_id's number (e.g., 54827047)</td></tr><tr><td align="left">Undiss_Ids</td><td align="left">The index value of the identified undissolved segment</td></tr><tr><td align="left">Veh_y_loc</td><td align="left">The snapped y‐coordinate of the vehicle's location</td></tr><tr><td align="left">Veh_x_loc</td><td align="left">The snapped x‐coordinate of the vehicle's location</td></tr><tr><td align="left">Veh_Movement</td><td align="left">The nth time/movement in which the vehicle was recorded in the GTFS‐RT.</td></tr><tr><td align="left">Local_Time</td><td align="left">The timestamp of the vehicle being recorded</td></tr><tr><td align="left">Uniquer</td><td align="left">A unique field (TripIDs + Veh_Movement)</td></tr></tbody></table>

Most transit networks will have bus routes with looped paths (e.g., cul‐de‐sacs and loops around a neighborhood route) that return back to the same path but travel in the opposite direction. Cases such as these in a GIS are problematic as they contain overlapping polylines and whenever a vehicle location is snapped to a loop path, it will output duplicate segments with different index values, as shown in Figure 5.

Without identifying and deleting the improper index values, the result can produce problematic outputs downstream in the workflow including miscalculated transit metrics. Step 4 (i.e., Quality Check and Quality Assurance) prevents this from happening by assessing the trend of the index order of the identified undissolved segments and omitting cases that are out of place (Table 5). In Table 5, the trending index order goes from 193 to 225.

TABLE 5 Example of duplicate segment (in orange)

<table><thead valign="top"><tr><th align="left">TripID</th><th align="left">Undiss_Ids (i.e., index value)</th><th align="left">Local time</th><th align="left">Vehicle movement</th><th align="left">Omit</th></tr></thead><tbody><tr><td align="left">54827047</td><td align="left">193</td><td align="left">2021‐09‐30 06:00:15</td><td align="left">0</td><td align="left">No</td></tr><tr><td align="left">54827047</td><td align="left">201</td><td align="left">2021‐09‐30 06:00:42</td><td align="left">1</td><td align="left">No</td></tr><tr><td align="left">54827047</td><td align="left">226</td><td align="left">2021‐09‐30 06:01:12</td><td align="left">2</td><td align="left">Yes</td></tr><tr><td align="left">54827047</td><td align="left">207</td><td align="left">2021‐09‐30 06:01:12</td><td align="left">2</td><td align="left">No</td></tr><tr><td align="left">54827047</td><td align="left">215</td><td align="left">2021‐09‐30 06:01:42</td><td align="left">3</td><td align="left">No</td></tr><tr><td align="left">54827047</td><td align="left">225</td><td align="left">2021‐09‐30 06:02:12</td><td align="left">4</td><td align="left">No</td></tr></tbody></table>
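One way to express the trend check is sketched below, using the rows of Table 5. The rule is an assumed heuristic, not necessarily the authors' exact logic: when a vehicle movement has several candidate segments, keep the smallest index that does not fall behind the last accepted index. The function name and dictionary keys are hypothetical.

```python
def flag_out_of_trend(records):
    """Flag duplicate candidates that break the increasing index trend.

    records: list of dicts with 'movement' and 'undiss_id' keys.
    Returns new dicts with an added boolean 'omit' key.
    """
    # Group candidate segments by vehicle movement, preserving input order.
    groups = {}
    for rec in records:
        groups.setdefault(rec["movement"], []).append(rec)

    flagged = []
    last_kept = None
    for movement in sorted(groups):
        candidates = groups[movement]
        if last_kept is None:
            valid = candidates
        else:
            # Prefer candidates that do not move backwards along the route.
            valid = [c for c in candidates if c["undiss_id"] >= last_kept] or candidates
        keep = min(valid, key=lambda c: c["undiss_id"])
        for c in candidates:
            flagged.append({**c, "omit": c is not keep})
        last_kept = keep["undiss_id"]
    return flagged


# Index values and vehicle movements taken from Table 5.
table5 = [
    {"movement": 0, "undiss_id": 193},
    {"movement": 1, "undiss_id": 201},
    {"movement": 2, "undiss_id": 226},
    {"movement": 2, "undiss_id": 207},
    {"movement": 3, "undiss_id": 215},
    {"movement": 4, "undiss_id": 225},
]
checked = flag_out_of_trend(table5)
```

With this rule, index 226 is the only row flagged for omission, which matches the Omit column of Table 5.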

The fifth step (i.e., Enrich Data and append GTFS schedule) adds more features to the clean data structure. These features include the maximum index of the undissolved segments in the route, the maximum stop sequence, the number of stops remaining from the maximum stop sequence, the arrival and departure times from stop_times.txt, and a label for the vehicle's movement status. The labeling iteratively takes each consecutive trip_id pair and calculates the distance traveled and the time delta between them. If the distance traveled within the time delta is <20 m, the pair is labeled "stationary," unless it is within proximity of the last stop of the route, in which case it is labeled "terminus." If neither condition is met, it is marked as "movement." The purpose of these subprocesses is to prepare the data structure for the interpolation step.
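The labeling rule can be written as a small function. This is a sketch under stated assumptions: the 20 m threshold comes from the text, but the proximity radius around the last stop is not given, so the 50 m default below is purely illustrative, as are the function and parameter names.

```python
def movement_status(distance_m, dist_to_last_stop_m,
                    stationary_threshold_m=20.0, terminus_radius_m=50.0):
    """Label a consecutive pair of recordings for one trip_id.

    distance_m: distance traveled between the two recordings.
    dist_to_last_stop_m: distance from the vehicle to the route's last stop.
    terminus_radius_m: assumed proximity radius (not specified in the text).
    """
    if distance_m < stationary_threshold_m:
        if dist_to_last_stop_m <= terminus_radius_m:
            return "terminus"
        return "stationary"
    return "movement"


labels = [
    movement_status(5.0, 1200.0),   # barely moved, far from the last stop
    movement_status(5.0, 30.0),     # barely moved, near the last stop
    movement_status(140.0, 30.0),   # moved more than the threshold
]
```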

Step 6 calculates the projected vehicle speed and estimated bus arrival time per transit stop from the start of the recorded trip_id to the end. First, the process has to classify further the type of movement in each consecutive trip_id pair based on the stop sequence and undissolved index value deltas. For instance, if the vehicle was at stop sequence 5 with an index value of 120 and the next update was at stop sequence 8 with an index value of 145, then it would be classified as going through multiple stops due to the stop sequence delta of three. On the other hand, if the update had a stop sequence delta of one or none, it would be classified as a single stop or same stop, respectively.
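The classification by stop sequence delta can be sketched as follows. The function name is hypothetical, and treating a negative delta as an error is an assumption made here on the basis that Step 4 has already removed out-of-order observations.

```python
def classify_pair(seq_start, seq_end):
    """Classify a consecutive trip_id pair by its stop sequence delta."""
    delta = seq_end - seq_start
    if delta > 1:
        return "multiple stops"
    if delta == 1:
        return "single stop"
    if delta == 0:
        return "same stop"
    # Assumed: a decreasing stop sequence should not survive the QA/QC step.
    raise ValueError(f"stop sequence decreased by {-delta}")


# The example from the text: stop sequence 5 followed by stop sequence 8.
kind = classify_pair(5, 8)
```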

The idea of the first sub‐step is to expand the current data structure between consecutive trip_id pairs, particularly multiple and single stops that have not been recorded explicitly in the GTFS‐RT. Second, the process iteratively calculates the distance (in kilometers) between the consecutive trip_id pairs along the route path. Distance is derived from the undissolved index value deltas. The third sub‐step calculates projected speed (km/h) by dividing the calculated distance by the time delta of the consecutive trip_id pair. Fourth, the estimated travel time (in seconds) per stop is calculated by taking the distance from the first recorded point (1st of the consecutive pair) to the stop, divided by the projected speed.
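A minimal sketch of the distance and speed sub-steps is given below. It assumes that each undissolved segment has a known length and that the distance between two index values is the sum of the segment lengths between them; the text does not spell out this derivation, so the prefix-sum approach, the segment lengths, and all names are illustrative.

```python
from datetime import datetime
from itertools import accumulate


def cumulative_lengths(segment_lengths_km):
    """Prefix sums: entry i is the route distance up to the start of segment i."""
    return [0.0] + list(accumulate(segment_lengths_km))


def distance_between(cum_km, idx_start, idx_end):
    """Distance along the route between two undissolved segment indices."""
    return cum_km[idx_end] - cum_km[idx_start]


def projected_speed_kmh(distance_km, t_start, t_end):
    """Projected speed from distance traveled over the time delta."""
    hours = (t_end - t_start).total_seconds() / 3600.0
    if hours <= 0:
        raise ValueError("recordings must be in increasing time order")
    return distance_km / hours


# Hypothetical route of ten 50 m segments, with recordings one minute apart.
cum = cumulative_lengths([0.05] * 10)
dist_km = distance_between(cum, 2, 8)
speed = projected_speed_kmh(
    dist_km,
    datetime(2021, 6, 28, 19, 22, 0),
    datetime(2021, 6, 28, 19, 23, 0),
)
```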

The estimated travel time is added cumulatively from the recorded first point's timestamp to each stop it passes through (if applicable) until reaching the second point (2nd of the consecutive pair). The final sub‐step (i.e., Step 7) calculates the time delta between the estimated arrival time and expected arrival time provided by stop_times.txt per stop. A positive time delta per transit stop may indicate that the vehicle arrived either on‐time or early, whereas a negative time delta indicates that the bus is late. These on‐time performance metrics are further refined in the next step. Figure 6 and Table 6 illustrate, respectively, the sub‐steps that need to be completed in order to calculate on‐time performance and a sample data structure of the output.

TABLE 6 Sample output of the data structure after the interpolation process

<table><thead valign="top"><tr><th align="left">TripID</th><th align="left">StopID</th><th align="left">Seq</th><th align="left">Stat</th><th align="left">Mvm</th><th align="left">Dist (km)</th><th align="left">Proj. speed (km/h)</th><th align="left">Rec. time</th><th align="left">Proj. travel time (s)</th><th align="left">Est. arr. time</th><th align="left">Exp. arr. time</th><th align="left">Arr. time delta (s)</th><th align="left">Vehicle pos.</th></tr></thead><tbody><tr><td align="left">57010620</td><td align="left">6025</td><td align="left">3</td><td align="left">0</td><td align="left">Rec.—Start</td><td align="char" char=".">0.122</td><td align="char" char=".">18.2</td><td align="left">6/28/21 19:22:00</td><td align="left">24</td><td align="left">6/28/21 19:22:24</td><td align="left">6/28/21 19:21:00</td><td align="left">−84</td><td align="left">−113.9860, 51.05922</td></tr><tr><td align="left">57010620</td><td align="left">7563</td><td align="left">4</td><td align="left">0</td><td align="left">Btwn.</td><td align="char" char=".">0.303</td><td align="char" char=".">18.2</td><td align="left" /><td align="left">60</td><td align="left">6/28/21 19:23:23</td><td align="left">6/28/21 19:22:00</td><td align="left">−83</td><td align="left" /></tr><tr><td align="left">57010620</td><td align="left">5872</td><td align="left">5</td><td align="left">0</td><td align="left">Rec.—End</td><td align="char" char=".">0.062</td><td align="char" char=".">18.2</td><td align="left">6/28/21 19:25:37</td><td align="left" /><td align="left" /><td align="left" /><td align="left" /><td align="left" /></tr><tr><td align="left">57010620</td><td align="left">5872</td><td align="left">5</td><td align="left">1</td><td align="left">Rec.—Start</td><td align="char" char=".">0.062</td><td align="char" char=".">34.8</td><td align="left">6/28/21 19:25:37</td><td align="left">6</td><td align="left">6/28/21 19:25:43</td><td align="left">6/28/21 19:23:00</td><td align="left">−163</td><td 
align="left">−113.9802, 51.05953</td></tr><tr><td align="left">57010620</td><td align="left">6227</td><td align="left">6</td><td align="left">1</td><td align="left">Rec.—End</td><td align="char" char=".">0.059</td><td align="char" char=".">34.8</td><td align="left">6/28/21 19:27:06</td><td align="left" /><td align="left" /><td align="left" /><td align="left" /><td align="left" /></tr></tbody></table>

As an example of the on‐time performance for a specific trip, in the 1st row of Table 6, TripID 57010620 was first recorded by the GTFS‐RT on 6/28/2021 at 19:22:00. The approximate location of the bus is identified as en route to StopID 6025 with a remaining distance of 0.122 km. The next recording (3rd row) is used to calculate speed (Proj. Speed) based on distance traveled over time delta. Projected speed and the remaining distance are used to estimate the arrival time (Est. Arr. Time), which is 19:22:24. The expected arrival time (Exp. Arr. Time) for StopID 6025 is 19:21:00. Hence, the arrival time delta indicates the bus would arrive 84 s later than scheduled. The 2nd row shows the movement status (Mvm) set to "between" (Btwn.) because the vehicle passed by StopID 7563 with no recording. The first consecutive recording is assigned a movement counter (Stat) of 0, which increments by 1 for each subsequent recording.
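The arithmetic for the 1st row of Table 6 can be reproduced in a few lines. This is a sketch that assumes the travel time is rounded to whole seconds, as the table values suggest; the function names are hypothetical.

```python
from datetime import datetime, timedelta


def travel_time_s(distance_km, speed_kmh):
    """Estimated travel time in seconds: distance divided by projected speed."""
    return distance_km / speed_kmh * 3600.0


def arrival_delta_s(expected, estimated):
    """Expected minus estimated arrival: positive is early, negative is late."""
    return (expected - estimated).total_seconds()


# Values from the 1st row of Table 6 (StopID 6025).
recorded = datetime(2021, 6, 28, 19, 22, 0)
expected = datetime(2021, 6, 28, 19, 21, 0)
tt = round(travel_time_s(0.122, 18.2))           # assumed rounding to seconds
estimated = recorded + timedelta(seconds=tt)      # 19:22:24
delta = arrival_delta_s(expected, estimated)      # negative, i.e., late
```

The sign convention follows the text: a negative delta means the bus is estimated to arrive after its scheduled time.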

After interpolation, the penultimate step (i.e., calculate metrics) calculates the actual headway and extracts the results of on‐time performance. The output for actual headway is computed by subtracting the expected arrival time interval (i.e., expected headway) between two different consecutive trip_ids from the estimated arrival time interval (i.e., estimated headway) (Table 7). If the actual headway is greater than the expected headway, the trip_ids (i.e., vehicles) are considered to be further apart than expected, whereas if the actual headway is less than the expected headway, consecutive buses are more bunched than they would be if the system were operating efficiently.

TABLE 7 Sample data structure of expected headway versus actual headway

<table><thead valign="top"><tr><th align="left">StopID</th><th align="left">StpSeq</th><th align="left">TripID1</th><th align="left">TripID2</th><th align="left">Headway_Expected</th><th align="left">Headway_Actual</th></tr></thead><tbody><tr><td align="left">6025</td><td align="left">3</td><td align="left">57010622</td><td align="left">57010620</td><td align="left">35</td><td align="left">38.7</td></tr><tr><td align="left">7563</td><td align="left">4</td><td align="left">57010622</td><td align="left">57010620</td><td align="left">35</td><td align="left">38.4</td></tr><tr><td align="left">5872</td><td align="left">5</td><td align="left">57010622</td><td align="left">57010620</td><td align="left">35</td><td align="left">40.7</td></tr><tr><td align="left">6227</td><td align="left">6</td><td align="left">57010622</td><td align="left">57010620</td><td align="left">35</td><td align="left">38.2</td></tr><tr><td align="left">4330</td><td align="left">7</td><td align="left">57010622</td><td align="left">57010620</td><td align="left">35</td><td align="left">36</td></tr><tr><td align="left">4329</td><td align="left">8</td><td align="left">57010622</td><td align="left">57010620</td><td align="left">35</td><td align="left">34.7</td></tr><tr><td align="left">4328</td><td align="left">9</td><td align="left">57010622</td><td align="left">57010620</td><td align="left">35</td><td align="left">33.3</td></tr><tr><td align="left">6228</td><td align="left">10</td><td align="left">57010622</td><td align="left">57010620</td><td align="left">35</td><td align="left">32.5</td></tr></tbody></table>
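The comparison in Table 7 can be sketched as below. Two assumptions are made here: the headway columns are read as minutes (the table does not state units), and the actual headway is taken to be the interval between the estimated arrivals of two consecutive trips at the same stop. The timestamps are hypothetical values chosen to reproduce the first row (35 versus 38.7).

```python
from datetime import datetime


def headway_min(arrival_first, arrival_second):
    """Interval in minutes between two consecutive trips at the same stop."""
    return (arrival_second - arrival_first).total_seconds() / 60.0


def spacing_label(actual_min, expected_min):
    """Compare actual and expected headway as described in the text."""
    if actual_min > expected_min:
        return "further apart"
    if actual_min < expected_min:
        return "bunched"
    return "as scheduled"


# Hypothetical scheduled and estimated arrivals of two trips at one stop.
expected = headway_min(datetime(2021, 6, 28, 19, 21, 0),
                       datetime(2021, 6, 28, 19, 56, 0))
actual = headway_min(datetime(2021, 6, 28, 19, 22, 24),
                     datetime(2021, 6, 28, 20, 1, 6))
label = spacing_label(actual, expected)
```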

Currie et al. (2012) noted that the Independent Transport Safety and Reliability Regulator (ITSRR) of New South Wales, the Department of Victoria, and Transport for London (TfL) each used an on‐time threshold of 2 min early and 5 min late, whereas 83 transit agencies in the United States adopted 1 min early and 5 min late thresholds. Hence, there is no global standard for on‐time performance; rather, it tends to be defined differently by different agencies.

For the purposes of this article, on‐time performance is evaluated from the interpolation process within a 2 to 5 min time frame by classifying each observation as early (>5 min), on‐time (−2 to 5 min), or late (<−2 min). Importantly, these thresholds can be easily changed to review the impact of different times on the same data. In this case, the evaluation results in two final shapefile outputs. The first is the percentage of on‐time buses per stop per hour per route, and the second is similar except aggregated per day. Formatted date, average speed and arrival time, count of trips (i.e., trip_ids of that route) that are early, late, and on‐time, and total trips are additional features in both outputs. The final step of the workflow exports both outputs of on‐time performance and headway in JSON format to a MongoDB datastore. Details of the data storage component are discussed in the next section.
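The classification and the percentage of on-time observations can be sketched as follows, using the thresholds exactly as stated above (in seconds) and exposing them as parameters so they can be changed. The function names are hypothetical, and the three deltas are taken from Table 6 for illustration only.

```python
def classify_on_time(delta_s, late_limit_s=-120.0, early_limit_s=300.0):
    """Classify an arrival time delta (expected minus estimated, in seconds)."""
    if delta_s > early_limit_s:
        return "early"
    if delta_s < late_limit_s:
        return "late"
    return "on-time"


def percent_on_time(deltas_s, **limits):
    """Percentage of observations classified as on-time, or None if empty."""
    labels = [classify_on_time(d, **limits) for d in deltas_s]
    if not labels:
        return None
    return 100.0 * labels.count("on-time") / len(labels)


# Arrival time deltas (s) from Table 6.
deltas = [-84.0, -83.0, -163.0]
labels = [classify_on_time(d) for d in deltas]
share = percent_on_time(deltas)
# A stricter late limit changes the result on the same data.
strict_share = percent_on_time(deltas, late_limit_s=-60.0)
```

In the workflow this percentage is aggregated per stop, per hour (or day), and per route; the grouping is omitted here for brevity.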

Data storage

GTFS‐RT data tend to come at high volume and high velocity and therefore the selected data store must be able to handle potentially huge amounts of data that arrive in a short period of time. Specifically, the selected data store needs to be horizontally scalable to support read/write operations in parallel as the size of the stored data and number of end users querying the data are both likely to grow over time. When a system is horizontally scalable, it supports adding more servers to the resource pool as needed. Although relational databases are vertically scalable, meaning they support increasing the processing power of a single server, most relational databases are not horizontally scalable.

Hence, when it comes to the need for a highly scalable distributed storage system, NoSQL data stores stand out. MongoDB is an open‐source NoSQL document‐based data store that is capable of handling large volumes of structured, semi‐structured, or even unstructured data efficiently. It provides a reliable solution through an implementation known as a replica set. This implementation is a collection of MongoDB instances that all store identical data, and as such it provides a highly available form of data storage that has high fault tolerance and can survive in the event of some of its components failing. Another benefit of MongoDB is its ability to store spatial data efficiently through the GeoJSON object types.
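To illustrate the GeoJSON support mentioned above, the sketch below builds one document of the kind that could be stored per stop and hour. Every field name is hypothetical and is not the schema used by the authors; only the `location` object follows the GeoJSON Point convention, in which coordinates are ordered longitude, then latitude. The coordinate and speed values are borrowed from Table 6 purely as sample numbers.

```python
import json


def stop_metric_document(stop_id, route_id, lon, lat, hour_start_iso,
                         pct_on_time, avg_speed_kmh):
    """Build a JSON-serializable document with a GeoJSON Point location."""
    return {
        "stop_id": stop_id,
        "route_id": route_id,
        "hour_start": hour_start_iso,
        "pct_on_time": pct_on_time,
        "avg_speed_kmh": avg_speed_kmh,
        # GeoJSON Point: [longitude, latitude].
        "location": {"type": "Point", "coordinates": [lon, lat]},
    }


doc = stop_metric_document(
    stop_id="6025",
    route_id="149-1490022",
    lon=-113.9860,
    lat=51.05922,
    hour_start_iso="2021-06-28T19:00:00",
    pct_on_time=66.7,
    avg_speed_kmh=18.2,
)
payload = json.dumps(doc)
# With PyMongo, such a document could be stored via collection.insert_one(doc).
```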

Although the data collected for the workflow described in this article are small in the context of "big" data (e.g., 100 GB+), a highly scalable NoSQL solution such as MongoDB becomes relevant as these data continue to be collected from Calgary Transit and as multiple users query them simultaneously. For this research, MongoDB was installed on a 4‐core CPU Azure VM with Ubuntu Server 20.04 to store the GTFS data after processing. Using PyMongo (https://pymongo.readthedocs.io/en/stable/) and the ArcGIS API for Python packages, a custom tool was created to query transit metrics on‐the‐fly, convert the query return to a spatial dataframe, and visualize the results in a custom PT service performance dashboard.

DATA VISUALIZATION

To visualize the on‐time reliability of bus arrivals and departures at all stops on the network for route and adjustable time‐specific queries, a web GIS dashboard was developed using Esri Calcite maps with accompanying descriptive analytics using the ArcGIS API for JavaScript. The core metrics used to assess reliability include "on‐the‐fly" average percentage of on‐time arrivals, vehicle speed (km/h), and arrival time variance (seconds). The user interface for the dashboard is shown in Figure 7. The metrics respond dynamically to parameters such as the map's current geographic extent (as the user zooms and pans), the selection of a specific bus route, and an hourly animation that displays spatiotemporal changes in system performance. In a subsequent implementation, headway metrics will be added to the dashboard.

To display core metrics on the dashboard, the user is required first to select a date range. The date range is contingent on the available data stored in MongoDB, which in this case ranges from January 2021 until the present day of collection. Using the date range query, the time slider widget automatically populates the start and end date with the default interval set to hourly. The user can freely change the interval extent. The hourly intervals can be animated automatically by clicking the play/pause button on the time slider, or the slider can be moved manually by the user. The example in Figure 7 shows a date range queried from March 15th until March 16th, 2021 with the time slider set to the hour of 6 a.m. Mountain Time. Thus, the results show the entire transit network with an overall average on‐time performance of 85%, vehicle speed of 29 km/h, and an arrival time variance at the stops of −2 s.

The date range data can be queried using the optional parameters of day of the week (e.g., Mondays, Wednesdays) and/or the selection of a specific bus route to explore more granular spatiotemporal patterns. For example, a user may want to examine transit reliability during Wednesday rush hour between March 15th and April 15th, 2021, to investigate temporal variations caused by localized weather, for which Calgary is known due to its high altitude and chinook winds (Gough, 2008; Hasan & Barker, 1999; Zhou et al., 2017). Alternatively, selecting a specific bus route can reveal spatiotemporal patterns along its length over the prescribed time period and thereby identify intra‐route performance variations. These insights can help transit planners identify potential causes of performance degradation, such as traffic congestion, the number of traffic signals/intersections and left turns, and the location of known traffic incidents.
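A date range query with the optional day-of-week and route parameters could be expressed as a MongoDB filter document along the following lines. This is a sketch, not the dashboard's actual query: the field names `hour_start`, `weekday`, and `route_id` are hypothetical, and the weekday numbering (Monday = 0) is an assumption borrowed from Python's `datetime.weekday()`. The operators `$gte`, `$lt`, and `$in` are standard MongoDB query operators.

```python
from datetime import datetime


def build_filter(start, end, weekdays=None, route_id=None):
    """Build a MongoDB-style filter for a date range with optional parameters."""
    query = {"hour_start": {"$gte": start, "$lt": end}}
    if weekdays:
        query["weekday"] = {"$in": sorted(weekdays)}
    if route_id:
        query["route_id"] = route_id
    return query


# Wednesdays between March 15th and April 15th, 2021, for one route.
wednesday_filter = build_filter(
    datetime(2021, 3, 15),
    datetime(2021, 4, 15),
    weekdays={2},
    route_id="149-1490022",
)
# With PyMongo, the filter could be passed to collection.find(wednesday_filter).
```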

Figure 8 illustrates the path of bus route 149‐1490022 and its on‐time performance during the hour of 6 a.m.–7 a.m. on March 15th, 2021. Poor reliability performance (shown in red) appears to be related to traffic signal intersections and the number of left turns (two) that the bus passes through. Traffic congestion is another likely cause of the poor performance, as the bus path goes through several traffic signal intersections that connect to highway entrance and exit ramps. Although the workflow's data harvesting, processing, and visualization are continuous, using the dashboard to query the data for particular timeframes and along particular routes demonstrates its potential to help planners identify and diagnose reliability issues across the PT network.

The dashboard will be further enhanced to show crosstabulation of descriptive analytics as requested by the user. This will allow transit planners to visualize PT reliability patterns across the network at different temporal and spatial scales, in order to identify particular problem areas. These data visualizations are currently generated outside the dashboard and include how consistently reliable the PT network is over time (Figure 9), seasonal variations in reliability (Figures 10 and 11), and which routes are performing the best (Figure 12) and worst (Figure 13) over time. Examples of these data visualization outputs are presented here for data collected during the study period reported in 2021.

Figure 9 shows that for the week of March 15th through the 22nd, 2021, the bus system functioned better (i.e., more than 80% of buses arriving on‐time) in the morning extended rush hour period (6–10 a.m.) than in the mid‐afternoon (2–4 p.m.) time period. This pattern is consistent across all days, and is likely attributable to localized traffic conditions (e.g., congestion, commute behaviors).

Figure 10 shows seasonal variations in on‐time performance of the transit network from March to August 2021. Interestingly, for a city with abrupt and significant weather variations, there are no apparent patterns in performance, other than a general service degradation in reliability during the weeks of June 14–18th and August 16–20th. While there were no specific local events during these service degradation periods, such as the annual Calgary Stampede, they do broadly coincide with changes in local COVID‐19 lockdown procedures (CBC News, 2021; CTV News Calgary, 2021). Transportation and PT data during the COVID‐19 pandemic have been anomalous (Statistics Canada, 2021), and this may account for the Calgary Transit service degradations seen in the summer of 2021.

Figure 11 shows the overall average of performance on an hourly basis on each day of the week from March to August 2021. Performance patterns are consistent and show that the transit network is less reliable on weekdays in the afternoon, particularly on Wednesdays and Fridays. This suggests that transit planning efforts might be productively focused on improving service during these time periods. The cause of the poor reliability performance for both weekdays, especially Fridays, is likely due to commuter behavior and resulting congestion (Maciag, 2012). Possible interventions to lessen these types of temporal service disruptions include transit scheduling techniques (e.g., increase the frequency of bus services temporarily), implementing signal preemption for PT vehicles, temporarily adjusting routes (e.g., no left turns, fewer stops), and/or implementing disincentives for the use of private vehicles (e.g., congestion charge, parking costs) (Boston Region Metropolitan Planning Organization, 2018; State of Florida Department of Transportation, 2008).

Figures 9 through 11 focus on temporal variation in the on‐time performance of buses for the entire transit network in Calgary. More specifically, Figures 12 and 13 limit the focus to the 20 best and 20 worst performing transit routes, respectively, including their overall average on‐time performance as well as the performance across their daily operating hours. Both figures depict the week of March 15th–22nd, 2021 as an example.

In both figures, there is significant variance in performance across routes and over the operating hours. In some cases, even the overall best performing routes (Figure 12) demonstrate periods of low performance reliability. For instance, bus route 44‐440015 has perfect on‐time performance between 6 and 9 a.m., but within the hour starting at 10 a.m. the performance is closer to 0%. Similar anomalies appear with the worst overall performing routes, where some reveal instances of high reliability, such as bus routes 78‐780066 and 406‐4060054. In one case, two buses that operate on the same route but in opposite directions show markedly different variances. This can be seen with bus routes 21‐210021 (inbound, worst) and 55‐550021 (outbound, better) between operating hours 7 and 10 a.m. (Figure 13). A plausible explanation could be the direction of the high demand traffic during morning hours.

From a transit planning perspective, almost all of the bus routes in Figures 12 and 13 show consistent reliability over time. This provides planners with the opportunity to investigate why some routes are consistently more reliable than others across both time and space. Possible explanations can be derived from the outputs of the existing workflow including length and sinuosity of the routes, average distance between consecutive bus stops, actual headway, and average travel speed.

Other possible explanations, such as weather, traffic incidents, number of stop signs, traffic signals, and left turns can be extracted and analyzed from publicly available data sources. An illustration of the impact of these factors can be seen in Figure 8. It displays the third least reliable route (bus route 149‐1490022) with nearly every operating hour in the red. Based on the map, it is clear that the path (west to northeast) of the bus route passes through three major traffic intersections and has to make three left turns. A predictive model, such as a random forest regression, will be implemented in a future iteration of the dashboard to validate the impact of the factors identified above. The predictive outputs can then be used to inform the route planning and implementation practice within a transit agency. In particular, the results can be used to adjust the schedule to align expectations and improve reliability. Such adjustments can be implemented in a transit app to inform potential riders of the per stop and transit route historic probability of the bus arriving late (or early) from the scheduled time, the new expected time to arrive, and the future probability in near‐real time of the bus being late/early.

The interactive dashboard (Figures 7 and 8) and the accompanying summary graphics (Figures 9–13) discussed in this section demonstrate the potential of the developed workflow to generate meaningful insights to inform transit planners. Collectively, the use of parallel processing and the spatial and data visualizations parse large volumes of complex data to create accurate metrics about the reliability performance of the PT service from the system level, down to individual transit stops. Future work will build upon and extend the visualization capabilities and dashboard described in this article to include additional metrics, and the aforementioned predictive analytic capabilities.

CONCLUSIONS

The primary goal of the research described in this article was to develop a method for harvesting high‐volume, high‐frequency GTFS‐RT data, and processing it into reliability metrics to assist PT agencies to monitor and improve their transit systems. From a technological perspective, an important sub‐goal of the work was to develop a workflow to harvest and process these data quickly and efficiently using emerging cloud computing and parallel processing approaches.

When working with high‐frequency and high‐volume data, there is always a challenge to make sense out of the complexity, so that findings can be understood and applied to both system‐wide and route‐specific improvements. There are two primary ways to reduce the complexity of the data so that planners can more effectively use it in their decision‐making. The first is visualization of the data, using spatial and data visualizations to reveal patterns and relationships. Ideally, the visualization will include a form of interactivity such that the patterns and possible relationships are revealed through active interrogation of the visualization. This article presented the development of an interactive Web‐based dashboard for spatial visualization of the transportation system from the system level down to the individual stop level, combined with data visualizations of metrics of reliability. Development and improvement of this dashboard will continue in future research.

The second method for reducing data complexity is through statistical approaches for reducing dimensionality. While this article focuses on descriptive visualizations of reliability trends, the use of parametric techniques, including exploratory, predictive, and AI‐based modeling, presents a compelling avenue for future research, especially since the parallel processing workflow outputs described herein should facilitate their development.

One of the ancillary goals of this research was to make the developed tools openly available, so that transit agencies and other researchers can adapt and extend the workflow to meet their own needs. To that end, the workflow is documented, and all of the associated code is freely available at: https://github.com/highered‐esricanada/Parallel_GTFS_Workflow.

One clear trajectory for future research and development is to extend the workflow to include more of the reliability metrics identified by Currie et al. (2012) and described in Table 1. This would both expand the breadth of the available metrics for assessing the PT service network and, potentially, improve the accuracy of the existing on‐time performance metric. Additionally, the availability of data sources beyond AVL, such as ridership or rider survey data and associated network traffic data, could significantly improve the diagnostic capabilities of the workflow.

Another avenue for further research is to incorporate data on external factors, such as weather and events, and internal factors such as route characteristics (e.g., sinuosity, length) to correlate reliability issues with their possible causes. This would allow for the development of predictive analytics to anticipate reliability problems (including potential problems for new routes) so that they can be addressed proactively. As an extension of the predictive analytics, a transit app could be implemented in the future with the ability to display the original expected time schedule and its historic probability of a late/early arrival compared with the current (i.e., adjusted) expected time with the near real‐time probability of being late/early based on current conditions.

Overall, the parallel processing workflow presented in this article represents an important contribution of this research to the literature. It provides a significant advancement in the processing and management of high‐volume, high‐frequency GTFS‐RT data. The workflow can facilitate the future research directions identified above and provide other researchers and transit agencies with an effective basis for developing their own methods for evaluating transit systems.

ACKNOWLEDGMENTS

The authors wish to thank Dr. Steven Farber and two anonymous reviewers for strengthening this research.

CONFLICT OF INTEREST

All authors declare that they have no conflicts of interest.

DATA AVAILABILITY STATEMENT

The data that support the findings of this study are openly available in Parallel‐GTFS‐Workflow at https://github.com/highered‐esricanada/Parallel‐GTFS‐Workflow.

REFERENCES

Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the Spring Joint Computer Conference, Atlantic City, NJ (pp. 483 – 485). New York, NY : ACM.

Anagnostopoulos, C. N. E., Anagnostopoulos, I. E., Loumos, V., & Kayafas, E. (2006). A license plate‐recognition algorithm for intelligent transportation system applications. IEEE Transactions on Intelligent Transportation Systems, 7 (3), 377 – 392. https://doi.org/10.1109/TITS.2006.880641

Arriagada, J., Gschwender, A., Munizaga, M. A., & Trépanier, M. (2019). Modeling bus bunching using massive location and fare collection data. Journal of Intelligent Transportation Systems, 23 (4), 332 – 344. https://doi.org/10.1080/15472450.2018.1494596

Bast, H., Brosi, P., & Storandt, S. (2014). Real‐time movement visualization of public transit data. In Proceedings of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Dallas, TX (pp. 331 – 340). New York, NY : ACM. https://doi.org/10.1145/2666310.2666404

Battarra, R., Gargiulo, C., Tremiterra, M., & Zucaro, F. (2018). Smart mobility in Italian metropolitan cities: A comparative analysis through indicators and actions. Sustainable Cities and Society, 41, 556 – 567. https://doi.org/10.1016/j.scs.2018.06.006

Boston Region Metropolitan Planning Organization. (2018). Transit signal priority in the Boston region: A guidebook. Retrieved from https://www.ctps.org/data/calendar/pdfs/2019/TSP‐Guidebook.pdf

Bueti, C., & Faulkner, D. (2013). ICTs as a key technology to help countries adapt to the effects of climate change. Washington, DC : World Resources Institute.

Cats, O., & Jenelius, E. (2014). Dynamic vulnerability analysis of public transport networks: Mitigation effects of real‐time information. Networks and Spatial Economics, 14 (3), 435 – 463. https://doi.org/10.1007/s11067‐014‐9237‐7

CBC News. (2021). Here are the latest COVID‐19 statistics for Alberta and what they mean. Retrieved from https://www.cbc.ca/news/canada/calgary/alberta‐covid‐19‐data‐statistics‐numbers‐cases‐hospitalizations‐1.5514947

Chakrabarti, S., & Giuliano, G. (2015). Does service reliability determine transit patronage? Insights from the Los Angeles Metro bus system. Transport Policy, 42, 12 – 20. https://doi.org/10.1016/j.tranpol.2015.04.006

Cohen‐Blankshtain, G., & Rotem‐Mindali, O. (2016). Key research themes on ICT and sustainable urban mobility. International Journal of Sustainable Transportation, 10 (1), 9 – 17. https://doi.org/10.1080/15568318.2013.820994

Cramer, A., Cucarese, J., Tran, M., Lu, A., & Reddy, A. (2009). Performance measurements on mass transit: Case study of New York City transit authority. Transportation Research Record, 2111 (1), 125 – 138. https://doi.org/10.3141/2111‐15

CTV News Calgary. (2021). Potentially heavy showers on the way for Calgary. Retrieved from https://calgary.ctvnews.ca/potentially‐heavy‐showers‐on‐the‐way‐for‐calgary‐1.5548638

Currie, G., Douglas, N. J., & Kearns, I. (2011). An assessment of alternative bus reliability indicators. In Australasian Transport Research Forum 2012 Proceedings, Perth, Australia (pp. 1 – 20).

Desaulniers, G., & Hickman, M. (2007). Public transit. In C. Barnhart & G. Laporte (Eds.), Handbooks in operations research and management science (Vol. 14, pp. 69 – 127). Amsterdam : Elsevier.

Figliozzi, M. A., Feng, W., Lafferriere, G., & Feng, W. (2012). A study of headway maintenance for bus routes: Causes and effects of 'bus bunching' in extensive and congested service areas. Portland, OR : Transportation Research Education Center. Retrieved from https://rosap.ntl.bts.gov/view/dot/24701

Google. (2021). Google transit APIs. Retrieved from https://developers.google.com/transit/gtfs‐realtime/reference

Gough, W. A. (2008). Theoretical considerations of day‐to‐day temperature variability applied to Toronto and Calgary, Canada data. Theoretical and Applied Climatology, 94, 97 – 105. https://doi.org/10.1007/s00704‐007‐0346‐9

Harsha, M. M., Mulangi, R., & Kumar, D. (2019). Analysis of bus travel time variability using automatic vehicle location data. Transportation Research Procedia, 48, 3283 – 3298. https://doi.org/10.1016/j.trpro.2020.08.123

Hasan, Y., & Barker, D. (1999). The impact of unseasonable or extreme weather on traffic activity within Lothian region, Scotland. Journal of Transport Geography, 7, 209 – 213. https://doi.org/10.1016/S0966‐6923(98)00047‐7

Hu, W., & Shalaby, A. (2017). Use of automated vehicle location data for route‐ and segment‐level analyses of bus route reliability and speed. Transportation Research Record, 2649, 9 – 19. https://doi.org/10.3141/2649‐02

Kunama, N., Worapan, M., Phithakkitnukoon, S., & Demissie, M. (2017). GTFS‐Viz: Tool for preprocessing and visualizing GTFS data. In Proceedings of the 2017 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2017 ACM International Symposium on Wearable Computers, Maui, HI (pp. 388 – 396). New York, NY : ACM. https://doi.org/10.1145/3123024.3124415

Levinson, H. (1991). Supervision strategies for improved reliability of bus routes. Washington, DC : Transportation Research Board.

Luo, Q., Gee, M., Piccoli, B., Work, D., & Samaranayake, S. (2022). Managing public transit during a pandemic: The trade‐off between safety and mobility. Transportation Research Part C: Emerging Technologies, 138, 103592. https://doi.org/10.1016/j.trc.2022.103592

Maciag, M. (2012). The best days to commute in metro areas. Retrieved from https://www.governing.com/archive/best‐days‐to‐commute‐drive‐metro‐areas.html

Mazloumi, E., Currie, G., & Rose, G. (2009). Using GPS data to gain insight into public transport travel time variability. Journal of Transportation Engineering, 136 (7), 623 – 631. https://doi.org/10.1061/(ASCE)TE.1943‐5436.0000126

Mesbah, M., Lin, J., & Currie, G. (2015). "Weather" transit is reliable? Using AVL data to explore tram performance in Melbourne, Australia. Journal of Traffic and Transportation Engineering, 2 (3), 125 – 135. https://doi.org/10.1016/j.tte.2015.03.001

Morfoulaki, M., Myrovali, G., & Kotoula, K. (2015). Increasing the attractiveness of public transport by investing in soft ICT based measures: Going from words to actions under an austerity backdrop—Thessaloniki's case, Greece. Research in Transportation Economics, 51, 40 – 48. https://doi.org/10.1016/j.retrec.2015.07.006

OECD. (2021, February 22). COVID‐19 and a new resilient infrastructure landscape. Paris, France : Organisation for Economic Co‐operation and Development. Retrieved from https://read.oecd‐ilibrary.org/view/?ref=1060_1060483‐4roq9lf7eu&title=COVID‐19‐and‐a‐new‐resilient‐infrastructure‐landscape

Pan American Health Organization. (2016). Maintenance of essential services. Retrieved from https://www.paho.org/disasters/dmdocuments/RespToolKit%5f24%5fTool%2016%5fMaintenanceofEssentialServices.pdf

Pee, L. G., & Pan, S. (2021). Climate‐intelligent cities and resilient urbanisation: Challenges and opportunities for information research. International Journal of Information Management, 63, 102446. https://doi.org/10.1016/j.ijinfomgt.2021.102446

Perk, V., Flynn, J., & Volinski, J. (2008). Transit ridership, reliability, and retention (Publication NCTR‐776‐07). Tallahassee, FL : Florida Department of Transportation.

Politis, I., Papaioannou, P., Basbas, S., & Dimitriadis, N. (2010). Evaluation of a bus passenger information system from the users' point of view in the city of Thessaloniki, Greece. Research in Transportation Economics, 29, 249 – 255. https://doi.org/10.1016/j.retrec.2010.07.031

Polus, A. (1978). Modeling and measurements of bus service reliability. Transportation Research, 12 (4), 253 – 256. https://doi.org/10.1016/0041‐1647(78)90067‐9

Prommaharaj, P., Phithakkitnukoon, S., Demissie, M., Kattan, L., & Ratti, C. (2020). Visualizing public transit system operation with GTFS data: A case study of Calgary, Canada. Heliyon, 6 (4), e03729. https://doi.org/10.1016/j.heliyon.2020.e03729

Sneader, K., & Lund, S. (2020). COVID‐19 and climate change expose dangers of unstable supply chains. Los Angeles, CA : McKinsey Global Institute.

Snellen, D., & Hollander, G. (2017). ICT's change transport and mobility: Mind the policy gap! Transportation Research Procedia, 26, 3 – 12. https://doi.org/10.1016/j.trpro.2017.07.003

State of Florida Department of Transportation. (2008). Transit ridership, reliability, and retention. Retrieved from https://www.nctr.usf.edu/pdf/77607.pdf

Statistics Canada. (2021). Public transit in a post‐COVID‐19 Canada. Retrieved from https://www150.statcan.gc.ca/n1/en/pub/45‐28‐0001/2021001/article/00030‐eng.pdf?st=duthQO3l

Stewart, C., Diab, E., Bertini, R., & El‐Geneidy, A. (2016). Perspectives on transit: Potential benefits of visualizing transit data. Transportation Research Record, 2544 (1), 90 – 101. https://doi.org/10.3141/2544‐11

Strathman, J., & Hopper, J. (1993). Empirical analysis of bus transit on‐time performance. Transportation Research Part A—Policy and Practice, 27, 93 – 100. https://doi.org/10.1016/0965‐8564(93)90065‐S

Tirachini, A., Godachevich, J., Cats, O., Muñoz, J., & Soza‐Parra, J. (2021). Headway variability in public transport: A review of metrics, determinants, effects for quality of service and control strategies. Transport Reviews, 1 – 25. Epub ahead of print. https://doi.org/10.1080/01441647.2021.1977415

United Nations. (2015). Information and communication technology for urban climate action. Retrieved from https://unhabitat.org/sites/default/files/download‐manager‐files/Information%20and%20Communication%20Technology%20for%20Urban%20Climate%20Action.pdf

Wang, L., Xue, X., Zhao, Z., & Wang, Z. (2018). The impacts of transportation infrastructure on sustainable development: Emerging trends and challenges. International Journal of Environmental Research and Public Health, 15 (6), 1172. https://doi.org/10.3390/ijerph15061172

Wessel, N., Allen, J., & Farber, S. (2017). Constructing a routable retrospective transit timetable from a real‐time vehicle location feed and GTFS. Journal of Transport Geography, 62, 92 – 97. https://doi.org/10.1016/j.jtrangeo.2017.04.012

Wessel, N., & Farber, S. (2019). On the accuracy of schedule‐based GTFS for measuring accessibility. The Journal of Transport and Land Use, 12 (1), 475 – 500. https://doi.org/10.5198/jtlu.2019.1502

Wood, D. (2015). A framework for measuring passenger‐experienced transit reliability using automated data. Department of Civil and Environmental Engineering, MIT. Retrieved from https://dspace.mit.edu/bitstream/handle/1721.1/99539/924822213‐MIT.pdf;sequence=1

Zhao, Y. (1997). Vehicle location and navigation systems. Norwood, MA : Artech House.

Zhou, M., Wang, D., Li, Q., Yue, Y., Tu, W., & Cao, R. (2017). Impacts of weather on public transport ridership: Results from mining data from different sources. Transportation Research Part C: Emerging Technologies, 75, 17 – 29. https://doi.org/10.1016/j.trc.2016.12.001

By Anastassios Dardas; Brent Hall; Jon Salter and Hossein Hosseini
