knitr::opts_chunk$set(echo = FALSE)
library(sf)
## Warning: package 'sf' was built under R version 4.4.2
## Linking to GEOS 3.12.2, GDAL 3.9.3, PROJ 9.4.1; sf_use_s2() is TRUE
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
pgh_map <- read_sf("pgh-neighborhoods.geojson")
allegheny_county_map <- read_sf("allegheny-county.geojson")
prt_2019_2021_stops <- read_csv("pgh-stops-2019-2021.csv")
## Rows: 6829 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (14): StopID, StopName, Timepoint, Direction, Routes_ser, Mode, Shelter,...
## dbl (12): X, Y, FID, CleverID, NumberofRo, Latitude, Longitude, Ons_AvgWkd, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(prt_2019_2021_stops)[5] <- "stop_id"
prt_2024_stops <- read_csv("pgh-stops-2024.csv")
## Rows: 6586 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): stop_id, stop_name, direction, routes, Stop_type
## dbl (16): FID, stop_code, stop_lat, stop_lon, num_routes, fy24_avg_o, fy24_a...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(prt_2024_stops)[c(8, 9)] <- c("num_routes_2024", "routes_2024")
prt_performance <- read_csv("pgh-transit-performance.csv")
## Rows: 22020 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): route, ridership_route_code, route_full_name, current_garage, mode...
## dbl  (3): _id, year_month, on_time_percent
## date (1): month_start
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Introduction and Motivation

Pittsburgh Regional Transit (PRT) manages bus, light rail, and incline public transit services for Allegheny County (the Pittsburgh metro area). As Carnegie Mellon University students who rely on public transit to get around, we find it useful to run a statistical health check on Pittsburgh’s transit system, since it affects our daily lives. By analyzing different aspects of transit use, such as ridership, on-time performance, and coverage, we can better understand how PRT is doing as a whole, which matters both for making good use of transit and for giving constructive feedback to PRT. We focus on stops within Pittsburgh, since this is where most of PRT operates.

Data Description

Source + Variables

For this project, we drew from three sources of data. All of these datasets are from Pittsburgh Regional Transit.

Stop Info 2019 + 2021

https://open-data-pgh-transit.hub.arcgis.com/maps/5e52b474e96c42f1a049be65de19fe93

This dataset describes ridership of Pittsburgh’s public transportation system in both 2021 and 2019 (pre-pandemic). Each row describes a stop. The columns contain identifying information for each stop, geographic information, information about the type of stop, and information about ridership at the stop for both time periods. The following variables are relevant to our analyses:

  • Identifying information:
    • Name (Categorical, nominal)
    • Routes served (Categorical, nominal)
      • Stored as a comma-separated list of public transit routes that go through this stop
    • Number of routes served (Quantitative, discrete)
  • Geographic information:
    • Longitude (Quantitative, continuous)
    • Latitude (Quantitative, continuous)
    • Neighborhood (Categorical, nominal)
  • Stop type information:
    • Direction (Categorical, nominal)
      • Can take on the following values: inbound, outbound, or both.
    • Stop type (Categorical, nominal)
      • Can take on the following values:
        • Bus Stop, Bus Stop with Non-PAAC Shelter, Bus Stop with PAAC Shelter, Light Rail Station, Busway Station, Light Rail Stop with PAAC Shelter, Busway Stop with PAAC Shelter, Incline Station, Busway Stop, Light Rail Stop
  • Ridership Information:
    • Average Weekday Ons (Quantitative, discrete)
    • Average Weekday Offs (Quantitative, discrete)
      • There’s a column for average weekday ons/offs for the 2019 calendar year and another for the 2021 calendar year.

Stop Info 2024

https://open-data-pgh-transit.hub.arcgis.com/datasets/523d7d2f4aa64b92b95ccf9d94edc2cb_0/explore

This dataset contains the same information as the 2019-2021 dataset, but for 2024 through October.

On Time Performance

https://data.wprdc.org/dataset/port-authority-monthly-average-on-time-performance-by-route

This dataset contains the monthly on-time percentage for transit routes, each month from July 2017 to September 2024. It includes identifying information about each route, data related to the time period of the collected data, and data about the on-time performance of the routes. There are separate rows for on-time performance on Saturdays, Sundays, and weekdays. The following variables are relevant to our analyses:

  • Identifying Information:
    • Route (Categorical, nominal)
      • The name of the route (e.g. 61A)
  • Time Data:
    • Month_start (Categorical, ordinal)
      • The start date of the month over which the average on-time percent was calculated, in YYYY-MM-DD format.
    • Day_type (Categorical, nominal)
      • The type of day being tracked: Saturdays only, Sundays only, or weekdays only.
  • Performance Data:
    • On Time Percent (Quantitative, continuous)
      • The percentage of the time within the month-long period that vehicles on this route departed on time from their initial stop or station.

EDA

Route On Time Percentage Distribution In 2024

The distribution of on-time percentages for PRT transit routes is relatively symmetric and approximately normal. The average on-time percentage is 70%, and most values fall between 60% and 80%, indicating that vehicles typically reach their stops on time, but that lateness is also fairly common.
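
A summary like this could be computed roughly as follows. This is a minimal sketch, not the code behind the figure: `toy_performance` is a made-up stand-in for `prt_performance` with invented values, and it assumes `on_time_percent` is stored as a proportion between 0 and 1.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for prt_performance; all values are invented for illustration.
toy_performance <- tibble(
  route = c("61A", "61A", "71B", "71B", "P1", "P1"),
  month_start = as.Date(c("2024-01-01", "2023-12-01", "2024-01-01",
                          "2024-02-01", "2024-01-01", "2024-02-01")),
  on_time_percent = c(0.62, 0.70, 0.68, 0.74, 0.81, 0.79)
)

# Keep only 2024 rows, then average the on-time percentage per route.
route_2024 <- toy_performance |>
  filter(format(month_start, "%Y") == "2024") |>
  group_by(route) |>
  summarise(ontime_avg_2024 = mean(on_time_percent), .groups = "drop")

# Mean of the route-level averages.
overall_mean <- mean(route_2024$ontime_avg_2024)

# Histogram of route-level averages (the bin width is an arbitrary choice).
ontime_hist <- ggplot(route_2024, aes(x = ontime_avg_2024)) +
  geom_histogram(binwidth = 0.05) +
  labs(x = "Average on-time percentage (2024)", y = "Number of routes")
```

Averaging per route first gives every route equal weight, regardless of how many monthly rows it has; averaging all rows directly would instead weight routes by how many months they were reported.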

Allegheny County Transit Stops Map

This map shows where all the Pittsburgh Regional Transit stops for buses, light rail, and inclines are located in Allegheny County. There is a dense cluster of stops within Pittsburgh, along with several long, distinct routes that extend to the edges of Allegheny County and connect them to the center of Pittsburgh. Some lines of stops also roughly follow the banks of the three rivers.

Pittsburgh Transit Stops Map

This is a map of PRT stops within Pittsburgh only. Stops are densest downtown, and the rest appear to follow major roads that lead into downtown. The stops are unevenly distributed: some neighborhoods appear to have very few stops despite being very large.

Research Questions

How well are different neighborhoods served (on time-ness + # of routes)?

In general, ridership (the ons plus offs at a stop) appears to be inversely related to on-time averages: the parts of Pittsburgh with many large points also tend to have colors closer to the lower end of the on-time scale, while neighborhoods with small points have many more orange dots. However, this isn’t strictly the case; much of east Pittsburgh simply appears to have worse bus performance than the west and south parts. Much of the lateness looks associated with entering downtown, since many transit lines on the east side connect to downtown and then branch outwards, but the picture isn’t entirely clear, as downtown also has a number of stops with high on-time averages.

The number of routes per neighborhood is fairly skewed across Pittsburgh: many neighborhoods, even some geographically close to downtown, have few routes passing through them, while a few neighborhoods contain many different routes. The number of routes is not necessarily related to distance from downtown, as some neighborhoods directly adjacent to downtown have very few routes, while neighborhoods like Oakland, Shadyside, and Mount Washington are all served by many. Viewed in the context of the previous map, route count is also not completely explained by the number of bus users, as places like Squirrel Hill and South Side have many bus users but not nearly as many stops. Overall, bus service across neighborhoods is less intuitive than one might think.
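
Because routes are stored as a comma-separated list per stop, counting distinct routes per neighborhood requires splitting that list first. The sketch below shows one way this could be done. It uses an invented `toy_stops` table, and it assumes the column names `neighborhood` and `routes_2024` and a comma-plus-optional-space delimiter.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the stop table; stops and route lists are invented.
toy_stops <- tibble(
  stop_id = c("S1", "S2", "S3", "S4"),
  neighborhood = c("Oakland", "Oakland", "Brookline", "Brookline"),
  routes_2024 = c("61A, 61B, 71B", "61A, 75", "39", "39, 41")
)

routes_per_neighborhood <- toy_stops |>
  # One row per (stop, route) pair.
  separate_rows(routes_2024, sep = ",\\s*") |>
  # A route serving several stops in a neighborhood is counted once.
  distinct(neighborhood, routes_2024) |>
  count(neighborhood, name = "n_routes")
```

Summing the per-stop route counts instead would double-count any route that serves more than one stop in the same neighborhood, which is why the sketch deduplicates before counting.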

Conclusions

Deciding where to live based on public transit accessibility is not simply a matter of measuring distance from the hubs of the city. Pittsburgh has uneven transit accessibility, which is unsurprising given that population is also uneven across the city, but it isn’t necessarily clear from a bare map which parts of Pittsburgh are well served. With respect to route quantity and on-time performance, the best-served neighborhoods appear to be Mount Washington and Brookline: both are served by a fair number of routes (10-20), which makes it easy to get to other places, and the bus routes within them are on time much more often than in other neighborhoods with a similar number of route options.

What factors relate to bus on time-ness (e.g. number of stops)?

## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
## corrplot 0.95 loaded

To evaluate what factors may contribute to the timeliness of buses within PRT, we fit a linear regression model and run an analysis of variance (ANOVA) test. To do so we must first choose variables. Among the categorical variables we evaluate neighborhood, stop type, and direction, and we ignore shelter since it is highly correlated with stop type. For the quantitative variables we created a heat map of correlations between predictors and removed most of the overly correlated ones (almost all of them). The only difficult decision was choosing between offs and ons from 2024, as they were heavily correlated with each other (and moderately with the number of routes). We chose to keep ons because we better understand how they are tabulated (card swipes upon entering the bus), whereas we do not know how offs were counted. The second heat map shows the remaining moderate correlation, which is not unreasonable, so we now proceed with linear regression:
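
The screening step described above can be sketched with base R’s `cor()`. Everything here is illustrative: the data are simulated, the column name `offs_weekday_avg_2024` is a hypothetical counterpart to `ons_weekday_avg_2024`, and the 0.8 cutoff is an arbitrary choice rather than the threshold used in this analysis.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the quantitative stop-level predictors.
ons <- rpois(n, lambda = 40)
toy_quant <- data.frame(
  ons_weekday_avg_2024  = ons,
  offs_weekday_avg_2024 = ons + rnorm(n, sd = 3),  # built to track ons closely
  num_routes_2024       = rpois(n, lambda = 3)     # independent of the others
)

# Pairwise Pearson correlations, tolerating missing values.
cor_mat <- cor(toy_quant, use = "pairwise.complete.obs")

# Predictor pairs whose absolute correlation exceeds the cutoff
# (upper triangle only, so each pair is listed once).
high_pairs <- which(abs(cor_mat) > 0.8 & upper.tri(cor_mat), arr.ind = TRUE)
```

Each row of `high_pairs` gives the row and column index of one highly correlated pair, and only one variable from each such pair would then be kept in the model.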

## Analysis of Variance Table
## 
## Response: ontime_avg_2024
##                        Df  Sum Sq  Mean Sq F value    Pr(>F)    
## direction               2  0.0479 0.023963 10.3356 3.386e-05 ***
## type                    2  0.0592 0.029584 12.7599 3.065e-06 ***
## neighborhood           87 10.3868 0.119388 51.4941 < 2.2e-16 ***
## num_routes_2024         1  0.0080 0.007983  3.4434   0.06362 .  
## ons_weekday_avg_2024    1  0.0065 0.006529  2.8160   0.09345 .  
## Residuals            2514  5.8287 0.002318                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA test suggests that direction, type, and neighborhood each explain a significant share of the variance in on-time percentage, while the quantitative predictors we selected (2024 number of routes and weekday bus ons) were not significant at the 0.05 level. Next, we plot the estimates and 95% confidence intervals for each factor level of the significant inputs to look for any interesting effects:

Looking at neighborhoods, we can see a very wide spread, which is to be expected with 88 different neighborhoods in the model (87 degrees of freedom in the ANOVA table). The base level of the factor (to which the rest of the estimates are compared) is Allegheny Center, since it comes first alphabetically. A vast majority of neighborhoods differ significantly from it in on-time average (0 is not in their 95% intervals), with many both above and below Allegheny Center. For type, the base level is regular bus stops (over 80% of stops). Stops with PAAC shelters are significantly different, with a higher estimated on-time average, while stops with non-PAAC shelters have an estimate slightly below the base level that narrowly misses significance, as 0 falls within its confidence interval. Finally, 0 is within both confidence intervals for direction, so it is interesting that the ANOVA reported direction as significant. One possible explanation is that R’s `anova()` reports sequential tests: direction is entered first and tested before adjusting for the other predictors, whereas the coefficient intervals are adjusted for every other term in the model.
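
The difference between the ANOVA result and the coefficient intervals can be checked by comparing sequential and marginal tests. The sketch below uses simulated data with made-up factor levels and effect sizes. It assumes the table above came from `anova()` on an `lm` fit, which the output format suggests but the source does not confirm.

```r
set.seed(42)
n <- 300

# Simulated stand-in for the stop-level modelling data.
toy <- data.frame(
  direction = factor(sample(c("Both", "Inbound", "Outbound"), n, replace = TRUE)),
  type = factor(sample(c("Bus Stop", "PAAC Shelter", "Non-PAAC Shelter"),
                       n, replace = TRUE)),
  num_routes_2024 = rpois(n, lambda = 3)
)
# Invented response: a small bump for PAAC shelters plus noise.
toy$ontime_avg_2024 <- 0.70 + 0.02 * (toy$type == "PAAC Shelter") +
  rnorm(n, sd = 0.05)

fit <- lm(ontime_avg_2024 ~ direction + type + num_routes_2024, data = toy)

# Sequential (Type I) tests: each term is tested after only the terms before it.
seq_tests <- anova(fit)

# Marginal tests: each term is tested after all other terms in the model.
marg_tests <- drop1(fit, test = "F")

# 95% intervals for individual coefficients, each relative to the base level.
coef_ci <- confint(fit, level = 0.95)
```

The two tables agree only for the last term in the formula. For earlier terms such as `direction`, a gap between `seq_tests` and `marg_tests` would indicate that the sequential result depends on the order in which terms were entered. The F test is also a joint test across all levels of a factor, so it need not match the per-level intervals in `coef_ci`.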

Conclusions

As previously mentioned, we again see that Pittsburgh has uneven transit accessibility. This analysis suggests that the best neighborhoods for timeliness are generally Allegheny West, Brookline, and Chateau, while the worst are Regent Square and Bluff. Stops with PAAC shelters also look to be better for timeliness, while stops with non-PAAC shelters look worse (although not significantly so). Finally, whether a stop serves inbound, outbound, or both kinds of routes does not seem to have much of an impact, although inbound-only stops seem to have a slight negative effect. Overall, travelers in Pittsburgh may want to plan ahead before any important bus trip, looking primarily at the neighborhood where they plan to catch the bus. If they need to be somewhere on a tight schedule, they may want to leave earlier or later depending on how often buses in that neighborhood arrive late.

How has ridership changed during and after the pandemic, in comparison to pre-pandemic ridership?

## Warning: There were 3 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `change = as.numeric(ons_weekday_avg_2021) -
##   as.numeric(ons_weekday_avg_2019)`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 2 remaining warnings.

To answer the question of “How has ridership changed pre versus post pandemic?”, we looked at the average change in embarkation (i.e. people getting on the bus) in each neighborhood over two periods: 2019-2021 (pre-pandemic to pandemic) and 2021-2024 (pandemic to post-pandemic). The fill color shows the average change in people getting on at stops, from 2019 to 2021 in the first graph and from 2021 to 2024 in the second. Going into the pandemic, ridership mostly decreased. By contrast, from 2021 to 2024 the average change is smaller, and neighborhoods vary more in whether their ridership increased or decreased.
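
The per-neighborhood change could be computed along these lines. This is a sketch on an invented `toy_ridership` table. It assumes the ridership columns are read in as text (consistent with the coercion warning above), and `n_missing` is a hypothetical helper column added to make the dropped values visible.

```r
library(dplyr)

# Toy stand-in for the joined stop table; values are invented, and "n/a"
# mimics a non-numeric entry that as.numeric() turns into NA.
toy_ridership <- tibble(
  neighborhood = c("Central Business District", "Central Business District",
                   "Shadyside", "Shadyside"),
  ons_weekday_avg_2019 = c("1200", "800", "150", "n/a"),
  ons_weekday_avg_2021 = c("400", "300", "120", "90")
)

change_by_neighborhood <- toy_ridership |>
  mutate(
    # Coercion is expected to produce NAs here, so the warning is silenced.
    ons_2019 = suppressWarnings(as.numeric(ons_weekday_avg_2019)),
    ons_2021 = suppressWarnings(as.numeric(ons_weekday_avg_2021)),
    change = ons_2021 - ons_2019
  ) |>
  group_by(neighborhood) |>
  summarise(
    avg_change = mean(change, na.rm = TRUE),  # average over stops with data
    n_missing = sum(is.na(change)),           # stops dropped from the average
    .groups = "drop"
  )
```

Reporting `n_missing` alongside the average makes it clear when a neighborhood’s value rests on only a few stops.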

It seems that the beginning of the pandemic sharply decreased ridership into downtown (the Central Business District), while most other places stayed roughly the same or decreased. After the pandemic (from 2021 to 2024), ridership into downtown increased, along with ridership in Shadyside, Oakland, the North Shore, the South Shore, Chateau, and Squirrel Hill, while ridership into neighborhoods like Beechwood decreased. In particular, the neighborhoods that saw a decrease in average daily embarkation when the pandemic started and then an increase afterwards were generally places with business districts, like Downtown (Central Business District), Squirrel Hill, Shadyside, and Oakland.

The previous graph showed that, at the neighborhood level, ridership into business districts decreased when the pandemic began and then increased coming out of it. To fully answer the research question, we wanted to see whether the pandemic had any lasting impact on ridership at the most affected stops.

This graph looks specifically at the 50 stops that showed the sharpest decrease in embarkation from 2019 to 2021. Consistent with the choropleth map, a majority of these stops are located in the Central Business District, with multiple stops in Oakland, Shadyside, and Squirrel Hill also represented. In the heatmap, the greatest cluster of points sits around -1000 to -500 for the change from 2019 to 2021, but around -250 to 250 for the change from 2021 to 2024. Since the later gains are much smaller than the earlier losses, current ridership to these business districts is still not as high as it was prior to the pandemic.
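
Selecting the most affected stops and checking whether they recovered can be sketched as below. The `toy_changes` table and its column names are hypothetical, and `top_n_stops` is set to 3 so the toy example stays readable (the analysis above uses 50).

```r
library(dplyr)

# Toy stop-level changes in average weekday ons; all values are invented.
toy_changes <- tibble(
  stop_id = c("A", "B", "C", "D", "E"),
  change_2019_2021 = c(-950, -40, -600, 10, -720),
  change_2021_2024 = c(200, 5, -100, 0, 150)
)

top_n_stops <- 3

most_affected <- toy_changes |>
  # Most negative 2019-2021 change first, keeping exactly top_n_stops rows.
  slice_min(change_2019_2021, n = top_n_stops, with_ties = FALSE) |>
  mutate(
    # The two changes sum to the 2019-to-2024 difference.
    net_change_2019_2024 = change_2019_2021 + change_2021_2024,
    recovered = net_change_2019_2024 >= 0
  )
```

A stop counts as recovered here only if its 2021-2024 gain at least offsets its 2019-2021 loss, which is the same comparison the heatmap makes visually.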

Conclusions

From these graphs, we can see that ridership into business districts like the Central Business District, Oakland, and Shadyside fell drastically during the pandemic and then rose afterwards. Moreover, the stops most affected by the pandemic have yet to recover their pre-pandemic ridership levels. It remains to be seen, however, whether this reflects a change in public transit use specifically or simply fewer people going to business districts in general. To see whether the trend is specific to public transit (and not shared with people who drive cars, for example), we would need another source of data (e.g. where people tended to drive or shop before, during, and after the pandemic). As discussed in the previous section, however, many routes go through downtown, so people embarking at downtown stops could have been heading to many different places or transferring between lines. The lack of recovery in embarkations at these stops could therefore also indicate lasting effects of the pandemic on Pittsburgh’s transit system as a whole.

Future Work

We could look into how crash data relates to public transit, or how the number of intersections along a route relates to its on-time percentage across Pittsburgh. Both would require incorporating more data, but could provide much richer insight into what affects bus traffic.