Philadelphia’s Indego bike share system faces the same operational challenge as every bike share system: rebalancing bikes to meet anticipated demand.
Imagine you’re an Indego operations manager at 6:00 AM on a Monday morning. You have:

- 200 stations across Philadelphia
- Limited trucks and staff for moving bikes
- 2-3 hours before morning rush hour demand peaks
- The question: Which stations will run out of bikes by 8:30 AM?
This lab will teach you to build predictive models that forecast bike share demand across space (different stations) and time (different hours) to help solve this operational problem.
# Core tidyverse
library(tidyverse)
library(lubridate)
# Spatial data
library(sf)
library(tigris)
# Census data
library(tidycensus)
# Weather data
library(riem) # For Philadelphia weather from ASOS stations
# Visualization
library(viridis)
library(gridExtra)
library(knitr)
library(kableExtra)
library(grid)
library(patchwork)
library(ggplot2)
# here!
library(here)
# Get rid of scientific notation. We gotta look good!
options(scipen = 999)

plotTheme <- ggplot2::theme(
plot.title = ggplot2::element_text(size = 14, face = "bold"),
plot.subtitle = ggplot2::element_text(size = 10),
plot.caption = ggplot2::element_text(size = 8),
axis.text.x = ggplot2::element_text(size = 10, angle = 45, hjust = 1),
axis.text.y = ggplot2::element_text(size = 10),
axis.title = ggplot2::element_text(size = 11, face = "bold"),
panel.background = ggplot2::element_blank(),
panel.grid.major = ggplot2::element_line(colour = "#D0D0D0", linewidth = 0.2),
panel.grid.minor = ggplot2::element_blank(),
axis.ticks = ggplot2::element_blank(),
legend.position = "right"
)
mapTheme <- ggplot2::theme(
plot.title = ggplot2::element_text(size = 14, face = "bold"),
plot.subtitle = ggplot2::element_text(size = 10),
plot.caption = ggplot2::element_text(size = 8),
axis.line = ggplot2::element_blank(),
axis.text = ggplot2::element_blank(),
axis.ticks = ggplot2::element_blank(),
axis.title = ggplot2::element_blank(),
panel.background = ggplot2::element_blank(),
panel.border = ggplot2::element_blank(),
panel.grid.major = ggplot2::element_line(colour = 'transparent'),
panel.grid.minor = ggplot2::element_blank(),
legend.position = "right",
plot.margin = ggplot2::margin(1, 1, 1, 1, 'cm'),
legend.key.height = grid::unit(1, "cm"),
legend.key.width = grid::unit(0.2, "cm")
)
palette5 <- c("#eff3ff", "#E1DCF5", "#BBADF0", "#6849BA", "#361AA3")

# Read 2024 quarterly trip data
indegoq1 <- read.csv(here("labs/lab5/data/indego-trips-2024-q1.csv"))
indegoq2 <- read.csv(here("labs/lab5/data/indego-trips-2024-q2.csv"))
indegoq3 <- read.csv(here("labs/lab5/data/indego-trips-2024-q3.csv"))
indegoq4 <- read.csv(here("labs/lab5/data/indego-trips-2024-q4.csv"))
# Look at data
glimpse(indegoq1)
## Rows: 193,770
## Columns: 15
## $ trip_id <int> 820756378, 820756236, 820756358, 820756551, 820886…
## $ duration <int> 7, 1, 5, 9, 600, 31, 11, 30, 37, 545, 546, 33, 27,…
## $ start_time <chr> "1/1/2024 0:04", "1/1/2024 0:04", "1/1/2024 0:05",…
## $ end_time <chr> "1/1/2024 0:11", "1/1/2024 0:05", "1/1/2024 0:10",…
## $ start_station <int> 3125, 3068, 3068, 3049, 3010, 3012, 3349, 3012, 32…
## $ start_lat <dbl> 39.94391, 39.93549, 39.93549, 39.94509, 39.94711, …
## $ start_lon <dbl> -75.16735, -75.16711, -75.16711, -75.14250, -75.16…
## $ end_station <int> 3010, 3068, 3086, 3063, 3236, 3209, 3322, 3209, 30…
## $ end_lat <dbl> 39.94711, 39.93549, 39.94019, 39.94633, 39.96566, …
## $ end_lon <dbl> -75.16618, -75.16711, -75.16691, -75.16980, -75.17…
## $ bike_id <chr> "23172", "14667", "14667", "22420", "23095", "0355…
## $ plan_duration <int> 30, 30, 30, 30, 30, 30, 365, 30, 30, 30, 365, 1, 3…
## $ trip_route_category <chr> "One Way", "Round Trip", "One Way", "One Way", "On…
## $ passholder_type <chr> "Indego30", "Indego30", "Indego30", "Indego30", "I…
## $ bike_type <chr> "electric", "standard", "standard", "electric", "e…

| Metric | Value |
|---|---|
| Total trips in Q1 (Jan-March) 2024 | 193,770 |
| Date range (start) | 2024-01-01 00:04:00 |
| Date range (end) | 2024-03-31 23:58:00 |
| Unique start stations | 259 |

| Route Category | Count | Percentage |
|---|---|---|
| One Way | 182,750 | 94.31% |
| Round Trip | 11,020 | 5.69% |

| Passholder Type | Count | Percentage |
|---|---|---|
| Day Pass | 6,966 | 3.59% |
| Indego30 | 114,818 | 59.25% |
| Indego365 | 71,986 | 37.15% |

| Bike Type | Count | Percentage |
|---|---|---|
| electric | 113,962 | 58.8% |
| standard | 79,808 | 41.2% |
# Summary statistics table
summary_stats <- data.frame(
Metric = c(
"Total trips in Q2 2024",
"Date range (start)",
"Date range (end)",
"Unique start stations"
),
Value = c(
format(nrow(indegoq2), big.mark = ","),
as.character(min(mdy_hm(indegoq2$start_time))),
as.character(max(mdy_hm(indegoq2$start_time))),
length(unique(indegoq2$start_station))
)
)
kable(summary_stats,
caption = "Q2 2024 Indego Trip Summary",
col.names = c("Metric", "Value"),
align = c("l", "r")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)

| Metric | Value |
|---|---|
| Total trips in Q2 2024 | 368,091 |
| Date range (start) | 2024-04-01 |
| Date range (end) | 2024-06-30 23:56:00 |
| Unique start stations | 255 |
# Trip route categories
route_table <- as.data.frame(table(indegoq2$trip_route_category))
colnames(route_table) <- c("Trip Route Category", "Count")
route_table$Percentage <- paste0(round(100 * route_table$Count / sum(route_table$Count), 1), "%")
kable(route_table,
caption = "Route Categories",
format.args = list(big.mark = ","),
align = c("l", "r", "r")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)

| Trip Route Category | Count | Percentage |
|---|---|---|
| One Way | 342,299 | 93% |
| Round Trip | 25,792 | 7% |
# Passholder types (including NAs)
passholder_table <- as.data.frame(table(indegoq2$passholder_type))
colnames(passholder_table) <- c("Passholder Type", "Count")
passholder_table$Percentage <- paste0(round(100 * passholder_table$Count / sum(passholder_table$Count), 2), "%")
kable(passholder_table,
caption = "Passholder Type",
format.args = list(big.mark = ","),
align = c("l", "r", "r")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)

| Passholder Type | Count | Percentage |
|---|---|---|
| Day Pass | 20,926 | 5.69% |
| Indego30 | 231,909 | 63% |
| Indego365 | 115,256 | 31.31% |
# Bike types
bike_table <- as.data.frame(table(indegoq2$bike_type))
colnames(bike_table) <- c("Bike Type", "Count")
bike_table$Percentage <- paste0(round(100 * bike_table$Count / sum(bike_table$Count), 1), "%")
kable(bike_table,
caption = "Bike Types",
format.args = list(big.mark = ","),
align = c("l", "r", "r")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)

| Bike Type | Count | Percentage |
|---|---|---|
| electric | 216,497 | 58.8% |
| standard | 151,594 | 41.2% |
# Summary statistics table
summary_stats <- data.frame(
Metric = c(
"Total trips in Q3 (Jul-Sept) 2024",
"Date range (start)",
"Date range (end)",
"Unique start stations"
),
Value = c(
format(nrow(indegoq3), big.mark = ","),
as.character(min(mdy_hm(indegoq3$start_time))),
as.character(max(mdy_hm(indegoq3$start_time))),
length(unique(indegoq3$start_station))
)
)
kable(summary_stats,
caption = "Q3 (Jul - Sept 2024) Indego Trip Summary",
col.names = c("Metric", "Value"),
align = c("l", "r")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)

| Metric | Value |
|---|---|
| Total trips in Q3 (Jul-Sept) 2024 | 408,408 |
| Date range (start) | 2024-07-01 00:02:00 |
| Date range (end) | 2024-09-30 23:59:00 |
| Unique start stations | 261 |
# Trip route categories
route_table <- as.data.frame(table(indegoq3$trip_route_category))
colnames(route_table) <- c("Route Category", "Count")
route_table$Percentage <- paste0(round(100 * route_table$Count / sum(route_table$Count), 2), "%")
kable(route_table,
caption = "Route Categories",
format.args = list(big.mark = ","),
align = c("l", "r", "r")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)

| Route Category | Count | Percentage |
|---|---|---|
| One Way | 380,939 | 93.27% |
| Round Trip | 27,469 | 6.73% |
# Passholder types
passholder_table <- as.data.frame(table(indegoq3$passholder_type))
colnames(passholder_table) <- c("Passholder Type", "Count")
passholder_table$Percentage <- paste0(round(100 * passholder_table$Count / sum(passholder_table$Count), 2), "%")
kable(passholder_table,
caption = "Passholder Type",
format.args = list(big.mark = ","),
align = c("l", "r", "r")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)

| Passholder Type | Count | Percentage |
|---|---|---|
| Day Pass | 22,885 | 5.6% |
| Indego30 | 239,857 | 58.73% |
| Indego365 | 132,358 | 32.41% |
| Walk-up | 13,308 | 3.26% |
# Bike types
bike_table <- as.data.frame(table(indegoq3$bike_type))
colnames(bike_table) <- c("Bike Type", "Count")
bike_table$Percentage <- paste0(round(100 * bike_table$Count / sum(bike_table$Count), 1), "%")
kable(bike_table,
caption = "Bike Types",
format.args = list(big.mark = ","),
align = c("l", "r", "r")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)

| Bike Type | Count | Percentage |
|---|---|---|
| electric | 236,839 | 58% |
| standard | 171,569 | 42% |

| Metric | Value |
|---|---|
| Total trips in Q4 (Oct-Dec) 2024 | 299,121 |
| Date range (start) | 2024-10-01 |
| Date range (end) | 2024-12-31 23:56:00 |
| Unique start stations | 256 |

| Route Category | Count | Percentage |
|---|---|---|
| One Way | 282,675 | 94.5% |
| Round Trip | 16,446 | 5.5% |

| Passholder Type | Count | Percentage |
|---|---|---|
| Day Pass | 10,711 | 3.58% |
| Indego30 | 159,895 | 53.45% |
| Indego365 | 112,690 | 37.67% |
| IndegoFlex | 1 | 0% |
| NULL | 1,165 | 0.39% |
| Walk-up | 14,659 | 4.9% |

| Bike Type | Count | Percentage |
|---|---|---|
| electric | 216,497 | 58.8% |
| standard | 151,594 | 41.2% |
We need to aggregate trips into hourly intervals for our panel data structure.
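That aggregation boils down to flooring each trip's start time to the hour and counting trips per station-hour. A minimal sketch with toy rows in the same `m/d/Y H:M` format as the Indego CSVs (the station IDs here are illustrative):

```r
library(dplyr)
library(lubridate)

# Toy trips in the same shape as the Indego CSVs
trips <- tibble(
  start_station = c(3010, 3010, 3010, 3032),
  start_time = c("1/1/2024 8:04", "1/1/2024 8:47", "1/1/2024 9:10", "1/1/2024 8:30")
)

# Parse the timestamp, floor it to the hour, then count trips per station-hour
hourly <- trips %>%
  mutate(interval60 = floor_date(mdy_hm(start_time), unit = "hour")) %>%
  count(start_station, interval60, name = "Trip_Count")

hourly
```

Here station 3010 gets two trips in the 8:00 hour and one in the 9:00 hour, and station 3032 gets one in the 8:00 hour, which is exactly the station-hour structure the panel needs.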
I selected all four quarters of 2024 rather than a single quarter for a few reasons. First, combining all quarters allows for a larger training dataset which can help minimize the risk of overfitting. Second, Philadelphia experiences very seasonal weather, which in turn affects bike share ridership and demand. In the winter months (Q1 and Q4), low ridership can be attributed to cold temperatures and inclement weather, whereas the spring and summer (Q2 and Q3) see most of the demand. By including the full year, the model learns these seasonal patterns and can generalize better.
Comparing across all quarters reveals clear temporal differences that are explored in the “Trips Over Time” and “Hourly Patterns” sections below.
The sharp swing in early October, from about 3,000 trips to almost 5,000, is suspicious. I looked for notable events in Philadelphia last fall but could not find much. The spike in ridership in March 2024 (Q1) could likewise be due to a number of factors: unusually warm weather, SEPTA safety issues, and more.
This code chunk combines all quarters into one variable for easier synthesis once we start looking at the regressions and temporal patterns.
indego_all <- bind_rows(
Q1_2024 = indegoq1, # Now these have interval60, week, month, hour, etc.
Q2_2024 = indegoq2,
Q3_2024 = indegoq3,
Q4_2024 = indegoq4,
.id = "quarter"
)
# Verify time features are present
cat("Total trips in processed indego_all:", format(nrow(indego_all), big.mark = ","), "\n")
## Total trips in processed indego_all: 1,269,390
cat("Time features present:", all(c("interval60", "week", "month", "hour", "dotw") %in% names(indego_all)), "\n")
## Time features present: TRUE
##
## Q1_2024 Q2_2024 Q3_2024 Q4_2024
## 193770 368091 408408 299121
In this code chunk, each quarter is plotted separately, and the four plots are then arranged side by side below.
During typical morning and evening rush hours, Indego ridership peaks fairly consistently across all four quarters, at roughly 7-8 AM and 5-6 PM. On weekends, ridership instead builds gradually over the day and tapers off through the evening.
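The weekday/weekend contrast behind those plots comes down to grouping trips by hour of day and day type. A minimal sketch of that computation, assuming a data frame with `hour` and `dotw` columns like the ones created later for the panel (the toy rows here are illustrative):

```r
library(dplyr)

# Toy trip records with hour-of-day and day-of-week already extracted
trips_by_hour <- tibble(
  hour = c(8, 8, 17, 12, 12, 14),
  dotw = c("Mon", "Tue", "Wed", "Sat", "Sun", "Sat")
)

# Split weekdays from weekends, then count trips in each hour slot
hourly_profile <- trips_by_hour %>%
  mutate(day_type = ifelse(dotw %in% c("Sat", "Sun"), "Weekend", "Weekday")) %>%
  count(day_type, hour, name = "trips")

hourly_profile
```

Feeding a summary like this to `ggplot()` with one facet per day type reproduces the rush-hour versus gradual-weekend shapes described above.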
The chunk below ranks stations by the number of trips that start from each station (i.e., bike deployments).
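That ranking is a count-and-sort. A minimal sketch with hypothetical toy data (the real chunk runs once per quarter on the full trip records):

```r
library(dplyr)

# Toy trip records: one row per trip, as in the Indego data
trips <- tibble(
  start_station = c(3010, 3010, 3010, 3295, 3295, 3032),
  date = as.Date(c("2024-01-01", "2024-01-01", "2024-01-02",
                   "2024-01-01", "2024-01-02", "2024-01-01"))
)

# Days covered by the data, for the average-daily column
n_days <- as.numeric(max(trips$date) - min(trips$date)) + 1

# Count starts per station, sort descending, and rank
station_rank <- trips %>%
  count(start_station, name = "Total_Trips") %>%
  arrange(desc(Total_Trips)) %>%
  mutate(
    Rank = row_number(),
    Avg_Daily = round(Total_Trips / n_days, 1)
  )

station_rank
```

With these toy rows, station 3010 ranks first with 3 total trips over 2 days (1.5 per day).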

| Rank | Station | Total Trips | Avg Daily |
|---|---|---|---|
| 1 | 3010 | 3,722 | 40.9 |
| 2 | 3295 | 3,358 | 36.9 |
| 3 | 3032 | 2,966 | 32.6 |
| 4 | 3020 | 2,747 | 30.2 |
| 5 | 3208 | 2,737 | 30.1 |
| 6 | 3359 | 2,513 | 27.6 |
| 7 | 3012 | 2,480 | 27.3 |
| 8 | 3296 | 2,384 | 26.2 |
| 9 | 3066 | 2,321 | 25.5 |
| 10 | 3052 | 2,282 | 25.4 |

| Rank | Station | Total Trips | Avg Daily |
|---|---|---|---|
| 1 | 3010 | 6,115 | 67.2 |
| 2 | 3032 | 5,231 | 58.1 |
| 3 | 3295 | 4,451 | 48.9 |
| 4 | 3359 | 4,248 | 46.7 |
| 5 | 3022 | 4,070 | 44.7 |
| 6 | 3028 | 4,052 | 44.5 |
| 7 | 3066 | 4,047 | 44.5 |
| 8 | 3054 | 3,847 | 42.3 |
| 9 | 3020 | 3,792 | 41.7 |
| 10 | 3208 | 3,661 | 40.2 |

| Rank | Station | Total Trips | Avg Daily |
|---|---|---|---|
| 1 | 3010 | 6,654 | 72.3 |
| 2 | 3032 | 5,436 | 59.1 |
| 3 | 3244 | 4,421 | 48.1 |
| 4 | 3054 | 4,305 | 46.8 |
| 5 | 3359 | 4,305 | 46.8 |
| 6 | 3296 | 4,252 | 46.2 |
| 7 | 3101 | 4,226 | 45.9 |
| 8 | 3020 | 4,154 | 45.2 |
| 9 | 3295 | 4,144 | 45.0 |
| 10 | 3066 | 4,114 | 44.7 |

| Rank | Station | Total Trips | Avg Daily |
|---|---|---|---|
| 1 | 3010 | 5,943 | 64.6 |
| 2 | 3032 | 4,471 | 48.6 |
| 3 | 3359 | 3,923 | 42.6 |
| 4 | 3244 | 3,492 | 38.0 |
| 5 | 3295 | 3,411 | 37.1 |
| 6 | 3020 | 3,369 | 36.6 |
| 7 | 3208 | 3,343 | 36.3 |
| 8 | 3066 | 3,342 | 36.3 |
| 9 | 3054 | 3,297 | 35.8 |
| 10 | 3101 | 3,214 | 34.9 |
Station #3010 has the most origin trips in Q1 2024, averaging 40.9 trips per day, and it remains the top origin station in every quarter of 2024.
We’ll get census tract data to add demographic context to our stations.
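The `tidycensus` call itself isn't shown here; based on the columns in the `glimpse()` output below, it was roughly the following sketch. The ACS variable codes are inferred from that output, while the survey year, `output = "wide"`, and the column-renaming details are assumptions:

```r
library(dplyr)

# ACS variable codes inferred from the glimpse() columns below
acs_vars <- c(
  Total_Pop         = "B01003_001",
  Med_Inc           = "B19013_001",
  Total_Commuters   = "B08301_001",
  Transit_Commuters = "B08301_010",
  White_Pop         = "B02001_002",
  Med_Home_Value    = "B25077_001"
)

# get_acs() needs a Census API key, so guard the call
if (Sys.getenv("CENSUS_API_KEY") != "") {
  library(tidycensus)
  philly_census <- get_acs(
    geography = "tract",
    variables = acs_vars,
    state = "PA", county = "Philadelphia",
    year = 2022, output = "wide", geometry = TRUE
  ) %>%
    # wide output appends "E" to estimate columns; strip it to match the names above
    rename_with(~ sub("E$", "", .x), .cols = all_of(paste0(names(acs_vars), "E"))) %>%
    mutate(
      Percent_Taking_Transit = 100 * Transit_Commuters / Total_Commuters,
      Percent_White = 100 * White_Pop / Total_Pop
    )
}
```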
## Rows: 408
## Columns: 17
## $ GEOID <chr> "42101001500", "42101001800", "42101002802", "4…
## $ NAME <chr> "Census Tract 15; Philadelphia County; Pennsylv…
## $ Total_Pop <dbl> 3251, 3300, 5720, 4029, 4415, 1815, 3374, 2729,…
## $ B01003_001M <dbl> 677, 369, 796, 437, 853, 210, 480, 734, 763, 11…
## $ Med_Inc <dbl> 110859, 114063, 78871, 61583, 32347, 48581, 597…
## $ B19013_001M <dbl> 24975, 30714, 20396, 22293, 4840, 13812, 6278, …
## $ Total_Commuters <dbl> 2073, 2255, 3032, 2326, 1980, 969, 2427, 708, 2…
## $ B08301_001M <dbl> 387, 308, 478, 383, 456, 189, 380, 281, 456, 68…
## $ Transit_Commuters <dbl> 429, 123, 685, 506, 534, 192, 658, 218, 438, 51…
## $ B08301_010M <dbl> 188, 66, 219, 144, 285, 71, 278, 184, 176, 235,…
## $ White_Pop <dbl> 2185, 2494, 3691, 3223, 182, 984, 2111, 231, 35…
## $ B02001_002M <dbl> 268, 381, 592, 380, 88, 190, 463, 112, 238, 778…
## $ Med_Home_Value <dbl> 568300, 605000, 350600, 296400, 76600, 289700, …
## $ B25077_001M <dbl> 58894, 34876, 12572, 22333, 10843, 118720, 1506…
## $ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-75.16558 3..., MU…
## $ Percent_Taking_Transit <dbl> 20.694645, 5.454545, 22.592348, 21.754084, 26.9…
## $ Percent_White <dbl> 67.2100892, 75.5757576, 64.5279720, 79.9950360,…
# Create station location data, dropping any rows with missing coordinates
stations_q1 <- indegoq1 %>%
  group_by(start_station, start_lon, start_lat) %>%
  summarize(trips = n(), .groups = "drop")
stations_q1_clean <- stations_q1 %>%
  filter(!is.na(start_lon), !is.na(start_lat))
cat("Removed", nrow(stations_q1) - nrow(stations_q1_clean), "stations with missing coordinates\n")
## Removed 0 stations with missing coordinates
stations_q2 <- indegoq2 %>%
group_by(start_station, start_lon, start_lat) %>%
summarize(trips = n(), .groups = "drop")
stations_q3 <- indegoq3 %>%
group_by(start_station, start_lon, start_lat) %>%
summarize(trips = n(), .groups = "drop")
stations_q4 <- indegoq4 %>%
group_by(start_station, start_lon, start_lat) %>%
summarize(trips = n(), .groups = "drop")
# Q1 Map
mhhinc1 <- ggplot() +
geom_sf(data = philly_census, aes(fill = Med_Inc), color = NA) +
scale_fill_viridis(
option = "viridis",
name = "Median Income",
labels = scales::label_dollar(scale = 1/1000, suffix = "K")
) +
labs(title = "Q1 2024") +
geom_point(
data = stations_q1_clean,
aes(x = start_lon, y = start_lat, size = trips),
color = "#CC2810", alpha = 0.1
) +
scale_size_continuous(range = c(1, 4), guide = "none") +
coord_sf(xlim = c(-75.28, -74.96), ylim = c(39.87, 40.14), expand = FALSE) + # force a consistent extent
mapTheme +
theme(legend.position = "none")
# Q2 Map
mhhinc2 <- ggplot() +
geom_sf(data = philly_census, aes(fill = Med_Inc), color = NA) +
scale_fill_viridis(
option = "viridis",
name = "Median Income",
labels = scales::label_dollar(scale = 1/1000, suffix = "K")
) +
labs(title = "Q2 2024") +
geom_point(
data = stations_q2,
aes(x = start_lon, y = start_lat, size = trips),
color = "#CC2810", alpha = 0.1
) +
scale_size_continuous(range = c(1, 4), guide = "none") +
coord_sf(xlim = c(-75.28, -74.96), ylim = c(39.87, 40.14), expand = FALSE) + # match the Q1 extent
mapTheme +
theme(legend.position = "none")
# Q3 Map
mhhinc3 <- ggplot() +
geom_sf(data = philly_census, aes(fill = Med_Inc), color = NA) +
scale_fill_viridis(
option = "viridis",
name = "Median Income",
labels = scales::label_dollar(scale = 1/1000, suffix = "K")
) +
labs(title = "Q3 2024") +
geom_point(
data = stations_q3,
aes(x = start_lon, y = start_lat, size = trips),
color = "#CC2810", alpha = 0.1
) +
scale_size_continuous(range = c(1, 4), guide = "none") +
coord_sf(xlim = c(-75.28, -74.96), ylim = c(39.87, 40.14), expand = FALSE) + # match the Q1 extent
mapTheme +
theme(legend.position = "none")
# Q4 Map
mhhinc4 <- ggplot() +
geom_sf(data = philly_census, aes(fill = Med_Inc), color = NA) +
scale_fill_viridis(
option = "viridis",
name = "Median Income",
labels = scales::label_dollar(scale = 1/1000, suffix = "K")
) +
labs(title = "Q4 2024") +
geom_point(
data = stations_q4,
aes(x = start_lon, y = start_lat, size = trips),
color = "#CC2810", alpha = 0.1
) +
scale_size_continuous(range = c(1, 4), guide = "none") +
coord_sf(xlim = c(-75.28, -74.96), ylim = c(39.87, 40.14), expand = FALSE) + # match the Q1 extent
mapTheme
mhhinc1 + mhhinc2 + mhhinc3 + mhhinc4 + # combine all four quarterly maps
plot_layout(guides = "collect", nrow = 1) +
plot_annotation(
title = "Indego Station Activity by Quarter: 2024",
subtitle = "Red points show station locations sized by trip volume",
caption = "Source: Indego bike share, US Census ACS 2022"
) &
theme(
legend.position = "bottom",
legend.direction = "horizontal",
legend.key.width = unit(3, "cm"),
legend.key.height = unit(0.3, "cm"),
legend.box.margin = margin(t = 10)
)

A quick visualization shows little differentiation between the quarters in trip volume at the city scale.
We’ll spatially join census characteristics to each bike station.
# Create sf object for stations from COMBINED data
stations_sf <- indego_all %>%
distinct(start_station, start_lat, start_lon) %>%
filter(!is.na(start_lat), !is.na(start_lon)) %>%
st_as_sf(coords = c("start_lon", "start_lat"), crs = 4326)
# Spatial join to get census tract for each station
stations_census <- st_join(stations_sf, philly_census, left = TRUE) %>%
st_drop_geometry()
# Look at the result - investigate whether all of the stations joined to census data
stations_for_map <- indego_all %>%
distinct(start_station, start_lat, start_lon) %>%
filter(!is.na(start_lat), !is.na(start_lon)) %>%
left_join(
stations_census %>% dplyr::select(start_station, Med_Inc),
by = "start_station"
) %>%
mutate(has_census = !is.na(Med_Inc))
# Add back to trip data
indego_all_census <- indego_all %>%
left_join(
stations_census %>%
dplyr::select(start_station, Med_Inc, Percent_Taking_Transit,
Percent_White, Total_Pop),
by = "start_station"
)
# Prepare data for visualization
stations_for_map <- indego_all %>%
distinct(start_station, start_lat, start_lon) %>%
filter(!is.na(start_lat), !is.na(start_lon)) %>%
left_join(
stations_census %>% dplyr::select(start_station, Med_Inc),
by = "start_station"
) %>%
mutate(has_census = !is.na(Med_Inc))
# Create the map showing problem stations
ggplot() +
geom_sf(data = philly_census, aes(fill = Med_Inc), color = "white", size = 0.1) +
scale_fill_viridis(
option = "viridis",
name = "Median\nIncome",
labels = scales::dollar,
na.value = "grey90"
) +
# Stations with census data (small grey dots)
geom_point(
data = stations_for_map %>% filter(has_census),
aes(x = start_lon, y = start_lat),
color = "grey30", size = 1, alpha = 0.6
) +
# Stations WITHOUT census data (red X marks the spot)
geom_point(
data = stations_for_map %>% filter(!has_census),
aes(x = start_lon, y = start_lat),
color = "red", size = 1, shape = 4, stroke = 1.5
) +
labs(
title = "Philadelphia Median Household Income by Census Tract",
subtitle = "Indego stations shown (RED = no census data match)",
caption = "Red X marks indicate stations that didn't join to census tracts"
) +
mapTheme

It is unclear what is going on with this outlier; our earlier cleaning should have removed it.
# Dealing with missing data
# Identify which stations to keep (those that successfully joined to a census tract)
valid_stations <- stations_census %>%
filter(!is.na(Med_Inc)) %>%
pull(start_station)
# Filter trip data to valid stations only
indego_all_census <- indego_all %>%
filter(start_station %in% valid_stations) %>%
left_join(
stations_census %>%
dplyr::select(start_station, Med_Inc, Percent_Taking_Transit,
Percent_White, Total_Pop),
by = "start_station"
)
# Summary
cat("Total trips before filtering:", format(nrow(indego_all), big.mark = ","), "\n")
## Total trips before filtering: 1,269,390
## Trips at residential stations: 1,192,977
## Trips removed: 76,413
Weather significantly affects bike share demand! We need weather data covering all four quarters of 2024.
# Get weather from Philadelphia International Airport (KPHL)
# Covering full year: Q1 2024 through Q4 2024
# Query each quarter separately
weather_q1 <- riem_measures(station = "PHL",
date_start = "2024-01-01",
date_end = "2024-03-31") #Quarter 1 '24
weather_q2 <- riem_measures(station = "PHL",
date_start = "2024-04-01",
date_end = "2024-06-30") # Quarter 2 '24
weather_q3 <- riem_measures(station = "PHL",
date_start = "2024-07-01",
date_end = "2024-09-30") # Quarter 3 '24
weather_q4 <- riem_measures(station = "PHL",
date_start = "2024-10-01",
date_end = "2024-12-31") # Quarter 4 '24
# Combine them
weather_2024 <- bind_rows(weather_q1, weather_q2, weather_q3, weather_q4)
# Process weather data
weather_processed <- weather_2024 %>%
mutate(
interval60 = floor_date(valid, unit = "hour"),
Temperature = tmpf, # Temperature in Fahrenheit
Precipitation = ifelse(is.na(p01i), 0, p01i), # Hourly precip in inches
Wind_Speed = sknt # Wind speed in knots
) %>%
dplyr::select(interval60, Temperature, Precipitation, Wind_Speed) %>%
distinct() %>%
arrange(interval60)
# Check for missing hours and interpolate if needed
weather_complete <- weather_processed %>%
complete(interval60 = seq(min(interval60), max(interval60), by = "hour")) %>%
fill(Temperature, Precipitation, Wind_Speed, .direction = "down")
# Summary statistics
cat("Weather data coverage:\n")
## Weather data coverage:
## Start: 2024-01-01
## End: 2024-12-30 23:00:00
## Total hours: 10693
## Temperature Precipitation Wind_Speed
## Min. :12.00 Min. :0.000000 Min. : 0.000
## 1st Qu.:46.00 1st Qu.:0.000000 1st Qu.: 5.000
## Median :59.00 Median :0.000000 Median : 7.000
## Mean :58.87 Mean :0.006827 Mean : 7.619
## 3rd Qu.:73.00 3rd Qu.:0.000000 3rd Qu.:10.000
## Max. :98.00 Max. :1.250000 Max. :44.000
# Temperature patterns by quarter
weather_complete %>%
mutate(
quarter = paste0("Q", quarter(interval60), " 2024")
) %>%
ggplot(aes(x = interval60, y = Temperature, color = quarter)) +
geom_line(alpha = 0.6) +
geom_smooth(se = FALSE, linewidth = 1.2) +
scale_color_viridis_d(option = "plasma") +
labs(
title = "Philadelphia Temperature Patterns Across All Quarters",
subtitle = "2024 Seasonal Temperature Variation",
x = "Date",
y = "Temperature (°F)",
color = "Quarter"
) +
plotTheme +
theme(legend.position = "bottom")

# Count trips by station-hour
trips_panel <- indego_all_census %>%
group_by(interval60, start_station, start_lat, start_lon,
Med_Inc, Percent_Taking_Transit, Percent_White, Total_Pop) %>%
summarize(Trip_Count = n()) %>%
ungroup()
# Summary statistics
cat("Trips by station-hour:\n")
## Trips by station-hour:
## How many station-hour observations?: 633678
## How many unique stations?: 254
## How many unique hours?: 8762
Not every station has trips every hour. We need a complete panel where every station-hour combination exists (even if Trip_Count = 0).
# Calculate expected panel size
n_stations <- length(unique(trips_panel$start_station))
n_hours <- length(unique(trips_panel$interval60))
expected_rows <- n_stations * n_hours
cat("Expected panel rows:", format(expected_rows, big.mark = ","), "\n")
## Expected panel rows: 2,225,548
## Current rows: 633,678
## Missing rows: 1,591,870
# Create complete panel
study_panel <- expand.grid(
interval60 = unique(trips_panel$interval60),
start_station = unique(trips_panel$start_station)
) %>%
left_join(trips_panel, by = c("interval60", "start_station")) %>%
mutate(
Trip_Count = replace_na(Trip_Count, 0),
# extract month from interval60 (assuming POSIXct)
month = lubridate::month(interval60, label = TRUE, abbr = TRUE),
# force ALL 12 months so no new levels occur in testing
month = factor(month, levels = month.abb)
)
# Fill in station attributes (same for all hours)
station_attributes <- trips_panel %>%
group_by(start_station) %>%
summarize(
start_lat = first(start_lat),
start_lon = first(start_lon),
Med_Inc = first(Med_Inc),
Percent_Taking_Transit = first(Percent_Taking_Transit),
Percent_White = first(Percent_White),
Total_Pop = first(Total_Pop)
)
study_panel <- study_panel %>%
left_join(station_attributes, by = "start_station")
cat("Complete panel rows:", format(nrow(study_panel), big.mark = ","), "\n")
## Complete panel rows: 2,225,548
study_panel <- study_panel %>%
  mutate(
    week = week(interval60),
    # month was already created above with all 12 levels forced; don't overwrite it
    dotw = wday(interval60, label = TRUE),
    hour = hour(interval60),
    date = as.Date(interval60),
    weekend = ifelse(dotw %in% c("Sat", "Sun"), 1, 0),
    rush_hour = ifelse(hour %in% c(7, 8, 9, 16, 17, 18), 1, 0)
  )

study_panel <- study_panel %>%
  left_join(weather_complete, by = "interval60")
# Check for missing values
summary(study_panel %>% select(Trip_Count, Temperature, Precipitation))
## Trip_Count Temperature Precipitation
## Min. : 0.0000 Min. :12.0 Min. :0.0000
## 1st Qu.: 0.0000 1st Qu.:46.0 1st Qu.:0.0000
## Median : 0.0000 Median :59.0 Median :0.0000
## Mean : 0.5156 Mean :58.9 Mean :0.0068
## 3rd Qu.: 1.0000 3rd Qu.:73.0 3rd Qu.:0.0000
## Max. :28.0000 Max. :98.0 Max. :1.2500
## NA's :6096 NA's :6096
The key innovation for space-time prediction: past demand predicts future demand.
If there were 15 bike trips from Station A at 8:00 AM, there will probably be ~15 trips at 9:00 AM. We can use this temporal persistence to improve predictions.
# Sort by station and time
study_panel <- study_panel %>%
arrange(start_station, interval60)
# Create lag variables WITHIN each station
study_panel <- study_panel %>%
group_by(start_station) %>%
mutate(
lag1Hour = lag(Trip_Count, 1),
lag2Hours = lag(Trip_Count, 2),
lag3Hours = lag(Trip_Count, 3),
lag12Hours = lag(Trip_Count, 12),
lag1day = lag(Trip_Count, 24)
) %>%
ungroup()
# Remove rows with NA lags (first 24 hours for each station)
study_panel_complete <- study_panel %>%
filter(!is.na(lag1day))
cat("Rows after removing NA lags:", format(nrow(study_panel_complete), big.mark = ","), "\n")
## Rows after removing NA lags: 2,709,672
# Find stations with highest trip counts
top_stations <- study_panel_complete %>%
group_by(start_station) %>%
summarize(total_trips = sum(Trip_Count)) %>%
arrange(desc(total_trips)) %>%
head(10)
print(top_stations)
## # A tibble: 10 × 2
## start_station total_trips
## <int> <int>
## 1 3010 26470
## 2 3032 21419
## 3 3295 18090
## 4 3359 17482
## 5 3066 16242
## 6 3244 15695
## 7 3028 15602
## 8 3054 15591
## 9 3022 15376
## 10 3101 15078
# Pick a high-activity station (use one from the list above)
example_station <- study_panel_complete %>%
filter(start_station == top_stations$start_station[1]) %>% # Highest activity station
head(168) # One week
# Check if you have data
cat("Trip count range:", range(example_station$Trip_Count), "\n")
## Trip count range: 0 10
## Mean trips: 1.446429
# Now plot
ggplot(example_station, aes(x = interval60)) +
geom_line(aes(y = Trip_Count, color = "Current"), linewidth = 1) +
geom_line(aes(y = lag1Hour, color = "1 Hour Ago"), linewidth = 1, alpha = 0.7) +
geom_line(aes(y = lag1day, color = "24 Hours Ago"), linewidth = 1, alpha = 0.7) +
scale_color_manual(values = c(
"Current" = "#08519c",
"1 Hour Ago" = "#3182bd",
"24 Hours Ago" = "#6baed6"
)) +
labs(
title = "Temporal Lag Patterns at One Station",
subtitle = "Past demand predicts future demand",
x = "Date-Time",
y = "Trip Count",
color = "Time Period"
) +
plotTheme
CRITICAL: We must train on PAST data and test on FUTURE data!
In real operations, at 6:00 AM on March 15, we need to predict demand for March 15-31. We have data from Jan 1 - March 14, but NOT from March 15-31 (it hasn’t happened yet!).
Wrong approach: Train on weeks 10-13, test on weeks 1-9 (predicting past from future!)
Correct approach: Train on weeks 1-9, test on weeks 10-13 (predicting future from past)
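The single cut used below (train on weeks 1-9, test on weeks 10-13) is one fold of a more general forward-chaining scheme, where the training origin rolls forward and the model is never tested on the past. A minimal sketch of that idea; the week numbers are illustrative and `forward_chain_splits` is a hypothetical helper, not part of the lab code:

```r
# Sketch: forward-chaining (rolling-origin) splits generalize the single
# train/test cut below. Week numbers are illustrative only.
forward_chain_splits <- function(weeks, initial = 9, horizon = 2) {
  origins <- seq(initial, max(weeks) - horizon)
  lapply(origins, function(o) {
    list(train_weeks = weeks[weeks <= o],
         test_weeks  = weeks[weeks > o & weeks <= o + horizon])
  })
}

splits <- forward_chain_splits(1:13, initial = 9, horizon = 2)
# First fold trains on weeks 1-9 and tests on weeks 10-11;
# each later fold rolls the origin forward by one week.
```

Averaging the test error over several folds gives a more stable estimate than a single cut, at the cost of refitting the model once per fold.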
# Split by week
# Q1 has weeks 1-13 (Jan-Mar)
# Train on weeks 1-9 (Jan 1 - early March)
# Test on weeks 10-13 (rest of March)
# Which stations have trips in BOTH early and late periods?
early_stations <- study_panel_complete %>%
filter(week < 10) %>%
filter(Trip_Count > 0) %>%
distinct(start_station) %>%
pull(start_station)
late_stations <- study_panel_complete %>%
filter(week >= 10) %>%
filter(Trip_Count > 0) %>%
distinct(start_station) %>%
pull(start_station)
# Keep only stations that appear in BOTH periods
common_stations <- intersect(early_stations, late_stations)
# Filter panel to only common stations
study_panel_complete <- study_panel_complete %>%
filter(start_station %in% common_stations)
# NOW create train/test split
train <- study_panel_complete %>%
filter(week < 10)
test <- study_panel_complete %>%
filter(week >= 10)
cat("Training observations:", format(nrow(train), big.mark = ","), "\n")
## Training observations: 445,496
## Testing observations: 2,040,148
## Training date range: 19723 to 19785
## Testing date range: 19786 to 20088
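The date ranges above print as raw day counts because `cat()` drops the `Date` class and emits the underlying numeric (days since 1970-01-01). Converting the printed values back recovers readable dates:

```r
# cat() strips the Date class, so ranges print as days since 1970-01-01.
# Converting the printed values back recovers readable dates:
as.Date(c(19723, 19785, 19786, 20088), origin = "1970-01-01")
# "2024-01-01" "2024-03-03" "2024-03-04" "2024-12-31"
```

Note that the recovered test range runs from March through December 2024, which also explains why the test set is so much larger than the training set despite the "Q1 weeks 1-13" comment above.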
We’ll build 5 models with increasing complexity to see what improves predictions.
# Create day of week factor with treatment (dummy) coding
train <- train %>%
mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))
# Set contrasts to treatment coding (dummy variables)
contrasts(train$dotw_simple) <- contr.treatment(7)
# Now run the model
model1 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation,
data = train
)
summary(model1)
##
## Call:
## lm(formula = Trip_Count ~ as.factor(hour) + dotw_simple + Temperature +
## Precipitation, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8997 -0.3927 -0.1859 0.0252 14.2090
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.1658768 0.0080090 -20.711 < 0.0000000000000002 ***
## as.factor(hour)1 -0.0278062 0.0076114 -3.653 0.000259 ***
## as.factor(hour)2 -0.0277130 0.0076920 -3.603 0.000315 ***
## as.factor(hour)3 -0.0528180 0.0077975 -6.774 0.0000000000126 ***
## as.factor(hour)4 -0.0262147 0.0076254 -3.438 0.000586 ***
## as.factor(hour)5 0.0434695 0.0076812 5.659 0.0000000152186 ***
## as.factor(hour)6 0.1577783 0.0075942 20.776 < 0.0000000000000002 ***
## as.factor(hour)7 0.3041425 0.0077357 39.317 < 0.0000000000000002 ***
## as.factor(hour)8 0.4836049 0.0074759 64.689 < 0.0000000000000002 ***
## as.factor(hour)9 0.3756329 0.0076227 49.278 < 0.0000000000000002 ***
## as.factor(hour)10 0.2979748 0.0075114 39.669 < 0.0000000000000002 ***
## as.factor(hour)11 0.2975841 0.0074157 40.129 < 0.0000000000000002 ***
## as.factor(hour)12 0.3567110 0.0073915 48.260 < 0.0000000000000002 ***
## as.factor(hour)13 0.3687245 0.0075332 48.946 < 0.0000000000000002 ***
## as.factor(hour)14 0.3830709 0.0075447 50.773 < 0.0000000000000002 ***
## as.factor(hour)15 0.4299939 0.0075008 57.326 < 0.0000000000000002 ***
## as.factor(hour)16 0.4808562 0.0075001 64.113 < 0.0000000000000002 ***
## as.factor(hour)17 0.6077052 0.0074797 81.248 < 0.0000000000000002 ***
## as.factor(hour)18 0.4438296 0.0075275 58.961 < 0.0000000000000002 ***
## as.factor(hour)19 0.3029202 0.0076918 39.382 < 0.0000000000000002 ***
## as.factor(hour)20 0.1732385 0.0077211 22.437 < 0.0000000000000002 ***
## as.factor(hour)21 0.1353487 0.0075538 17.918 < 0.0000000000000002 ***
## as.factor(hour)22 0.0930418 0.0076863 12.105 < 0.0000000000000002 ***
## as.factor(hour)23 0.0491728 0.0076353 6.440 0.0000000001194 ***
## dotw_simple2 -0.0502449 0.0042331 -11.869 < 0.0000000000000002 ***
## dotw_simple3 0.0287986 0.0043908 6.559 0.0000000000543 ***
## dotw_simple4 0.0220058 0.0044176 4.981 0.0000006315001 ***
## dotw_simple5 -0.0205392 0.0043194 -4.755 0.0000019834995 ***
## dotw_simple6 -0.1181527 0.0043247 -27.320 < 0.0000000000000002 ***
## dotw_simple7 -0.0940563 0.0043776 -21.486 < 0.0000000000000002 ***
## Temperature 0.0070358 0.0001312 53.612 < 0.0000000000000002 ***
## Precipitation -1.4704955 0.0384650 -38.229 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7182 on 445464 degrees of freedom
## Multiple R-squared: 0.07979, Adjusted R-squared: 0.07973
## F-statistic: 1246 on 31 and 445464 DF, p-value: < 0.00000000000000022
The model uses Monday as the baseline. Each day-of-week coefficient represents the difference in expected trips per station-hour relative to Monday (e.g., dotw_simple2 = Tuesday).
Weekday Pattern (Tue-Fri):
Weekend Pattern (Sat-Sun):
Hourly Interpretation
| Hour | Coefficient | Interpretation |
|---|---|---|
| 0 (baseline) | 0.000 | trips/hour (midnight) |
| 1 | -0.028 | slightly fewer than midnight |
| … | … | … |
| 6 | +0.158 | morning activity starting |
| 7 | +0.304 | morning rush building |
| 8 | +0.484 | PEAK morning rush |
| 9 | +0.376 | post-rush |
| … | … | … |
| 17 | +0.608 | PEAK evening rush (5 PM!) |
| 18 | +0.444 | evening declining |
| … | … | … |
| 23 | +0.049 | late night minimal |
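To make the dummy coding concrete, here is a worked prediction from Model 1's coefficients (rounded from the summary above) for a Tuesday at 8 AM, 60°F, and no rain:

```r
# Worked example with rounded Model 1 coefficients: expected trips per
# station-hour for a Tuesday (dotw_simple2) at 8 AM, 60 F, no precipitation.
intercept   <- -0.1659
hour8       <-  0.4836
tuesday     <- -0.0502
temp_coef   <-  0.00704
precip_coef <- -1.4705

pred <- intercept + hour8 + tuesday + temp_coef * 60 + precip_coef * 0
round(pred, 2)  # about 0.69 trips per station-hour
```

Summed over a few hundred stations, even these small per-station-hour expectations translate into meaningful system-wide demand.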
model2 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day,
data = train
)
summary(model2)
##
## Call:
## lm(formula = Trip_Count ~ as.factor(hour) + dotw_simple + Temperature +
## Precipitation + lag1Hour + lag3Hours + lag1day, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4519 -0.2470 -0.0946 0.0300 11.1281
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0920058 0.0070480 -13.054 < 0.0000000000000002 ***
## as.factor(hour)1 -0.0132102 0.0066959 -1.973 0.048511 *
## as.factor(hour)2 -0.0024910 0.0067675 -0.368 0.712811
## as.factor(hour)3 -0.0208350 0.0068627 -3.036 0.002398 **
## as.factor(hour)4 0.0064110 0.0067119 0.955 0.339491
## as.factor(hour)5 0.0551125 0.0067626 8.150 0.000000000000000366 ***
## as.factor(hour)6 0.1442512 0.0066896 21.564 < 0.0000000000000002 ***
## as.factor(hour)7 0.2316881 0.0068170 33.987 < 0.0000000000000002 ***
## as.factor(hour)8 0.3405440 0.0065971 51.621 < 0.0000000000000002 ***
## as.factor(hour)9 0.1581361 0.0067347 23.481 < 0.0000000000000002 ***
## as.factor(hour)10 0.1230812 0.0066256 18.577 < 0.0000000000000002 ***
## as.factor(hour)11 0.1445921 0.0065409 22.106 < 0.0000000000000002 ***
## as.factor(hour)12 0.1941731 0.0065186 29.788 < 0.0000000000000002 ***
## as.factor(hour)13 0.2100062 0.0066413 31.621 < 0.0000000000000002 ***
## as.factor(hour)14 0.2141320 0.0066531 32.185 < 0.0000000000000002 ***
## as.factor(hour)15 0.2537284 0.0066161 38.350 < 0.0000000000000002 ***
## as.factor(hour)16 0.2849932 0.0066197 43.052 < 0.0000000000000002 ***
## as.factor(hour)17 0.3844077 0.0066086 58.168 < 0.0000000000000002 ***
## as.factor(hour)18 0.1871858 0.0066621 28.097 < 0.0000000000000002 ***
## as.factor(hour)19 0.1086558 0.0067927 15.996 < 0.0000000000000002 ***
## as.factor(hour)20 0.0290851 0.0068172 4.266 0.000019865510120456 ***
## as.factor(hour)21 0.0540361 0.0066546 8.120 0.000000000000000467 ***
## as.factor(hour)22 0.0416605 0.0067647 6.158 0.000000000735060990 ***
## as.factor(hour)23 0.0245186 0.0067170 3.650 0.000262 ***
## dotw_simple2 -0.0287872 0.0037245 -7.729 0.000000000000010851 ***
## dotw_simple3 0.0106808 0.0038627 2.765 0.005691 **
## dotw_simple4 -0.0035308 0.0038896 -0.908 0.364002
## dotw_simple5 -0.0195895 0.0038010 -5.154 0.000000255476506511 ***
## dotw_simple6 -0.0699725 0.0038089 -18.371 < 0.0000000000000002 ***
## dotw_simple7 -0.0500267 0.0038534 -12.982 < 0.0000000000000002 ***
## Temperature 0.0026179 0.0001162 22.532 < 0.0000000000000002 ***
## Precipitation -0.5949000 0.0339493 -17.523 < 0.0000000000000002 ***
## lag1Hour 0.3835597 0.0014070 272.602 < 0.0000000000000002 ***
## lag3Hours 0.1095022 0.0013938 78.564 < 0.0000000000000002 ***
## lag1day 0.1246783 0.0013173 94.646 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6318 on 445461 degrees of freedom
## Multiple R-squared: 0.288, Adjusted R-squared: 0.2879
## F-statistic: 5299 on 34 and 445461 DF, p-value: < 0.00000000000000022
# Extract R² values
r2_model1 <- summary(model1)$r.squared
r2_model2 <- summary(model2)$r.squared
adj_r2_model1 <- summary(model1)$adj.r.squared
adj_r2_model2 <- summary(model2)$adj.r.squared
# Calculate improvements
r2_improvement <- r2_model2 - r2_model1
adj_r2_improvement <- adj_r2_model2 - adj_r2_model1
# Create comparison dataframe
model_comparison <- data.frame(
Model = c("Model 1 (No Lags)", "Model 2 (With Lags)", "Improvement"),
R_squared = c(r2_model1, r2_model2, r2_improvement),
Adj_R_squared = c(adj_r2_model1, adj_r2_model2, adj_r2_improvement),
Percent_Variance = c(r2_model1 * 100, r2_model2 * 100, r2_improvement * 100)
)
print(model_comparison, row.names = FALSE)
## Model R_squared Adj_R_squared Percent_Variance
## Model 1 (No Lags) 0.0797915 0.07972746 7.97915
## Model 2 (With Lags) 0.2879868 0.28793250 28.79868
## Improvement 0.2081953 0.20820504 20.81953
Adding temporal lag variables increased the model's explanatory power from roughly 8% to roughly 29%, a 21 percentage point improvement. In Model 1, hour-of-day, day-of-week, temperature, and precipitation together explain only about 8% of the variation in ridership. After adding lag1Hour, lag3Hours, and lag1day (demand one hour ago, three hours ago, and one day ago), Model 2 explains about 29% of that variation.
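Because Model 1's predictors are a subset of Model 2's, the two models are nested, and an F-test via `anova()` can formally confirm that the lag terms add explanatory power (in the lab, that would be `anova(model1, model2)` fit on a common set of rows). A minimal sketch on simulated data:

```r
# Nested-model F-test: does adding a lagged predictor significantly
# improve fit? Simulated data stands in for the trip panel.
set.seed(1)
n <- 500
x    <- rnorm(n)                     # stands in for hour/weather terms
lag1 <- rnorm(n)                     # stands in for a temporal lag
y    <- 0.5 * x + 0.4 * lag1 + rnorm(n)

m_small <- lm(y ~ x)
m_large <- lm(y ~ x + lag1)
anova(m_small, m_large)  # small p-value: the lag term belongs in the model
```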
model3 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day +
Med_Inc.x + Percent_Taking_Transit.y + Percent_White.y,
data = train
)
summary(model3)##
## Call:
## lm(formula = Trip_Count ~ as.factor(hour) + dotw_simple + Temperature +
## Precipitation + lag1Hour + lag3Hours + lag1day + Med_Inc.x +
## Percent_Taking_Transit.y + Percent_White.y, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5811 -0.4844 -0.2214 0.2805 10.9327
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 0.79563703104 0.03379946736 23.540
## as.factor(hour)1 0.00993510579 0.04139646645 0.240
## as.factor(hour)2 0.01190820057 0.04742786569 0.251
## as.factor(hour)3 -0.10410141993 0.06283855887 -1.657
## as.factor(hour)4 -0.04622792618 0.04778488492 -0.967
## as.factor(hour)5 0.05490017571 0.03562865250 1.541
## as.factor(hour)6 0.25269447535 0.03095583290 8.163
## as.factor(hour)7 0.33210493879 0.02946309573 11.272
## as.factor(hour)8 0.43832208054 0.02820832586 15.539
## as.factor(hour)9 0.11048040821 0.02861806592 3.861
## as.factor(hour)10 0.07696822769 0.02874259902 2.678
## as.factor(hour)11 0.10789086743 0.02863557779 3.768
## as.factor(hour)12 0.17632310406 0.02836458898 6.216
## as.factor(hour)13 0.18275449047 0.02828971500 6.460
## as.factor(hour)14 0.17001189142 0.02817802033 6.033
## as.factor(hour)15 0.21467263669 0.02799987232 7.667
## as.factor(hour)16 0.28721232519 0.02788905915 10.298
## as.factor(hour)17 0.50038106556 0.02779108127 18.005
## as.factor(hour)18 0.19715253016 0.02817631725 6.997
## as.factor(hour)19 0.09592941590 0.02885662899 3.324
## as.factor(hour)20 0.01814080506 0.02995163252 0.606
## as.factor(hour)21 0.04087870931 0.02994088872 1.365
## as.factor(hour)22 0.06734480687 0.03139099277 2.145
## as.factor(hour)23 0.03151691816 0.03285351493 0.959
## dotw_simple2 -0.03742796491 0.01165377363 -3.212
## dotw_simple3 0.00299430719 0.01133778696 0.264
## dotw_simple4 -0.03226296781 0.01138738385 -2.833
## dotw_simple5 -0.06672081195 0.01138243814 -5.862
## dotw_simple6 -0.08906181841 0.01207393451 -7.376
## dotw_simple7 -0.05334796403 0.01221079293 -4.369
## Temperature 0.00297627230 0.00039030472 7.626
## Precipitation -2.38881308670 0.16131889972 -14.808
## lag1Hour 0.24939883336 0.00274630369 90.813
## lag3Hours 0.06549188253 0.00290032270 22.581
## lag1day 0.10616171709 0.00286484284 37.057
## Med_Inc.x 0.00000002638 0.00000010243 0.258
## Percent_Taking_Transit.y -0.00177898711 0.00037723057 -4.716
## Percent_White.y 0.00247318709 0.00019377296 12.763
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## as.factor(hour)1 0.810332
## as.factor(hour)2 0.801753
## as.factor(hour)3 0.097594 .
## as.factor(hour)4 0.333338
## as.factor(hour)5 0.123345
## as.factor(hour)6 0.000000000000000331 ***
## as.factor(hour)7 < 0.0000000000000002 ***
## as.factor(hour)8 < 0.0000000000000002 ***
## as.factor(hour)9 0.000113 ***
## as.factor(hour)10 0.007411 **
## as.factor(hour)11 0.000165 ***
## as.factor(hour)12 0.000000000511222058 ***
## as.factor(hour)13 0.000000000105167362 ***
## as.factor(hour)14 0.000000001610820730 ***
## as.factor(hour)15 0.000000000000017795 ***
## as.factor(hour)16 < 0.0000000000000002 ***
## as.factor(hour)17 < 0.0000000000000002 ***
## as.factor(hour)18 0.000000000002631454 ***
## as.factor(hour)19 0.000887 ***
## as.factor(hour)20 0.544735
## as.factor(hour)21 0.172158
## as.factor(hour)22 0.031927 *
## as.factor(hour)23 0.337402
## dotw_simple2 0.001320 **
## dotw_simple3 0.791704
## dotw_simple4 0.004609 **
## dotw_simple5 0.000000004596663688 ***
## dotw_simple6 0.000000000000164067 ***
## dotw_simple7 0.000012500376353510 ***
## Temperature 0.000000000000024547 ***
## Precipitation < 0.0000000000000002 ***
## lag1Hour < 0.0000000000000002 ***
## lag3Hours < 0.0000000000000002 ***
## lag1day < 0.0000000000000002 ***
## Med_Inc.x 0.796729
## Percent_Taking_Transit.y 0.000002409906833269 ***
## Percent_White.y < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8735 on 89219 degrees of freedom
## (356239 observations deleted due to missingness)
## Multiple R-squared: 0.182, Adjusted R-squared: 0.1817
## F-statistic: 536.6 on 37 and 89219 DF, p-value: < 0.00000000000000022
model4 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day +
Med_Inc.x + Percent_Taking_Transit.y + Percent_White.y +
as.factor(start_station),
data = train
)
# Summary too long with all station dummies, just show key metrics
cat("Model 4 R-squared:", summary(model4)$r.squared, "\n")
## Model 4 R-squared: 0.2071224
## Model 4 Adj R-squared: 0.2047524
Station fixed effects capture baseline differences in demand across Indego stations that aren’t explained by Census tract demographics, weather, or temporal patterns.
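One way to see what the station dummies are doing: the fixed-effects (dummy-variable) slope is numerically identical to the slope from a regression on within-station demeaned variables, the "within" transformation. A small sketch on simulated data (not the lab's panel):

```r
# Sketch: station fixed effects via dummies are equivalent to demeaning
# each variable within station (the "within" transformation).
set.seed(2)
d <- data.frame(station = rep(1:5, each = 50), x = rnorm(250))
d$y <- 2 * d$x + d$station + rnorm(250)  # station-level intercept shifts

# Dummy-variable estimate of the x slope:
b_dummy <- coef(lm(y ~ x + factor(station), data = d))["x"]

# Within-transformation estimate: subtract station means, then regress.
d$y_dm <- d$y - ave(d$y, d$station)
d$x_dm <- d$x - ave(d$x, d$station)
b_within <- coef(lm(y_dm ~ x_dm - 1, data = d))["x_dm"]

all.equal(unname(b_dummy), unname(b_within))  # TRUE
```

With ~200 stations the demeaning form is much cheaper to fit than ~200 dummy columns, which is also what dedicated fixed-effects packages exploit.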
model5 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day + rush_hour +
Med_Inc.x + Percent_Taking_Transit.y + Percent_White.y +
as.factor(start_station) +
rush_hour * weekend, # Rush hour effects different on weekends
data = train
)
cat("Model 5 R-squared:", summary(model5)$r.squared, "\n")
## Model 5 R-squared: 0.2112196
## Model 5 Adj R-squared: 0.208853
# Get predictions on test set
# Create day of week factor with treatment (dummy) coding
test <- test %>%
mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))
# Set contrasts to treatment coding (dummy variables)
contrasts(test$dotw_simple) <- contr.treatment(7)
test <- test %>%
mutate(
pred1 = predict(model1, newdata = test),
pred2 = predict(model2, newdata = test),
pred3 = predict(model3, newdata = test),
pred4 = predict(model4, newdata = test),
pred5 = predict(model5, newdata = test)
)
# Calculate MAE for each model
mae_results <- data.frame(
Model = c(
"1. Time + Weather",
"2. + Temporal Lags",
"3. + Demographics",
"4. + Station FE",
"5. + Rush Hour Interaction"
),
MAE = c(
mean(abs(test$Trip_Count - test$pred1), na.rm = TRUE),
mean(abs(test$Trip_Count - test$pred2), na.rm = TRUE),
mean(abs(test$Trip_Count - test$pred3), na.rm = TRUE),
mean(abs(test$Trip_Count - test$pred4), na.rm = TRUE),
mean(abs(test$Trip_Count - test$pred5), na.rm = TRUE)
)
)
kable(mae_results,
digits = 2,
caption = "Mean Absolute Error by Model (Test Set)",
col.names = c("Model", "MAE (trips)")) %>%
kable_styling(bootstrap_options = c("striped", "hover"))

| Model | MAE (trips) |
|---|---|
| 1. Time + Weather | 0.71 |
| 2. + Temporal Lags | 0.58 |
| 3. + Demographics | 0.84 |
| 4. + Station FE | 0.83 |
| 5. + Rush Hour Interaction | 0.83 |
ggplot(mae_results, aes(x = reorder(Model, -MAE), y = MAE)) +
geom_col(fill = "#E1DCF5", alpha = 0.8) +
geom_text(aes(label = round(MAE, 2)), vjust = -0.5) +
labs(
title = "Model Performance Comparison",
subtitle = "Lower MAE = Better Predictions",
x = "Model",
y = "Mean Absolute Error (trips)"
) +
plotTheme +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The temporal lag variables provided the largest improvement in predictive accuracy. Moving from Model 1 (time and weather only) to Model 2 (adding lag1Hour, lag3Hours, lag1day) reduced the MAE from 0.71 to 0.58 trips, a solid improvement. The lags capture temporal persistence: a station that was busy one hour ago is likely to remain busy in the next hour. Interestingly, adding demographic variables, station fixed effects, rush-hour interactions, and later the recent-rain, distance-to-university, and station-type clustering features actually increased the MAE, suggesting that these additions led to overfitting on the training data.
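MAE weights all errors equally, while RMSE penalizes large peak-hour misses more heavily, so reporting both can be informative when demand is spiky. A small self-contained sketch (the numbers are illustrative, not from the lab data):

```r
# Helper metrics: MAE treats all errors equally; RMSE punishes big misses.
mae  <- function(obs, pred) mean(abs(obs - pred), na.rm = TRUE)
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2, na.rm = TRUE))

obs  <- c(0, 1, 2, 20)   # one demand spike
pred <- c(0, 1, 2, 8)    # model under-predicts the spike
mae(obs, pred)   # 3
rmse(obs, pred)  # 6
```

The gap between the two metrics here comes entirely from the single under-predicted spike, which is exactly the failure mode that matters for rebalancing.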
Let’s use our best model (Model 2, temporal lag) for error analysis.
test <- test %>%
mutate(
error = Trip_Count - pred2,
abs_error = abs(error),
time_of_day = case_when(
hour < 7 ~ "Overnight",
hour >= 7 & hour < 10 ~ "AM Rush",
hour >= 10 & hour < 15 ~ "Mid-Day",
hour >= 15 & hour <= 18 ~ "PM Rush",
hour > 18 ~ "Evening"
)
)
# Scatter plot by time and day type
ggplot(test, aes(x = Trip_Count, y = pred2)) +
geom_point(alpha = 0.2, color = "#6849BA") +
geom_abline(slope = 1, intercept = 0, color = "red", linewidth = 1) +
geom_smooth(method = "lm", se = FALSE, color = "darkgreen") +
facet_grid(weekend ~ time_of_day) +
labs(
title = "Observed vs. Predicted Bike Trips",
subtitle = "Model 2 performance by time period",
x = "Observed Trips",
y = "Predicted Trips",
caption = "Red line = perfect predictions; Green line = actual model fit"
) +
plotTheme
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 5592 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 5592 rows containing missing values or values outside the scale range
## (`geom_point()`).
The observed vs. predicted plots reveal systematic patterns in Model 2’s errors. The model performs best during overnight hours, where demand is consistently near zero and predictions tightly cluster along the perfect-prediction line. Performance degrades substantially during peak periods—particularly PM Rush on weekdays—where the green regression line diverges sharply from the red diagonal. The model exhibits a clear “ceiling effect,” rarely predicting more than 6-8 trips per station-hour even when actual demand reaches 20+ trips. This systematic underprediction of high-demand observations is operationally concerning: out-of-stock risk is highest precisely when demand spikes, and those are the predictions where accuracy matters most for rebalancing decisions. Weekend predictions (bottom row) show less extreme divergence, likely because weekend demand is lower and more evenly distributed throughout the day.
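One alternative worth noting for the ceiling effect: Trip_Count is a non-negative, right-skewed count, so a count model such as a Poisson GLM is a natural candidate alongside OLS; its predictions can never go negative and the log link handles the skew. A sketch on simulated counts (not the lab's data):

```r
# Sketch: a Poisson GLM as an alternative to OLS for count outcomes.
# Simulated counts stand in for the trip panel.
set.seed(3)
n <- 1000
x <- rnorm(n)                                 # stands in for the predictors
y <- rpois(n, lambda = exp(0.2 + 0.8 * x))    # Poisson counts, log link

m_pois <- glm(y ~ x, family = poisson)
preds  <- predict(m_pois, type = "response")  # predictions on the count scale
all(preds >= 0)  # TRUE: no negative predicted demand
```

Whether a Poisson (or negative binomial, if the counts are overdispersed) fit actually reduces the peak-hour underprediction here would need to be checked on the Indego panel itself.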
Are prediction errors clustered in certain parts of Philadelphia?
# Calculate station errors
station_errors <- test %>%
filter(!is.na(pred2)) %>%
group_by(start_station, start_lat.x, start_lon.y) %>%
summarize(
MAE = mean(abs(Trip_Count - pred2), na.rm = TRUE),
avg_demand = mean(Trip_Count, na.rm = TRUE),
total_trips = sum(Trip_Count),
.groups = "drop"
) %>%
filter(!is.na(start_lat.x), !is.na(start_lon.y))
# Map 1: Prediction Errors
p1 <- ggplot() +
geom_sf(data = philly_census, fill = "#f0f0f0", color = "white", linewidth = 0.3) +
geom_point(
data = station_errors,
aes(x = start_lon.y, y = start_lat.x, size = total_trips, color = MAE),
alpha = 0.8
) +
scale_color_gradientn(
colors = rev(palette5), # Use your palette
name = "Mean Absolute\nError (trips)",
breaks = seq(0, max(station_errors$MAE, na.rm = TRUE), length.out = 4),
labels = function(x) round(x, 2)
) +
scale_size_continuous(
name = "Total Trips",
range = c(2, 8),
breaks = c(100, 500, 1000, 2000),
labels = scales::comma
) +
labs(
title = "Model Prediction Errors by Station",
subtitle = "Larger points = more trips; Darker = higher error"
) +
coord_sf(xlim = c(-75.25, -75.00), ylim = c(39.88, 40.08)) +
mapTheme
# Map 2: Average Demand
p2 <- ggplot() +
geom_sf(data = philly_census, fill = "#f0f0f0", color = "white", linewidth = 0.3) +
geom_point(
data = station_errors,
aes(x = start_lon.y, y = start_lat.x, size = total_trips, color = avg_demand),
alpha = 0.8
) +
scale_color_gradientn(
colors = c("#08519c", "#3182bd", "#6baed6", "#bdd7e7", "#eff3ff"),
name = "Average\nDemand\n(trips/hour)",
breaks = seq(0, max(station_errors$avg_demand, na.rm = TRUE), length.out = 4),
labels = function(x) round(x, 2)
) +
scale_size_continuous(
name = "Total Trips",
range = c(2, 8),
breaks = c(100, 500, 1000, 2000),
labels = scales::comma
) +
labs(
title = "Average Demand by Station",
subtitle = "Darker blue = higher ridership concentration"
) +
coord_sf(xlim = c(-75.25, -75.00), ylim = c(39.88, 40.08)) +
mapTheme
# Combine with patchwork
p1 + p2 +
plot_annotation(
title = "Model 2 Performance: Spatial Distribution of Errors and Demand",
subtitle = "Philadelphia Indego Bike Share - 2024",
caption = "Source: Indego bike share trip data; Model 2 predictions (temporal lags)",
theme = theme(
plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, hjust = 0.5),
plot.caption = element_text(size = 8)
)
)
The prediction errors cluster spatially, with Center City and University City showing the highest MAE concentrations. This pattern suggests that the model struggles with high-volume stations that experience rapid demand fluctuations. Neighborhoods like East Passyunk, Point Breeze, Queen Village, Market East, and Northern Liberties/Fishtown show lower and more stable demand. The spatial concentration of errors in the urban core is operationally significant because these are the stations most likely to experience stockouts and where rebalancing decisions are most consequential.
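The visual clustering claim can be tested formally with Moran's I on the station-level MAE (the spdep package's `moran.test` is the standard tool). Below is a self-contained base-R sketch on simulated station coordinates; in the lab, the station coordinates and MAE from `station_errors` above would be swapped in:

```r
# Sketch: Moran's I for spatial clustering of errors, computed by hand with
# a row-standardized k-nearest-neighbor weights matrix. Simulated stations.
set.seed(4)
n <- 100
coords <- cbind(runif(n, -75.25, -75.00), runif(n, 39.88, 40.08))
err <- coords[, 1] + rnorm(n, sd = 0.02)  # errors that trend east-west

k <- 5
D <- as.matrix(dist(coords))              # pairwise station distances
W <- matrix(0, n, n)
for (i in 1:n) W[i, order(D[i, ])[2:(k + 1)]] <- 1 / k  # skip self at [1]

z <- err - mean(err)
moran_I <- (n / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
round(moran_I, 2)  # well above the null expectation of about -1/(n - 1)
```

A significantly positive Moran's I would confirm that neighboring stations share similar error levels, i.e., that the model is missing a spatial process rather than making independent mistakes.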
When are we most wrong?
# MAE by time of day and day type
temporal_errors <- test %>%
group_by(time_of_day, weekend) %>%
summarize(
MAE = mean(abs_error, na.rm = TRUE),
.groups = "drop"
) %>%
mutate(day_type = ifelse(weekend == 1, "Weekend", "Weekday"))
ggplot(temporal_errors, aes(x = time_of_day, y = MAE, fill = day_type)) +
geom_col(position = "dodge") +
scale_fill_manual(values = c("Weekday" = "#08519c", "Weekend" = "#6baed6")) +
labs(
title = "Prediction Errors by Time Period",
subtitle = "When is the model struggling most?",
x = "Time of Day",
y = "Mean Absolute Error (trips)",
fill = "Day Type"
) +
plotTheme +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Are prediction errors related to neighborhood characteristics?
# Join demographic data to station errors
station_errors_demo <- station_errors %>%
left_join(
station_attributes %>% dplyr::select(start_station, Med_Inc, Percent_Taking_Transit, Percent_White),
by = "start_station"
) %>%
filter(!is.na(Med_Inc))
# Create plots
p1 <- ggplot(station_errors_demo, aes(x = Med_Inc, y = MAE)) +
geom_point(alpha = 0.5, color = "#E1DCF5") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
scale_x_continuous(labels = scales::dollar) +
labs(title = "Errors vs. Median Income", x = "Median Income", y = "MAE") +
plotTheme
p2 <- ggplot(station_errors_demo, aes(x = Percent_Taking_Transit, y = MAE)) +
geom_point(alpha = 0.5, color = "#E1DCF5") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Errors vs. Transit Usage", x = "Percent Taking Transit", y = "MAE") +
plotTheme
p3 <- ggplot(station_errors_demo, aes(x = Percent_White, y = MAE)) +
geom_point(alpha = 0.5, color = "#E1DCF5") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Errors vs. Race", x = "Percent White", y = "MAE") +
plotTheme
grid.arrange(p1, p2, p3, ncol = 2)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
I engineered three new features targeting specific weaknesses in the baseline model:
1. Recent Rain (Precipitation Lag) The baseline model includes current precipitation, but people don’t immediately resume biking when rain stops. Obstacles like wet streets and wet seats influence rider choice. This “hangover effect” motivated creating a binary indicator for whether precipitation occurred in the previous 3 hours. The visualization confirms that ridership remains suppressed even after rain stops.
2. Distance to Nearest University The spatial error analysis showed elevated errors near University City, suggesting campus-related demand patterns the model wasn’t capturing. Universities generate distinctive ridership: class schedules create predictable hourly peaks, academic calendars can shift demand seasonally, and student demographics differ from typical commuters. I calculated distance from each station to the University of Pennsylvania, Drexel University, and Temple University, flagging stations within 0.5 miles as “near university.” Satellite campuses, such as Temple University’s Center City campus and Saint Joseph’s campus in University City, are excluded.
3. Station Type Clustering Not all stations serve the same purpose. K-means clustering on hourly demand profiles revealed three distinct station types: Commuter Hubs (strong AM/PM peaks), Residential stations (more distributed demand), and Low Activity stations (consistently low ridership). This categorical feature allows the model to learn different demand dynamics for different station types rather than treating all stations identically.
study_panel_complete <- study_panel_complete %>%
arrange(start_station, interval60) %>%
group_by(start_station) %>%
mutate(
# Rolling 3-hour precipitation sum
precip_last_3hrs = lag(Precipitation, 1) + lag(Precipitation, 2) + lag(Precipitation, 3),
# Binary: any precipitation in last 3 hours?
recent_rain = ifelse(precip_last_3hrs > 0, 1, 0),
recent_rain = replace_na(recent_rain, 0)
) %>%
ungroup()
# Add to train/test
train <- study_panel_complete %>%
filter(week < 10, start_station %in% common_stations)
test <- study_panel_complete %>%
filter(week >= 10, start_station %in% common_stations)
# Recreate dotw_simple for both
train <- train %>%
mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))
contrasts(train$dotw_simple) <- contr.treatment(7)
test <- test %>%
mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))
contrasts(test$dotw_simple) <- contr.treatment(7)
# Visualize the impact
train %>%
filter(!is.na(recent_rain)) %>%
group_by(hour, recent_rain) %>%
summarize(avg_trips = mean(Trip_Count), .groups = "drop") %>%
ggplot(aes(x = hour, y = avg_trips, color = factor(recent_rain))) +
geom_line(linewidth = 1.2) +
scale_color_manual(
values = c("0" = "#3182bd", "1" = "grey60"),
labels = c("No Recent Rain", "Rain in Last 3 Hours"),
name = ""
) +
labs(
title = "Impact of Recent Precipitation on Demand",
subtitle = "Ridership remains suppressed even after rain stops",
x = "Hour of Day",
y = "Average Trips per Station"
) +
plotTheme +
theme(legend.position = "bottom")
# Philadelphia university coordinates
universities <- data.frame(
name = c("Penn", "Drexel", "Temple"),
lon = c(-75.1932, -75.1897, -75.1492),
lat = c(39.9522, 39.9566, 39.9812)
)%>%
st_as_sf(coords = c("lon", "lat"), crs = 4326)
# Get unique stations as sf object
stations_sf_dist <- study_panel_complete %>%
distinct(start_station, start_lat.x, start_lon.x) %>%
filter(!is.na(start_lat.x), !is.na(start_lon.x)) %>%
st_as_sf(coords = c("start_lon.x", "start_lat.x"), crs = 4326)
dist_to_penn <- st_distance(stations_sf_dist, universities[1, ]) / 1609.34 # meters to miles
dist_to_drexel <- st_distance(stations_sf_dist, universities[2, ]) / 1609.34
dist_to_temple <- st_distance(stations_sf_dist, universities[3, ]) / 1609.34
# Add distances to dataframe
station_uni_dist <- stations_sf_dist %>%
st_drop_geometry() %>%
bind_cols(
dist_to_penn = as.numeric(dist_to_penn),
dist_to_drexel = as.numeric(dist_to_drexel),
dist_to_temple = as.numeric(dist_to_temple)
) %>%
mutate(
dist_to_nearest_uni = pmin(dist_to_penn, dist_to_drexel, dist_to_temple),
near_university = ifelse(dist_to_nearest_uni < 0.5, 1, 0) # Within 0.5 miles
)
# Need to get the coordinates back for mapping
station_uni_dist <- study_panel_complete %>%
distinct(start_station, start_lat.x, start_lon.x) %>%
filter(!is.na(start_lat.x), !is.na(start_lon.x)) %>%
left_join(station_uni_dist, by = "start_station")
cat("Stations with distance calculated:", nrow(station_uni_dist), "\n\n")
## Stations with distance calculated: 233
# Join back to datasets
train <- train %>%
left_join(station_uni_dist %>% select(start_station, dist_to_nearest_uni, near_university),
by = "start_station")
test <- test %>%
left_join(station_uni_dist %>% select(start_station, dist_to_nearest_uni, near_university),
by = "start_station")
# Extract university coordinates for plotting
uni_coords <- universities %>%
st_coordinates() %>%
as.data.frame() %>%
bind_cols(name = c("Penn", "Drexel", "Temple"))
# Map university proximity
ggplot() +
geom_sf(data = philly_census, fill = "#f0f0f0", color = "white", linewidth = 0.3) +
# Universities
geom_point(
data = uni_coords,
aes(x = X, y = Y),
color = "#E1DCF5", size = 8, shape = 18 # Diamond shape
) +
geom_text(
data = uni_coords,
aes(x = X, y = Y, label = name),
vjust = -1.5, size = 3.5, fontface = "bold"
) +
# Stations colored by university proximity
geom_point(
data = station_uni_dist,
aes(x = start_lon.x, y = start_lat.x, color = dist_to_nearest_uni, size = near_university),
alpha = 0.7
) +
scale_color_gradientn(
colors = rev(palette5),
name = "Distance to\nNearest Uni\n(miles)",
limits = c(0, 5)
) +
scale_size_continuous(range = c(1.5, 4), guide = "none") +
labs(
title = "Indego Stations by University Proximity",
subtitle = "Larger points = within 0.5 miles of university"
) +
coord_sf(xlim = c(-75.25, -75.00), ylim = c(39.88, 40.08)) +
mapTheme
# Compare demand near vs far from universities
train %>%
group_by(hour, near_university, dotw) %>%
summarize(avg_trips = mean(Trip_Count), .groups = "drop") %>%
filter(dotw %in% c("Mon", "Tue", "Wed", "Thu", "Fri")) %>%
ggplot(aes(x = hour, y = avg_trips, color = factor(near_university))) +
geom_line(linewidth = 1.2) +
scale_color_manual(
values = c("0" = "grey60", "1" = "#6849BA"),
labels = c("Away from Universities", "Near Universities (< 0.5 mi)"),
name = ""
) +
labs(
title = "Ridership Patterns: University vs Non-University Stations",
subtitle = "Weekdays only - University stations show distinct peaks",
x = "Hour of Day",
y = "Average Trips per Station"
) +
plotTheme +
theme(legend.position = "bottom")
# Create hourly demand profiles for each station
station_profiles <- train %>%
group_by(start_station, hour) %>%
summarize(
avg_demand = mean(Trip_Count),
weekend_demand = mean(Trip_Count[weekend == 1]),
weekday_demand = mean(Trip_Count[weekend == 0]),
.groups = "drop"
) %>%
pivot_wider(
id_cols = start_station,
names_from = hour,
values_from = avg_demand,
names_prefix = "hour_"
)
# K-means clustering (3 types)
set.seed(123)
station_clusters <- station_profiles %>%
select(starts_with("hour_")) %>%
scale() %>%
kmeans(centers = 3, nstart = 25)
# Add cluster labels
station_profiles$station_type <- station_clusters$cluster
# Interpret clusters
cluster_summary <- train %>%
left_join(
station_profiles %>% select(start_station, station_type),
by = "start_station"
) %>%
group_by(station_type, hour) %>%
summarize(avg_demand = mean(Trip_Count), .groups = "drop")
#Visualize cluster patterns
ggplot(cluster_summary, aes(x = hour, y = avg_demand, color = factor(station_type))) +
geom_line(linewidth = 1.2) +
scale_color_manual(
values = c("1" = palette5[5], "2" = palette5[3], "3" = palette5[1]),
name = "Station Type",
labels = c("Type 1", "Type 2", "Type 3")
) +
labs(
title = "Station Types Based on Temporal Demand Patterns",
subtitle = "K-means clustering reveals distinct usage profiles",
x = "Hour of Day",
y = "Average Trips per Station"
) +
plotTheme +
theme(legend.position = "bottom")
# Label the clusters based on patterns
station_profiles <- station_profiles %>%
mutate(
station_type_label = case_when(
station_type == 1 ~ "Commuter Hub", # High AM/PM peaks
station_type == 2 ~ "Residential", # Moderate, spread out
station_type == 3 ~ "Low Activity" # Consistently low
)
)
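Choosing three clusters is a judgment call. A quick elbow plot can sanity-check it (a sketch reusing `station_profiles` from above; the range k = 1..8 is an arbitrary choice for illustration):

```r
# Elbow method: total within-cluster sum of squares for k = 1..8
scaled_profiles <- station_profiles %>%
  select(starts_with("hour_")) %>%
  scale()
wss <- sapply(1:8, function(k) {
  set.seed(123)
  kmeans(scaled_profiles, centers = k, nstart = 25)$tot.withinss
})
# Look for the "elbow" where additional clusters stop paying off
ggplot(data.frame(k = 1:8, wss = wss), aes(x = k, y = wss)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Elbow Plot for Station Clustering",
    x = "Number of Clusters (k)",
    y = "Total Within-Cluster Sum of Squares"
  ) +
  plotTheme
```

If the curve flattens noticeably after k = 3, the three-type interpretation used below is reasonable; otherwise the cluster labels should be revisited.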
# Map station types
station_map_data <- train %>%
distinct(start_station, start_lat.x, start_lon.x) %>%
left_join(station_profiles %>% select(start_station, station_type_label),
by = "start_station")
ggplot() +
geom_sf(data = philly_census, fill = "#f0f0f0", color = "white", linewidth = 0.3) +
geom_point(
data = station_map_data,
aes(x = start_lon.x, y = start_lat.x, color = station_type_label),
size = 3, alpha = 0.7
) +
scale_color_manual(
values = c(
"Commuter Hub" = palette5[5],
"Residential" = palette5[3],
"Low Activity" = palette5[1]
),
name = "Station Type"
) +
labs(
title = "Station Types Across Philadelphia",
subtitle = "Clustered by temporal demand patterns"
) +
coord_sf(xlim = c(-75.25, -75.00), ylim = c(39.88, 40.08)) +
mapTheme
Now we’ll test whether our three engineered features actually improve predictions.
We add our three new features to Model 5 (our most complete OLS specification so far):
# Add to train/test
train <- study_panel_complete %>%
filter(week < 10, start_station %in% common_stations)
test <- study_panel_complete %>%
filter(week >= 10, start_station %in% common_stations)
# Recreate dotw_simple for both
train <- train %>%
mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))
contrasts(train$dotw_simple) <- contr.treatment(7)
test <- test %>%
mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))
contrasts(test$dotw_simple) <- contr.treatment(7)
# ADD THIS HERE - Re-join university data and station clusters
train <- train %>%
left_join(station_uni_dist %>% dplyr::select(start_station, dist_to_nearest_uni, near_university),
by = "start_station") %>%
left_join(station_profiles %>% dplyr::select(start_station, station_type_label),
by = "start_station")
test <- test %>%
left_join(station_uni_dist %>% dplyr::select(start_station, dist_to_nearest_uni, near_university),
by = "start_station") %>%
left_join(station_profiles %>% dplyr::select(start_station, station_type_label),
by = "start_station")
model6 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day + rush_hour +
Med_Inc.x + Percent_Taking_Transit.x + Percent_White.x +
as.factor(start_station) +
rush_hour * weekend +
recent_rain +
near_university +
station_type_label,
data = train
)
cat("Model 6 Adj R-squared:", round(summary(model6)$adj.r.squared, 4), "\n")
## Model 6 Adj R-squared: 0.2135
# Calculate MAE on test set
test$pred6 <- predict(model6, newdata = test)
mae_model6 <- mean(abs(test$Trip_Count - test$pred6), na.rm = TRUE)
cat("Model 6 MAE:", round(mae_model6, 3), "\n")
## Model 6 MAE: 0.832
##
## Comparison to Model 2 (best baseline):
## Model 2 MAE: 0.577
## Model 6 MAE: 0.832
## Difference: 0.255 (positive = worse)
Despite these theoretically motivated features, Model 6 (OLS with the new features) achieved a test MAE of 0.832, worse than the simple temporal lag model (0.577). This suggests overfitting: the new features may capture patterns present in the training period that do not generalize to the test period.
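One way to check the overfitting interpretation is to compare Model 6's in-sample and out-of-sample errors; a large gap would support the diagnosis. A minimal sketch, reusing the `train`, `test`, and `model6` objects defined above:

```r
# In-sample vs out-of-sample MAE for Model 6
# A large test-minus-train gap is consistent with overfitting
train_mae6 <- mean(abs(train$Trip_Count - predict(model6, newdata = train)),
                   na.rm = TRUE)
test_mae6 <- mean(abs(test$Trip_Count - predict(model6, newdata = test)),
                  na.rm = TRUE)
cat("Model 6 train MAE:", round(train_mae6, 3), "\n")
cat("Model 6 test MAE: ", round(test_mae6, 3), "\n")
cat("Gap (test - train):", round(test_mae6 - train_mae6, 3), "\n")
```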
# Poisson regression - better for count data
model7_poisson <- glm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day + rush_hour +
Med_Inc.x + Percent_Taking_Transit.x + Percent_White.x +
as.factor(start_station) +
rush_hour * weekend +
recent_rain +
near_university +
station_type_label,
data = train,
family = poisson(link = "log")
)
# Check for overdispersion
dispersion <- model7_poisson$deviance / model7_poisson$df.residual
cat("Poisson Dispersion Ratio:", round(dispersion, 2), "\n")
## Poisson Dispersion Ratio: 0.34
## (Should be ~1; if >> 1, data is overdispersed)
The Poisson model (MAE = 0.824) performed better than Model 6 but still worse than Model 2. Its dispersion ratio of 0.34 indicates underdispersion: trip counts vary less than a Poisson distribution would predict. This suggests bike share demand is fairly regular and predictable at the hourly level, likely because strong temporal patterns (commute times, weather effects) constrain the variance.
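The deviance-based ratio above is one estimate of dispersion; the Pearson statistic gives another, and the two should roughly agree. A short sketch using base R's `residuals()`:

```r
# Pearson-based dispersion estimate for the fitted Poisson model
# (sum of squared Pearson residuals over residual degrees of freedom)
pearson_disp <- sum(residuals(model7_poisson, type = "pearson")^2) /
  model7_poisson$df.residual
cat("Pearson dispersion ratio:", round(pearson_disp, 2), "\n")
```

A quasipoisson family would estimate this dispersion directly, but with underdispersion the Poisson standard errors are merely conservative, so the point predictions used for MAE are unaffected.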
test <- test %>%
mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))
contrasts(test$dotw_simple) <- contr.treatment(7)
test <- test %>%
mutate(
pred1 = predict(model1, newdata = test),
pred2 = predict(model2, newdata = test),
pred3 = predict(model3, newdata = test),
pred4 = predict(model4, newdata = test),
pred5 = predict(model5, newdata = test),
pred6 = predict(model6, newdata = test),
pred7_poisson = predict(model7_poisson, newdata = test, type = "response")
)
mae_results <- data.frame(
Model = c(
"1. Time + Weather",
"2. + Temporal Lags",
"3. + Demographics",
"4. + Station FE",
"5. + Rush Hour Interaction",
"6. + New Features (OLS)",
"7. Poisson"
),
MAE = c(
mean(abs(test$Trip_Count - test$pred1), na.rm = TRUE),
mean(abs(test$Trip_Count - test$pred2), na.rm = TRUE),
mean(abs(test$Trip_Count - test$pred3), na.rm = TRUE),
mean(abs(test$Trip_Count - test$pred4), na.rm = TRUE),
mean(abs(test$Trip_Count - test$pred5), na.rm = TRUE),
mean(abs(test$Trip_Count - test$pred6), na.rm = TRUE),
mean(abs(test$Trip_Count - test$pred7_poisson), na.rm = TRUE)
)
)
kable(mae_results,
digits = 3,
caption = "Mean Absolute Error by Model (Test Set)",
col.names = c("Model", "MAE (trips)")) %>%
kable_styling(bootstrap_options = c("striped", "hover"))

Mean Absolute Error by Model (Test Set)

| Model                      | MAE (trips) |
|----------------------------|-------------|
| 1. Time + Weather          | 0.709       |
| 2. + Temporal Lags         | 0.577       |
| 3. + Demographics          | 0.844       |
| 4. + Station FE            | 0.831       |
| 5. + Rush Hour Interaction | 0.827       |
| 6. + New Features (OLS)    | 0.832       |
| 7. Poisson                 | 0.824       |
Despite the inclusion of new spatial features, Model 2, which adds the temporal lags, still yields the lowest MAE (0.577), making it the best model in the analysis. The higher MAE of the feature-rich models is consistent with overfitting.
Let’s review the implications of these findings.
# Make sure pred2 exists for all diagnostics
test$pred2 <- predict(model2, newdata = test)
# What's the distribution of actual trip counts?
cat("Trip Count Summary:\n")
print(summary(test$Trip_Count))
## Trip Count Summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.5998 1.0000 28.0000
##
##
## Trip count distribution:
table(cut(test$Trip_Count, breaks = c(-1, 0, 2, 5, 10, 15, Inf),
labels = c("0", "1-2", "3-5", "6-10", "11-15", "16+")))
##
## 0 1-2 3-5 6-10 11-15 16+
## 1399403 499960 121034 18646 1036 69
# What % of station-hours have 0 trips?
cat("\n\nPercent of station-hours with 0 trips:",
round(mean(test$Trip_Count == 0) * 100, 1), "%\n")
##
##
## Percent of station-hours with 0 trips: 68.6 %
##
##
## Average trips by station type:
test %>%
group_by(station_type_label) %>%
summarize(
avg_trips = mean(Trip_Count),
median_trips = median(Trip_Count),
max_trips = max(Trip_Count)
) %>%
print()
## # A tibble: 3 × 4
## station_type_label avg_trips median_trips max_trips
## <chr> <dbl> <dbl> <int>
## 1 Commuter Hub 0.831 0 23
## 2 Low Activity 0.292 0 28
## 3 Residential 1.49 1 24
##
##
## MAE by demand level:
test %>%
mutate(
demand_level = case_when(
Trip_Count == 0 ~ "Zero trips",
Trip_Count <= 2 ~ "Low (1-2)",
Trip_Count <= 5 ~ "Medium (3-5)",
TRUE ~ "High (6+)"
),
# pred2 already exists on test (computed above); no need to re-predict
error = abs(Trip_Count - pred2)
) %>%
group_by(demand_level) %>%
summarize(
n = n(),
avg_actual = mean(Trip_Count),
MAE = mean(error, na.rm = TRUE),
relative_error_pct = mean(error / pmax(Trip_Count, 0.5)) * 100
)
##
##
## MAE by time period:
test %>%
mutate(error = abs(Trip_Count - pred2)) %>%
group_by(rush_hour) %>%
summarize(
avg_trips = mean(Trip_Count),
MAE = mean(error, na.rm = TRUE) # lag features produce NAs; drop them
) %>%
print()## # A tibble: 2 × 3
## rush_hour avg_trips MAE
## <dbl> <dbl> <dbl>
## 1 0 0.472 NA
## 2 1 0.988 NA
**Operational Implications** This model achieves a best-case MAE of 0.577 trips per station-hour using only temporal lag features (Model 2). At first glance this seems operationally useful, but context is critical: 68.6% of station-hours have zero trips, and the median is also zero. With a mean of just 0.6 trips per station-hour, most predictions are for very low-demand situations. When we examine MAE by demand level, the model performs well for quiet periods (MAE = 0.35 for zero-trip hours) but struggles dramatically during high-demand periods (MAE = 4.97 for stations with 6+ trips). This ceiling effect is operationally concerning: the model underpredicts precisely when stockout risk is highest.
For Indego’s rebalancing operations, I would recommend deploying this model as a screening tool rather than an automated decision-maker. The model reliably identifies which stations will remain quiet, freeing operations staff to focus attention on the ~31% of station-hours with actual activity. However, for high-volume stations during peak periods, predictions should be treated as lower bounds rather than point estimates. A risk-averse deployment strategy might add a buffer to predictions at Residential stations (which average 1.49 trips/hour—nearly double Commuter Hubs at 0.83) and flag any station predicted above 3 trips as “high priority” for monitoring.
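The buffer-and-flag strategy described above could be sketched as follows. The +0.5-trip buffer and the 3-trip priority threshold are illustrative assumptions carried over from the discussion, not tuned values:

```r
# Illustrative rebalancing triage: buffered predictions + priority flags
rebalance_plan <- test %>%
  filter(!is.na(pred2)) %>%
  mutate(
    # Add a demand buffer at Residential stations (assumed +0.5 trips)
    buffered_pred = ifelse(station_type_label == "Residential",
                           pred2 + 0.5, pred2),
    # Flag any station-hour predicted above 3 trips for manual review
    priority = ifelse(buffered_pred > 3, "High priority", "Routine")
  ) %>%
  select(start_station, hour, pred2, buffered_pred, priority)
# How many station-hours would be flagged?
table(rebalance_plan$priority)
```

In practice the buffer sizes would be set jointly with operations staff, balancing truck capacity against stockout risk.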
**Equity Considerations** The error analysis revealed that prediction errors do not systematically disadvantage lower-income neighborhoods or communities of color; if anything, errors are slightly higher in wealthier, whiter areas because of their higher ridership volumes. The strongest demographic pattern was transit usage: stations in transit-rich areas (60%+ taking transit) had substantially lower MAE than car-dependent areas. This suggests the model works best in the dense urban core, where bike share integrates into predictable multimodal commuting.
From an equity standpoint, this pattern raises concerns for peripheral neighborhoods. Areas with lower transit usage—which may already have limited transportation options—receive less accurate predictions, potentially affecting service quality where reliable bike share access could provide the most benefit. To mitigate these risks, I would recommend: (1) establishing minimum service levels for all stations regardless of predicted demand; (2) regularly auditing model performance across demographic groups; (3) applying conservative prediction buffers in underserved areas to avoid systematic under-allocation; and (4) supplementing algorithmic predictions with community input about where bikes are actually needed.
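Recommendation (2) above, auditing performance across demographic groups, could be operationalized as a simple recurring table. A sketch, assuming `Med_Inc.x` and `pred2` are on `test` as constructed earlier:

```r
# Sketch: audit MAE across income terciles of station neighborhoods
test %>%
  filter(!is.na(pred2), !is.na(Med_Inc.x)) %>%
  mutate(
    error = abs(Trip_Count - pred2),
    # ntile() splits stations' tract incomes into three equal-sized groups
    income_group = factor(ntile(Med_Inc.x, 3),
                          labels = c("Low", "Middle", "High"))
  ) %>%
  group_by(income_group) %>%
  summarize(
    n = n(),
    avg_trips = mean(Trip_Count),
    MAE = mean(error),
    .groups = "drop"
  )
```

The same pattern extends to race and transit-usage variables; a widening MAE gap between groups over time would be the trigger for retraining or rebalancing-policy review.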
**Model Limitations** Several important patterns remain outside this model’s predictive capacity. The station type analysis revealed an unexpected result: “Low Activity” stations had the highest maximum trips (28), suggesting occasional demand spikes the model cannot anticipate, likely from special events or irregular usage patterns. The model also cannot account for Phillies games, university graduations, SEPTA disruptions, or construction detours that temporarily shift demand.
Perhaps most fundamentally, the model assumes future demand mirrors historical patterns. With 68.6% of observations being zeros, the model is essentially learning that most station-hours are quiet, which is true but not operationally useful. The real value would come from accurately predicting the 1-2% of station-hours with 6+ trips, but that is exactly where MAE is worst (4.97 trips). With additional time and data, I would prioritize: building separate models for high-volume and low-volume stations; incorporating event calendars; testing classification approaches (will this station-hour exceed 5 trips? yes/no) rather than regression; and adding real-time dock availability to identify stations already approaching capacity constraints.
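The classification idea could be prototyped with a logistic regression on a subset of the same predictors. A sketch, not a tuned model; the 5-trip threshold and the 0.5 probability cutoff mirror the discussion above and would need calibration against operational costs:

```r
# Sketch: classify whether a station-hour will exceed 5 trips
train_clf <- train %>% mutate(high_demand = as.integer(Trip_Count > 5))
test_clf <- test %>% mutate(high_demand = as.integer(Trip_Count > 5))
clf <- glm(
  high_demand ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
    lag1Hour + lag1day + rush_hour,
  data = train_clf,
  family = binomial(link = "logit")
)
test_clf$prob_high <- predict(clf, newdata = test_clf, type = "response")
# Confusion counts at a 0.5 probability cutoff
table(predicted = test_clf$prob_high > 0.5,
      actual = test_clf$high_demand)
```

With such rare positives, a much lower cutoff (or class weighting) would likely be needed; the point is that ranking station-hours by stockout risk may serve operations better than minimizing average MAE.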