library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
library(tidycensus)
library(viridis)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(RSocrata)
library(gifski)
library(gganimate)
library(osmdata)
library(FNN)
library(geosphere)
library(spdep)
library(caret)
library(sjPlot)
library(sjmisc)
library(sjlabelled)
plotTheme <- function(base_size = 12) {
theme(
text = element_text( color = "black"),
plot.title = element_text(size = 14,colour = "black"),
plot.subtitle = element_text(face="italic"),
plot.caption = element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),
panel.grid.major = element_line("grey80", size = 0.1),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2),
strip.background = element_rect(fill = "grey80", color = "white"),
strip.text = element_text(size=12),
axis.title = element_text(size=12),
axis.text = element_text(size=10),
plot.background = element_blank(),
legend.background = element_blank(),
legend.title = element_text(colour = "black", face = "italic"),
legend.text = element_text(colour = "black", face = "italic"),
strip.text.x = element_text(size = 14)
)
}
mapTheme <- function(base_size = 12) {
theme(
text = element_text( color = "black"),
plot.title = element_text(size = 14,colour = "black"),
plot.subtitle=element_text(face="italic"),
plot.caption=element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),axis.title = element_blank(),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2)
)
}
palette5 <- c("#f6e620","#96d83f","#59c765","#1f948c","#375b8d")
palette3 <- c("#696969","grey","#ffc200")
palette2 <- c("#696969","#ffc200")
q5 <- function(variable) {as.factor(ntile(variable, 5))}
Automobiles are cells running through the vessels of any aggregation of human activities. With the introduction of smart technologies into a variety of aspects of planning, policymaker and planners struggled with long-lasting problems like air pollution, traffic congestion and real-time route planning have gained relief to some extent. However, the data-driven approach of parking, is still in the lime light.
We can predict for the occupancy of parking space, but is that all we can offer as a solution? With demand-based model, we can know where to find the most available parking spaces. And city planners are using dynamic pricing systems to adjust for the unbalanced parking demand through time and space. The solutions are proposed to reduce waiting and cruising time around high-demand regions, and increase the usage of some low-demand spaces.
Here’s more to consider:
It’s time to form a rebalancing approach that involves both the public and private sectors to offer an integrated solution.
We are Thunder Parking.
For detailed use cases, please watch our video.
As it is shown in the meter distribution map, majority of meters in San Francisco are located in Financial District and Downtown. Due to urban problems such as centralization and separation of work and residence, both neighborhoods are also the disaster areas for parking problems. Since the parking meters are set a extremely high density in the center city, we focus on the neighborhoods adjacent to the Financial District and Downtown for two reasons.
On the one hand, these neighborhoods are close to the city center, and the number of on-street parking spaces has not reached the maximum limitation. By predicting the parking demand in these neighborhoods, government could consider to build increase parking meters in these areas to ease the shortage of parking spaces in the city center. Secondly,these areas have the potential to develop the private parking space rental market. Unlike the city center, our focused neighborhoods are mainly residential. A great number of housing parking spaces are available during workdays. These families can consider renting residual parking spaces to workers living in the suburbs and working in the city.
grid.arrange(ncol=2,
ggplot() +
geom_sf(data = neighbor, fill = "azure") +
geom_sf(data = meters.sub2, size = .75, color='#297b8e',alpha=0.2) +
labs(title = "Spatial Distribution of Parking Meters, SF,2018",
caption = "Data resource: SF OpenData") +
mapTheme(),
ggplot() +
geom_sf(data = neighbor, fill = "azure") +
geom_sf(data = neighbor_all, aes(fill = legend), show.legend = , size= 0) +
scale_fill_manual(values = c("#297b8e","#ffc200")) +
labs(title = "Research Area") +
mapTheme())
There parking records are downloaded from San Francisco Opendata. Before exploratory analysis, we drop the records where parking start time is before or the same as the end time. To better predict the parking demand of each meter during different time interval. A panel is created to record hourly parking space occupancy for each meters, which contains more than 800 million rows in total.
# extract a subset of 5 week dataset located in research area
dat_neighbour <- dat_neighbour[!(dat_neighbour$SESSION_START_DT>=dat_neighbour$SESSION_END_DT),]
temp <- dat_neighbour %>%
mutate(interval60=floor_date(ymd_hms(SESSION_START_DT),unit='hour'),
interval60.start=floor_date(ymd_hms(SESSION_START_DT),unit='hour'),
#interval15.start=floor_date(ymd_hms(SESSION_START_DT),unit='15mins'),
interval60.end=floor_date(ymd_hms(SESSION_END_DT),unit='hour'),
#interval15.end=floor_date(ymd_hms(SESSION_END_DT),unit='15mins'),
week=week(interval60.start),
dotw=wday(interval60.start,label=TRUE),
session_duration_sec=ymd_hms(SESSION_END_DT)-ymd_hms(SESSION_START_DT))%>%
mutate(payment_per_hour= round(GROSS_PAID_AMT/as.numeric(session_duration_sec)*3600,3))
# construct a 5-week panel template with time intervals from weather.Panel and parking meters from the extracted subset
temp.study.panel <- expand.grid(interval60 = unique(weather.Panel$interval60),
POST_ID = unique(temp$POST_ID))
# join data to panel template
temp.parking.panel <- left_join(temp.study.panel, st_drop_geometry(temp)
%>% select(interval60,POST_ID,SESSION_START_DT,SESSION_END_DT,
interval60.start, interval60.end, METER_EVENT_TYPE,payment_per_hour),
by = c('interval60', 'POST_ID'))
# lags, lags, lags to fill the dataset
temp.occupied <- temp.parking.panel %>% select(interval60,POST_ID,SESSION_START_DT,SESSION_END_DT, interval60.start, interval60.end, METER_EVENT_TYPE) %>%
mutate(interval60.end = case_when(is.na(interval60.end)&lag(interval60.end, 1) >= interval60 ~ lag(interval60.end, 1) ,
TRUE ~ interval60.end),
interval60.start = case_when(is.na(interval60.start)&lag(interval60.end, 1) >= interval60 ~ lag(interval60.start, 1) ,
TRUE ~ interval60.start),
SESSION_START_DT = case_when(is.na(SESSION_START_DT)&lag(interval60.end, 1) >= interval60 ~ lag(SESSION_START_DT, 1) ,
TRUE ~ SESSION_START_DT),
SESSION_END_DT = case_when(is.na(SESSION_END_DT)&lag(interval60.end, 1) >= interval60 ~ lag(SESSION_END_DT, 1) ,
TRUE ~ SESSION_END_DT),
METER_EVENT_TYPE = case_when(is.na(METER_EVENT_TYPE)&lag(interval60.end, 1) >= interval60 ~ lag(METER_EVENT_TYPE, 1) ,
TRUE ~ METER_EVENT_TYPE)
) %>%
mutate(interval60.end = case_when(is.na(interval60.end)&lag(interval60.end, 2) >= interval60 ~ lag(interval60.end, 2) ,
TRUE ~ interval60.end),
interval60.start = case_when(is.na(interval60.start)&lag(interval60.end, 2) >= interval60 ~ lag(interval60.start, 2) ,
TRUE ~ interval60.start),
SESSION_START_DT = case_when(is.na(SESSION_START_DT)&lag(interval60.end, 2) >= interval60 ~ lag(SESSION_START_DT, 2) ,
TRUE ~ SESSION_START_DT),
SESSION_END_DT = case_when(is.na(SESSION_END_DT)&lag(interval60.end, 2) >= interval60 ~ lag(SESSION_END_DT, 2) ,
TRUE ~ SESSION_END_DT),
METER_EVENT_TYPE = case_when(is.na(METER_EVENT_TYPE)&lag(interval60.end, 2) >= interval60 ~ lag(METER_EVENT_TYPE, 2) ,
TRUE ~ METER_EVENT_TYPE)
) %>%
mutate(interval60.end = case_when(is.na(interval60.end)&lag(interval60.end, 3) >= interval60 ~ lag(interval60.end, 3) ,
TRUE ~ interval60.end),
interval60.start = case_when(is.na(interval60.start)&lag(interval60.end, 3) >= interval60 ~ lag(interval60.start, 3) ,
TRUE ~ interval60.start),
SESSION_START_DT = case_when(is.na(SESSION_START_DT)&lag(interval60.end, 3) >= interval60 ~ lag(SESSION_START_DT, 3) ,
TRUE ~ SESSION_START_DT),
SESSION_END_DT = case_when(is.na(SESSION_END_DT)&lag(interval60.end, 3) >= interval60 ~ lag(SESSION_END_DT, 3) ,
TRUE ~ SESSION_END_DT),
METER_EVENT_TYPE = case_when(is.na(METER_EVENT_TYPE)&lag(interval60.end, 3) >= interval60 ~ lag(METER_EVENT_TYPE, 3) ,
TRUE ~ METER_EVENT_TYPE)
) %>%
mutate(interval60.end = case_when(is.na(interval60.end)&lag(interval60.end, 4) >= interval60 ~ lag(interval60.end, 4) ,
TRUE ~ interval60.end),
interval60.start = case_when(is.na(interval60.start)&lag(interval60.end, 4) >= interval60 ~ lag(interval60.start, 4) ,
TRUE ~ interval60.start),
SESSION_START_DT = case_when(is.na(SESSION_START_DT)&lag(interval60.end, 4) >= interval60 ~ lag(SESSION_START_DT, 4) ,
TRUE ~ SESSION_START_DT),
SESSION_END_DT = case_when(is.na(SESSION_END_DT)&lag(interval60.end, 4) >= interval60 ~ lag(SESSION_END_DT, 4) ,
TRUE ~ SESSION_END_DT),
METER_EVENT_TYPE = case_when(is.na(METER_EVENT_TYPE)&lag(interval60.end, 4) >= interval60 ~ lag(METER_EVENT_TYPE, 4) ,
TRUE ~ METER_EVENT_TYPE)
) %>%
mutate(interval60.end = case_when(is.na(interval60.end)&lag(interval60.end, 5) >= interval60 ~ lag(interval60.end, 5) ,
TRUE ~ interval60.end),
interval60.start = case_when(is.na(interval60.start)&lag(interval60.end, 5) >= interval60 ~ lag(interval60.start, 5) ,
TRUE ~ interval60.start),
SESSION_START_DT = case_when(is.na(SESSION_START_DT)&lag(interval60.end, 5) >= interval60 ~ lag(SESSION_START_DT, 5) ,
TRUE ~ SESSION_START_DT),
SESSION_END_DT = case_when(is.na(SESSION_END_DT)&lag(interval60.end, 5) >= interval60 ~ lag(SESSION_END_DT, 5) ,
TRUE ~ SESSION_END_DT),
METER_EVENT_TYPE = case_when(is.na(METER_EVENT_TYPE)&lag(interval60.end, 5) >= interval60 ~ lag(METER_EVENT_TYPE, 5) ,
TRUE ~ METER_EVENT_TYPE)
) %>%
mutate(interval60.end = case_when(is.na(interval60.end)&lag(interval60.end, 6) >= interval60 ~ lag(interval60.end, 6) ,
TRUE ~ interval60.end),
interval60.start = case_when(is.na(interval60.start)&lag(interval60.end, 6) >= interval60 ~ lag(interval60.start, 6) ,
TRUE ~ interval60.start),
SESSION_START_DT = case_when(is.na(SESSION_START_DT)&lag(interval60.end, 6) >= interval60 ~ lag(SESSION_START_DT, 6) ,
TRUE ~ SESSION_START_DT),
SESSION_END_DT = case_when(is.na(SESSION_END_DT)&lag(interval60.end, 6) >= interval60 ~ lag(SESSION_END_DT, 6) ,
TRUE ~ SESSION_END_DT),
METER_EVENT_TYPE = case_when(is.na(METER_EVENT_TYPE)&lag(interval60.end, 6) >= interval60 ~ lag(METER_EVENT_TYPE, 6) ,
TRUE ~ METER_EVENT_TYPE)
) %>%
mutate(interval60.end = case_when(is.na(interval60.end)&lag(interval60.end, 7) >= interval60 ~ lag(interval60.end, 7) ,
TRUE ~ interval60.end),
interval60.start = case_when(is.na(interval60.start)&lag(interval60.end, 7) >= interval60 ~ lag(interval60.start, 7) ,
TRUE ~ interval60.start),
SESSION_START_DT = case_when(is.na(SESSION_START_DT)&lag(interval60.end, 7) >= interval60 ~ lag(SESSION_START_DT, 7) ,
TRUE ~ SESSION_START_DT),
SESSION_END_DT = case_when(is.na(SESSION_END_DT)&lag(interval60.end, 7) >= interval60 ~ lag(SESSION_END_DT, 7) ,
TRUE ~ SESSION_END_DT),
METER_EVENT_TYPE = case_when(is.na(METER_EVENT_TYPE)&lag(interval60.end, 7) >= interval60 ~ lag(METER_EVENT_TYPE, 7) ,
TRUE ~ METER_EVENT_TYPE)
)
temp.occupied.calibrated_1 <- temp.occupied %>%
mutate(
Ratio_Time_Occupied = case_when(
# session start and ends in the same hour
interval60.start == interval60 &
interval60.end == interval60 ~ as.numeric(ymd_hms(SESSION_END_DT) - ymd_hms(SESSION_START_DT))/3600, # precision: sec
# session ends in hour(s) later (not inherited)
interval60.start == interval60 &
interval60.end > interval60 ~ 1.0 - as.numeric(ymd_hms(SESSION_START_DT) - ymd_hms(interval60))/3600, # precision: min
# session inherited from previous and ends in the same hour
interval60.start < interval60 &
interval60.end == interval60 ~ as.numeric(ymd_hms(SESSION_END_DT) - ymd_hms(interval60))/3600, # precision:
# session inherited from previous and ends in hour(s) later
interval60.start < interval60 &
interval60.end > interval60 ~ 1.0),
Ratio_Time_Occupied = replace_na(Ratio_Time_Occupied, 0)
) %>%
mutate(Ratio_Time_Occupied = case_when(
# interval60 with both inherited sessions and initial sessions
interval60.start == interval60 &
lag(interval60.end, 1) == interval60 &
lag(interval60.start, 1) < interval60 &
ymd_hms(lag(SESSION_END_DT, 1)) < ymd_hms(SESSION_START_DT) ~ Ratio_Time_Occupied + as.numeric(ymd_hms(lag(SESSION_END_DT, 1)) - ymd_hms(interval60))/3600,
TRUE ~ as.numeric(Ratio_Time_Occupied)))
temp.occupied.calibrated_2 <- temp.occupied.calibrated_1 %>%
mutate(Ratio_Time_Occupied = case_when(# AT with starttime before the end of last session
METER_EVENT_TYPE == 'AT' & lag(METER_EVENT_TYPE, 1) == 'NS' &
ymd_hms(lag(SESSION_END_DT, 1)) >= ymd_hms(SESSION_START_DT) &
lag(interval60.end, 1) > interval60.start &
interval60.end > interval60.start ~ 0,
METER_EVENT_TYPE == 'AT' & lag(METER_EVENT_TYPE, 1) == 'NS' &
ymd_hms(lag(SESSION_END_DT, 1)) >= ymd_hms(SESSION_START_DT) &
lag(interval60.end, 1) == interval60.start &
interval60.end > interval60.start ~ as.numeric(ymd_hms(interval60.end) - ymd_hms(lag(SESSION_END_DT, 1)))/3600,
TRUE ~ as.numeric(Ratio_Time_Occupied)))
group.occupied <- temp.occupied.calibrated_2 %>%
group_by(interval60, POST_ID) %>% summarize(Ratio_Time_Occupied = round(sum(Ratio_Time_Occupied), 3))
#create parking_meter.panel
parking_meter.panel <-
temp.parking.panel %>%select(interval60,POST_ID,payment_per_hour)%>%
left_join(group.occupied, by = c("POST_ID" = "POST_ID","interval60" = "interval60")) %>%
left_join(weather.Panel, by = "interval60") %>%
left_join(meters.sub2, by = c("POST_ID" = "post_id")) %>%
mutate(week = week(interval60),
dotw = wday(interval60, label = TRUE))
# create time lags
parking_meter.panel <-
parking_meter.panel %>%
arrange(POST_ID, interval60) %>%
group_by(POST_ID) %>%
mutate(lagHour = dplyr::lag(Ratio_Time_Occupied,1),
lag2Hours = dplyr::lag(Ratio_Time_Occupied,2),
lag12Hours = dplyr::lag(Ratio_Time_Occupied,3),
lag1day = dplyr::lag(Ratio_Time_Occupied,24)) %>%
ungroup() %>%
mutate(day = yday(interval60)) %>%
# holiday: 1st day (NewYear), 15th day (MLK)
mutate(holiday = case_when(day == 1 ~ 1,
day == 15 ~ 1),
holiday = replace_na(holiday, 0))
# holiday lag (this is a strange method...but I haven't come up with better ones)
temp.holiday.lag <- st_drop_geometry(parking_meter.panel) %>%
select(day, holiday) %>% distinct() %>%
mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
dplyr::lag(holiday, 2) == 1 ~ "PlusTwoDays",
dplyr::lag(holiday, 3) == 1 ~ "PlusThreeDays",
dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",
dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays",
dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays"),
holidayLag = replace_na(holidayLag, 0))
parking_meter.panel <- parking_meter.panel %>% left_join(., temp.holiday.lag)
parking_meter.panel <- subset(parking_meter.panel,select= -c(legend))
#fill NA
parking_meter.panel[is.na(parking_meter.panel)] <- 0
#add neighborhood name
parking_meter.panel <- parking_meter.panel%>%st_sf()%>%st_transform(crs=4326)%>%
st_join(neighbor_sub%>%st_transform(crs=4326),
join=st_intersects,left = TRUE)%>%
mutate(longitude = unlist(map(geometry, 1)),
latitude = unlist(map(geometry, 2)))%>%
na.omit()%>%
st_sf()
parking_meter.panel <- subset(parking_meter.panel,select= -c(legend))
# split training and test set
parking_meter.Train <- parking_meter.panel %>% filter( week == 2 | week == 3 | week == 4)
parking_meter.Test <- parking_meter.panel %>% filter(week == 1| week == 5)
length(parking_meter.Train$interval60) + length(parking_meter.Test$interval60)
Dependent variable, hourly occupancy of each parking space, is the what we want to predict with regression model. We split the panel into Training and Testing sets for modeling and validation. To ensure both of the training and testing sets contain a holiday, week1 and week5 are selected as the test data. Week2, week3, week4 are utilized for training.
Note that the actual parking time occupancy could be smaller than the recorded time period. Since the parking end time is recorded when drivers make a payment. In San Francisco, there are two payment methods. One is paying for the bill when drivers are leaving. The other is to estimate your parking time and pay for the parking fee in advance. Usually, drivers leave the parking space before the recorded time to avoid parking tickets when they choose the second payment method.
# parking_count serial autocorrelation
NewYear <- as.POSIXct("2018-01-01 01:00:00 UTC")
MLK <- as.POSIXct("2018-01-15 01:00:00 UTC")
st_drop_geometry(rbind(
mutate(parking_meter.Train, Legend = "Training"),
mutate(parking_meter.Test, Legend = "Testing"))) %>%
group_by(Legend, interval60) %>%
summarize(Ratio_Time_Occupied = sum(Ratio_Time_Occupied)) %>%
ungroup() %>%
ggplot(aes(interval60, Ratio_Time_Occupied, colour = Legend)) + geom_line() +
scale_colour_manual(values = palette2) +
geom_vline(xintercept = NewYear, linetype = "dotted", colour = 'red',size=1) +
geom_vline(xintercept = MLK, linetype = "dotted", colour = 'red',size=1) +
labs(title="Parking volumes, the first five weeks of 2018",
subtitle="Red Dotted lines for New Year's Day and Martin Luther King's Day",
x="Day", y="Parking Count") +
plotTheme() + theme(panel.grid.major = element_blank())
Independent variables are some features that may effects the demand of parking spaces. We divide these characteristics into five categories, including Parking Characteristics, Natural Features, Built Environment Features, Socioeconomics Features, Time Features and Spatial Features. We call these features as potential independent variables, since not all of the features listed in the table below are included in our final regression model.
Parking Characteristics: Characteristics included in the parking records, such as payment type and parking fee.
Natural Features: Weather data including percipitation, weed speed and temperature.
Built Environment Features: Distance to the nearest facilities and amenities.
Socioeconomics Features: Census data including total population, percent of poverty by tracts. Note that we also includes crime data in this part.
Time Features: One is time lag, it counts the mean parking occupancy in a recent time period. The other is holiday lag, it marks days before and after certain holiday.
Spatial Features: Neighborhood name is considered as a spatial feature.
Parking payment is decided by both of parking location and occupancy hours. The parking records is divided into 2 parts based on the median value of 1.5 dollars. As the charts demonstrate, drivers who payed less are more likely to park for additional time. It is obvious that divers prefer to pay cash when parking payment is less than 1.5 dollars. When the payment is above 1.5 dollars, count of credit card and cell phone payments boom.
meter.template %>%
dplyr::select(Parking_Counter, PAYMENT_TYPE, METER_EVENT_TYPE) %>%na.omit()%>%
gather(Variable, value, -Parking_Counter) %>%
count(Variable, value, Parking_Counter) %>%
ggplot(., aes(value, n, fill = Parking_Counter)) +
geom_bar(position = "dodge", stat="identity") +
facet_wrap(~Variable, scales="free") +
scale_fill_manual(values = palette2) +
labs(x="y", y="Value",
title = "Feature associations with the likelihood of taking the credit",
subtitle = "Categorical features") +
plotTheme()+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Weather data of San Francisco from January to February, 2018 is downloaded from Iowa Environment Mesonet
using the riem
package.
Intuitively, weather condition seems not to influence drivers sitting in a car. However, for the residents who are used to take a bike or public transits, they may choose to drive at a rainy or windy day, which increase the potential parking demand.
weatherDataViz <- grid.arrange(
ggplot(weather.Panel, aes(interval60,Precipitation)) + geom_line(colour = "#ffc200",size = 0.8) +
labs(title="Precipitation", x="Hour", y="Precipitation") + plotTheme(),
ggplot(weather.Panel, aes(interval60,Wind_Speed)) + geom_line(colour = "#1f948c",size = 0.8) +
labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme(),
ggplot(weather.Panel, aes(interval60,Temperature)) + geom_line(colour = "#375b8d",size = 0.8) +
labs(title="Temperature", x="Hour", y="Temperature") + plotTheme(),
top="Weather Data - San Francisco - SFO - Jan 01 - Feb 04, 2018")
We assume that crime may influence parking choice, because people would change their parking location once their car was stolen. Theft from buildings are selected from the San Francisco Open crime data, 2018. As the map illustrates, The parking spaces in our research area is relatively safer than parking spaces within the city center. This could be a reason to encourage drivers to park outside the downtown and financial district.
ggplot() +
geom_sf(data = neighbor, fill = "white") +
geom_sf(data = sf.base, fill = "transparent", size=.75) +
geom_sf(data = meters.sub2, aes(colour = q5(theft_nn5)), show.legend = "point", size = .75, alpha=.5) +
scale_colour_manual(values = palette5,
labels = qBr(meters.sub2, "theft_nn5"),
name = "Quantile\nBreaks") +
geom_sf(data = roads_line, size = .75, color='#666666',alpha=0.1) +
geom_sf(data = neighbor_center,color = '#ee8138', fill = "transparent",size=1,)+
ylim(min(dat_neighbour$latitude+0.001), max(dat_neighbour$latitude+0.001))+
xlim(min(dat_neighbour$longitude+0.001), max(dat_neighbour$longitude+0.001))+
labs(title = "Average Distance to the 5 Closest Theft from Vehicles, SF,2018",
caption = "Data resource: SF OpenData") +
mapTheme()+
theme(panel.background = element_rect(fill = "#d5eff5"),panel.grid.major = element_blank())
Below is a map showing the distance from each parking spaces to the closest Muni Bus Station. As we can observe, majority of parking spaces in our research area are close to a bus station. It’s practical to park at the surrounding neighborhoods and take a bus to the downtown with only one or two stops. It’s much better than cruising for a parking space in downtown at A.M. rush.
ggplot() +
geom_sf(data = neighbor, fill = "white") +
geom_sf(data = sf.base, fill = "transparent", size=.75) +
geom_sf(data = meters.sub2, aes(colour = q5(stops_nn1)), show.legend = "point", size = .75, alpha=.5) +
scale_colour_manual(values = palette5,
labels = qBr(meters.sub2, "stops_nn1"),
name = "Quantile\nBreaks") +
geom_sf(data = roads_line, size = .75, color='#666666',alpha=0.1) +
geom_sf(data = neighbor_center,color = '#ee8138', fill = "transparent",size=1,)+
ylim(min(dat_neighbour$latitude+0.001), max(dat_neighbour$latitude+0.001))+
xlim(min(dat_neighbour$longitude+0.001), max(dat_neighbour$longitude+0.001))+
labs(title = "Distance to the Closest Bus Stops, SF,2018",
caption = "Data resource: SF OpenData") +
mapTheme()+
theme(panel.background = element_rect(fill = "#d5eff5"),panel.grid.major = element_blank())
Seeing from the series of scatter plots, the correlation between the hourly parking time occupancy and distance to the urban amenities and crimes is not significantly strong. We speculate reason is that urban facilities are relatively abundant within the scope of our study, so the distance discrepancy is not obvious among parking spaces.
# parking volumne as a function of census variables
# note: for week 4 here (faster loading)
# plotData_nn.census <-
# as.data.frame(parking_meter.Test2) %>%
# filter(week == 1) %>%
# select(POST_ID, Ratio_Time_Occupied) %>%
# group_by(POST_ID) %>%
# summarize(Ratio_Time_Occupied = sum(Ratio_Time_Occupied)) %>%
# left_join(meters.sub2_template, by=c("POST_ID" = "post_id")) %>%
# gather(Variable, Value, -POST_ID, -Ratio_Time_Occupied) %>%
# na.omit() %>%
# filter(Variable =="recreation_nn1"| Variable == "theft_nn5"|Variable == "stops_nn1"|
# Variable == "RoadDist"|Variable == "hos_nn1"|Variable == "faci_nn3")
# correlation calculation
correlation_nn.census <-
plotData_nn.census %>%
group_by(Variable) %>%
summarize(correlation = round(cor(Value, Ratio_Time_Occupied),2))
# plot
ggplot(plotData_nn.census, aes(Value,log(Ratio_Time_Occupied))) +
geom_point(color="grey",alpha = 0.1) +
geom_smooth(method="lm", se = F,color='#ffc200',size=1) +
facet_wrap(~Variable, scales="free", ncol=3) +
geom_text(data=correlation_nn.census, aes(label=paste("R =", correlation)),
colour = "black", x=-Inf, y=Inf, hjust=-0.1, vjust=1.2) +
plotTheme() +
labs(title="Parking Time Occupancy as a Function of Distance to Amenities and Crimes",
subtitle= "Week 1, 2018")
Below is an animated map showing the fluctuation of hourly parking time occupancy during a weekday. There are two main parking peak periods. One is between 9 a.m. to 12 p.m. and the other is between 2 p.m. and 5 p.m.. These two periods exactly correspond to the working hours of most companies. It indicates the parking spaces near downtown San Francisco mainly meets the parking demand of office workers.
week1 <-
filter(parking_meter.panel , week == 1 & dotw == "Tue")%>%
select(interval60, geometry, Ratio_Time_Occupied)
ride.animation.data <-
week1%>%
mutate(Occupancy = case_when(Ratio_Time_Occupied == 0 ~ "0%",
Ratio_Time_Occupied > 0 & Ratio_Time_Occupied <= 0.5 ~ "< 50%",
Ratio_Time_Occupied > 0.5 & Ratio_Time_Occupied <= 0.75 ~ "50-75%",
Ratio_Time_Occupied > 0.75 & Ratio_Time_Occupied <=0.95 ~ "75-95%",
Ratio_Time_Occupied > 0.95 ~ ">95%")) %>%
mutate(Occupancy = factor(Occupancy, levels=c("0%", "< 50%","50-75%","75-95%",">95%")))
rideshare_animation <-
ggplot() +
geom_sf(data = neighbor, fill = "white") +
geom_sf(data = sf.base, fill = "transparent", size=.75) +
geom_sf(data = roads_line, size = .75, color='#666666',alpha=0.1) +
geom_sf(data = ride.animation.data, aes(colour = Occupancy)) +
scale_colour_manual(values = palette5) +
labs(title = "Parking Time Occupancy by Meters for one day in January 2018",
subtitle = "60 minute intervals: {current_frame}") +
geom_sf(data = neighbor_center,color = '#ee8138', fill = "transparent",size=1,)+
ylim(min(dat_neighbour$latitude+0.001), max(dat_neighbour$latitude+0.001))+
xlim(min(dat_neighbour$longitude+0.001), max(dat_neighbour$longitude+0.001))+
transition_manual(interval60) +
mapTheme()+
theme(panel.background = element_rect(fill = "#d5eff5"),panel.grid.major = element_blank())
animate(rideshare_animation, duration=20, renderer = gifski_renderer())
Following is a series of maps demonstrate the close relationship between parking demand and office work schedule near downtown San Francisco. Parking volume is high during the mid-day of weekdays. The volume declined around PM rush, since people are off work and drive away from the downtown.MOst parking spaces are available during weekends.
ggplot()+
geom_sf(data = neighbor
%>%
st_transform(crs = 4326),color = "grey", fill = "transparent")+
geom_point(data = dat_neighbour %>%
mutate(hour = hour(SESSION_START_DT),
weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
group_by(POST_ID, latitude, longitude, weekend, time_of_day) %>%
tally(),
aes(x=longitude, y = latitude, color = n),
fill = "transparent", alpha = 1, size = 1.0)+
geom_sf(data = neighbor_center
%>%
st_transform(crs = 4326),color = '#666666', fill = "transparent",size=1)+
scale_colour_viridis(direction = -1,
discrete = FALSE, option = "D")+
ylim(min(dat_neighbour$latitude+0.001), max(dat_neighbour$latitude+0.001))+
xlim(min(dat_neighbour$longitude+0.001), max(dat_neighbour$longitude+0.001))+
facet_grid(weekend ~ time_of_day)+
labs(title="Parking Occupancy per hour by meters. SF, 2018")+
mapTheme
parking_meter.panel
is split into a training (week2, week3, week4) and a test set week1, week5) to ensure both of the sets including holidays. To compare and find out the best regreession model, three linear models are built using the ‘lm’ function:
# regression
reg1.m <-
lm(Ratio_Time_Occupied ~ + hour(interval60) + dotw + Temperature + Precipitation +
lagHour + lag2Hours +lag12Hours + holidayLag + holiday,
data=parking_meter.Train)
reg2.m <-
lm(Ratio_Time_Occupied ~ + hour(interval60) + dotw + Temperature + Precipitation +
lagHour + lag2Hours + +lag12Hours + holidayLag + holiday+
recreation_nn1 + theft_nn5 + stops_nn1 + RoadDist + edu_nn3 + hos_nn1 + faci_nn3,
data=parking_meter.Train)
reg3.m <-
lm(Ratio_Time_Occupied ~ + hour(interval60) + dotw + Temperature + Precipitation +
lagHour + lag2Hours + +lag12Hours + holidayLag + holiday+
theft_nn5 + stops_nn1 + RoadDist + edu_nn3 + hos_nn1 + faci_nn3 +
payment_per_hour + neighborho,
data=parking_meter.Train)
# predict for test data
parking_meter.Test.weekNest <-
parking_meter.Test %>%
nest(-week)
week_predictions <-
parking_meter.Test.weekNest %>%
mutate(ATime_Space_FE_timeLags_holidayLags = map(.x = data, fit = reg1.m, .f = model_pred),
BTime_Space_FE_timeLags_holidayLags_nn = map(.x = data, fit = reg2.m, .f = model_pred),
CTime_Space_FE_timeLags_holidayLags_nn_price = map(.x = data, fit = reg3.m, .f = model_pred)) %>%
gather(Regression, Prediction, -data, -week) %>%
mutate(Observed = map(data, pull, Ratio_Time_Occupied),
Absolute_Error = map2(Observed, Prediction, ~ abs(.x - .y)),
MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))
Comparing the improved regression with the baseline regression, the added features are all significant and improve the R-square from 0.531
to 0.584
. The closer R-square is to 1, the stronger the model’s ability to explain the independent variables. Clearly, 0.584
is not a great number and we will evaluate the reason in the discussion part.
# Regression results output
regression_table <-tab_model(reg1.m, reg3.m, auto.label = TRUE, show.ci = FALSE, show.intercept = FALSE,
title = 'Summary of Training Results',
dv.labels = c("Baseline Regression","Improved Regression"),
string.pred = "Coefficient")
Baseline Regression | Improved Regression | |||
---|---|---|---|---|
Coefficient | Estimates | p | Estimates | p |
hour(interval60) | 0.00 | <0.001 | 0.00 | <0.001 |
dotw [linear] | 0.01 | <0.001 | 0.01 | <0.001 |
dotw [quadratic] | -0.03 | <0.001 | -0.01 | <0.001 |
dotw [cubic] | 0.01 | <0.001 | 0.01 | <0.001 |
dotw [4th degree] | -0.01 | <0.001 | -0.00 | <0.001 |
dotw [5th degree] | 0.01 | <0.001 | 0.00 | <0.001 |
dotw [6th degree] | 0.00 | <0.001 | 0.00 | <0.001 |
Temperature | -0.01 | <0.001 | -0.00 | <0.001 |
Precipitation | 0.00 | <0.001 | 0.00 | <0.001 |
lagHour | 0.72 | <0.001 | 0.68 | <0.001 |
lag2Hours | -0.05 | <0.001 | -0.05 | <0.001 |
lag12Hours | 0.05 | <0.001 | 0.04 | <0.001 |
holidayLag [MinusOneDay] | 0.00 | <0.001 | 0.00 | <0.001 |
holidayLag [MinusThreeDays] |
0.01 | <0.001 | 0.01 | <0.001 |
holidayLag [MinusTwoDays] | 0.02 | <0.001 | 0.01 | <0.001 |
holidayLag [PlusOneDay] | 0.01 | <0.001 | 0.01 | <0.001 |
holidayLag [PlusThreeDays] |
0.00 | <0.001 | 0.00 | 0.015 |
holidayLag [PlusTwoDays] | 0.02 | <0.001 | 0.01 | <0.001 |
holiday | -0.00 | <0.001 | -0.00 | 0.016 |
theft_nn5 | 1.16 | <0.001 | ||
stops_nn1 | 2.32 | <0.001 | ||
RoadDist | 0.71 | 0.010 | ||
edu_nn3 | -0.19 | 0.003 | ||
hos_nn1 | -0.12 | <0.001 | ||
faci_nn3 | 1.76 | <0.001 | ||
payment_per_hour | 0.10 | <0.001 | ||
neighborho [Nob Hill] | 0.01 | <0.001 | ||
neighborho [North Beach] | 0.01 | <0.001 | ||
neighborho [South of Market] |
0.00 | 0.572 | ||
neighborho [Western Addition] |
0.00 | <0.001 | ||
Observations | 5087470 | 5087470 | ||
R2 / R2 adjusted | 0.531 / 0.531 | 0.584 / 0.584 |
Table 3.1
Mean Absolute Error(MAE) is absolute difference between observed and predicted value. Comparing the MAE results of three regression models, distance to the urban facilities doesn’t improve much of the accuracy within our research area. In the third regression model, hourly parking price and neighborhood features significantly improved the accuracy. This bar chart also indicate that prediction accuracy of week1 is better than that of week5. The potential reasons will be discussed following.
# MAE
week_predictions %>%
dplyr::select(week, Regression, MAE) %>%
gather(Variable, MAE, -Regression, -week) %>%
ggplot(aes(week, MAE)) +
geom_bar(aes(fill = Regression), position = "dodge", stat="identity", width=2) +
scale_fill_manual(values = palette3) +
labs(title = "Mean Absolute Errors by model specification and week") +
plotTheme()
Below is a line chart comparing the observed and predicted parking value by time series. It indicates that our final model performs great when predicting the low point and under-predicts the peak moments. The observed parking volume on Saturday, week5 is higher than normal situation. We look up San Francisco events and find an Independent Film Festival on that day. This also explains why the MAE of the week5 is higher than week1.
# plot predict/observed time series
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
POST_ID = map(data, pull, POST_ID)) %>%
dplyr::select(interval60, POST_ID, Observed, Prediction, Regression) %>%
unnest() %>%
gather(Variable, Value, -Regression, -interval60, -POST_ID) %>%
group_by(Regression, Variable, interval60) %>%
summarize(Value = sum(Value,na.rm=TRUE)) %>%
ggplot(aes(interval60, Value, colour=Variable)) +
scale_colour_manual("",values = c("Observed" = "darkgray","Prediction" = "#ffc200"))+
geom_line(size = 1.1) +
facet_wrap(~Regression, ncol=1) +
labs(title = "Predicted/Observed parking volume time series",
subtitle = "San Francisco; A test set of 1 week", x = "Hour", y= "Parking Volume") +
plotTheme()
From a spatial scope, our model performs better accuracy on parking spaces close to the Downtown and Financial District.
# validate test set by space
# error_by_week_space <- week_predictions %>%
# mutate(interval60 = map(data, pull, interval60),
# POST_ID = map(data, pull, POST_ID),
# latitude = map(data, pull, latitude),
# longitude = map(data, pull, longitude)) %>%
# select(interval60, POST_ID, latitude, longitude, Observed, Prediction, Regression) %>%
# unnest() %>%
# filter(Regression == "CTime_Space_FE_timeLags_holidayLags_nn_price") %>%
# group_by(POST_ID, latitude, longitude) %>%
# summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))
error_by_week_space %>%
ggplot(.)+
geom_sf(data = neighbor, fill = "white") +
geom_sf(data = sf.base, fill = "transparent", size=.75) +
geom_sf(data = roads_line, size = .75, color='#666666',alpha=0.1) +
geom_point(aes(x = longitude, y = latitude, color = MAE),
fill = "transparent", alpha = 1, size=1)+
scale_colour_viridis(direction = -1,
discrete = FALSE, option = "D")+
geom_sf(data = neighbor_center,color = '#ee8138', fill = "transparent",size=1,)+
labs(title="Mean Abs Error, Test Set")+
ylim(min(dat_neighbour$latitude+0.001), max(dat_neighbour$latitude+0.001))+
xlim(min(dat_neighbour$longitude+0.001), max(dat_neighbour$longitude+0.001))+
mapTheme()+
theme(panel.background = element_rect(fill = "#d5eff5"),panel.grid.major = element_blank())
Limited by the computer performance, we randomly select 800 thousand of rows from the study panel and ran 10 Fold regression. Cross-validation is a resampling process, which shuffles data randomly. The dataset will be firstly split into 100 groups, and one group will be taken as a holdout, and the remaining will become a training dataset. Then, the model is fitted in the training set and will be tested against the test set. As we can see, the CV MAE is close to MAE of previous prediction result, which indicate the accuracy is generalizable.
parking_sample <- parking_meter.panel[sample(nrow(parking_meter.panel), 800000), ]
fitControl <- trainControl(method = "cv", number = 10)
set.seed(825)
reg.cv <- train(Ratio_Time_Occupied ~ + hour(interval60) + dotw + Temperature + Precipitation +
lagHour + lag2Hours + +lag12Hours + holidayLag + holiday + theft_nn5 + stops_nn1 +
RoadDist + edu_nn3 + hos_nn1 + faci_nn3 + payment_per_hour,
data=parking_sample,
method = "lm", trControl = fitControl, na.action = na.pass)
reg.cv.results <-reg.cv$resample
The Root Mean Squared Error (RMSE) is 0.2066965 which is the standard deviation of the absolute error. The distribution of the standard deviation of MAE is shown in the histogram below. The range of MAE is from 0.0765 to 0.0804, which is a small interval.
ggplot(as.data.frame(reg.cv.results), aes(MAE)) +
geom_histogram(bins = 50, colour="white", fill = "#1f948c") +
labs(title="Regression C", subtitle = "k-fold cross validation; k = 10",
x="Mean Absolute Error (MAE)", y="Count") +
plotTheme()
Generalizability is tested based on the scope of time and space. From the perspective of time, as the series of scatter plots demonstrates, under-predicting problem consistently exists in different time interval .
# space time error evaluation
error_space_time <- week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
POST_ID = map(data, pull, POST_ID),
latitude = map(data, pull, latitude),
longitude = map(data, pull, longitude),
dotw = map(data, pull, dotw)) %>%
select(interval60, POST_ID, longitude,
latitude, Observed, Prediction, Regression,
dotw) %>%
unnest() %>%
filter(Regression == "CTime_Space_FE_timeLags_holidayLags_nn_price")%>%
mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))
error_space_time %>%
ggplot()+
geom_point(aes(x= Observed, y = Prediction), alpha=0.5, colour = "grey")+
geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "#ffc200")+
geom_abline(slope = 1, intercept = 0)+
facet_grid(time_of_day~weekend)+
labs(title="Observed vs Predicted",
x="Observed trips",
y="Predicted trips")+
ylim(0,25)+
plotTheme()
Generally, the model performs great when predicting AM Rush and Overnight, which are also the valley of parking volume. For peak hours such as Mid-day and PM Rush, the model underpredicts the peak value and MAE is relatively greater. From a spatial perspective, Our model has better prediction ability for parking spaces closeer to the center area.
error_space_time %>%
group_by(POST_ID,time_of_day, latitude, longitude) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
geom_sf(data = neighbor, fill = "white") +
geom_sf(data = sf.base, fill = "transparent", size=.75) +
geom_sf(data = roads_line, size = .75, color='#666666',alpha=0.1) +
geom_point(aes(x = longitude, y = latitude, color = MAE),
fill = "transparent", alpha = 1, size=1)+
scale_colour_viridis(direction = -1,
discrete = FALSE, option = "D")+
geom_sf(data = neighbor_center,color = '#ee8138', fill = "transparent",size=1,)+
ylim(min(dat_neighbour$latitude+0.001), max(dat_neighbour$latitude+0.001))+
xlim(min(dat_neighbour$longitude+0.001), max(dat_neighbour$longitude+0.001))+
facet_grid(~time_of_day)+
labs(title="Mean Absolute Errors by Time Period, Test Set")+
mapTheme()+
theme(panel.background = element_rect(fill = "#d5eff5"),panel.grid.major = element_blank())
The advantages of our model are good scale precision and accuracy. In our study, we predict the hourly occupancy of every single parking spaces. The high scale precision brings great commercial application value. It could predict potential available parking space during a certain interval and inform the drivers. Besides the accruracy of model is also reliable. The MAE of cross validation is from 0.0765 to 0.0804, which demonstrate generalizable and high accuracy.
However, the limitation of the regression model is also obvious. First is the low R-Square (0.58).It means that only 58% of the variance for hourly parking time occupancy is explained by independent variables. One potential reason is the skewed dependent variable. There are too much zero values, since the parking spaces are occupied during work time and are empty in most cases.To deal with this problem, we will try other regression models, such as Zero-Inflated Poisson in future exploration.
Second is under-pediction. We’ll try to add new features, such as the distance to the office and Daily Events in San Francisco, to help the model better capture the peak.
Last is time precision. Limited by computer performance, the time precision of our model is one hour. It’s not enough for the real application. Therefore, we consider improve the precision to 15min and run a larger data on a cloud server.
Aggregate history/input data & model outputs at different levels
Model Errors are also precious
3.4. Socialeconomics Features
Form the perspective of socialeconomics features, percent of people taking public transportation, total population and travel time have a strong correlation with accumulated parking time occupancy. As the scatter plots demonstrates, in census tracts where people prefer taking public transportation, the parking demand is smaller than other places. Besides, accumulated parking time occupancy is positively correlated with total population and travel time.