Space-Time Prediction of Parking Demand in San Francisco

library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
library(tidycensus)
library(viridis)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(RSocrata)
library(gifski)
library(gganimate)
library(osmdata)
library(FNN)
library(geosphere)
library(spdep)
library(caret)
library(sjPlot)
library(sjmisc)
library(sjlabelled)

plotTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle = element_text(face="italic"),
    plot.caption = element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_line("grey80", size = 0.1),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2),
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(size=12),
    axis.title = element_text(size=12),
    axis.text = element_text(size=10),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "black", face = "italic"),
    legend.text = element_text(colour = "black", face = "italic"),
    strip.text.x = element_text(size = 14)
  )
}

mapTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle=element_text(face="italic"),
    plot.caption=element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2)
  )
}

palette5 <- c("#f6e620","#96d83f","#59c765","#1f948c","#375b8d")
palette3 <- c("#696969","grey","#ffc200")
palette2 <- c("#696969","#ffc200")

q5 <- function(variable) {as.factor(ntile(variable, 5))}

1. Introduction

1.1. Background

Automobiles are cells running through the vessels of any aggregation of human activities. With the introduction of smart technologies into a variety of aspects of planning, policymaker and planners struggled with long-lasting problems like air pollution, traffic congestion and real-time route planning have gained relief to some extent. However, the data-driven approach of parking, is still in the lime light.

We can predict for the occupancy of parking space, but is that all we can offer as a solution? With demand-based model, we can know where to find the most available parking spaces. And city planners are using dynamic pricing systems to adjust for the unbalanced parking demand through time and space. The solutions are proposed to reduce waiting and cruising time around high-demand regions, and increase the usage of some low-demand spaces.

Here’s more to consider:

what if the predicted free space seen by a driver is quickly taken by another driver who also happens to see the information?
how do we deal with the situation that many users are requesting info. of the same region?
what if the predicted low-demand zones are quickly “replenished” with cars because they have noticed the convenience and cheaper prices compared to the center zones?
what if we can not just adjust for the demand-side, but encourage the extension of supply-side?

It’s time to form a rebalancing approach that involves both the public and private sectors to offer an integrated solution.

We are Thunder Parking.

Demand-drive, with more supply side power in market.
Efficiency-wise. Not just plans for parking, but an integrated design with other traffic modes.
Space/Time model based, but involving the active participation of multiple group of users.

For detailed use cases, please watch our video.

1.2. Research Area

As it is shown in the meter distribution map, majority of meters in San Francisco are located in Financial District and Downtown. Due to urban problems such as centralization and separation of work and residence, both neighborhoods are also the disaster areas for parking problems. Since the parking meters are set a extremely high density in the center city, we focus on the neighborhoods adjacent to the Financial District and Downtown for two reasons.

On the one hand, these neighborhoods are close to the city center, and the number of on-street parking spaces has not reached the maximum limitation. By predicting the parking demand in these neighborhoods, government could consider to build increase parking meters in these areas to ease the shortage of parking spaces in the city center. Secondly,these areas have the potential to develop the private parking space rental market. Unlike the city center, our focused neighborhoods are mainly residential. A great number of housing parking spaces are available during workdays. These families can consider renting residual parking spaces to workers living in the suburbs and working in the city.

grid.arrange(ncol=2,
ggplot() + 
  geom_sf(data = neighbor, fill = "azure") +
  geom_sf(data = meters.sub2, size = .75, color='#297b8e',alpha=0.2) + 
  labs(title = "Spatial Distribution of Parking Meters, SF,2018",
       caption = "Data resource: SF OpenData") + 
  mapTheme(),

ggplot() + 
  geom_sf(data = neighbor, fill = "azure") +
  geom_sf(data = neighbor_all, aes(fill = legend), show.legend = , size= 0) +
  scale_fill_manual(values = c("#297b8e","#ffc200")) +
  labs(title = "Research Area") + 
  mapTheme())

2. Data

2.1 Data Cleaning and Organization

There parking records are downloaded from San Francisco Opendata. Before exploratory analysis, we drop the records where parking start time is before or the same as the end time. To better predict the parking demand of each meter during different time interval. A panel is created to record hourly parking space occupancy for each meters, which contains more than 800 million rows in total.

# extract a subset of 5 week dataset located in research area
dat_neighbour <- dat_neighbour[!(dat_neighbour$SESSION_START_DT>=dat_neighbour$SESSION_END_DT),]

temp <- dat_neighbour %>% 
  mutate(interval60=floor_date(ymd_hms(SESSION_START_DT),unit='hour'),
         interval60.start=floor_date(ymd_hms(SESSION_START_DT),unit='hour'),
         #interval15.start=floor_date(ymd_hms(SESSION_START_DT),unit='15mins'),
         interval60.end=floor_date(ymd_hms(SESSION_END_DT),unit='hour'),
         #interval15.end=floor_date(ymd_hms(SESSION_END_DT),unit='15mins'),
         week=week(interval60.start),
         dotw=wday(interval60.start,label=TRUE),
         session_duration_sec=ymd_hms(SESSION_END_DT)-ymd_hms(SESSION_START_DT))%>%
  mutate(payment_per_hour= round(GROSS_PAID_AMT/as.numeric(session_duration_sec)*3600,3))

# construct a 5-week panel template with time intervals from weather.Panel and parking meters from the extracted subset
temp.study.panel <- expand.grid(interval60 = unique(weather.Panel$interval60),
                                POST_ID = unique(temp$POST_ID))

# join data to panel template
temp.parking.panel <- left_join(temp.study.panel, st_drop_geometry(temp)
                                %>% select(interval60,POST_ID,SESSION_START_DT,SESSION_END_DT, 
                                           interval60.start, interval60.end, METER_EVENT_TYPE,payment_per_hour), 
                                by = c('interval60', 'POST_ID'))

# lags, lags, lags to fill the dataset
temp.occupied <- temp.parking.panel %>% select(interval60,POST_ID,SESSION_START_DT,SESSION_END_DT, interval60.start, interval60.end, METER_EVENT_TYPE) %>%
  mutate(interval60.end = case_when(is.na(interval60.end)&lag(interval60.end, 1) >= interval60 ~ lag(interval60.end, 1) ,
                                    TRUE ~ interval60.end),
         interval60.start = case_when(is.na(interval60.start)&lag(interval60.end, 1) >= interval60 ~ lag(interval60.start, 1) ,
                                      TRUE ~ interval60.start),
         SESSION_START_DT = case_when(is.na(SESSION_START_DT)&lag(interval60.end, 1) >= interval60 ~ lag(SESSION_START_DT, 1) ,
                                      TRUE ~ SESSION_START_DT),
         SESSION_END_DT = case_when(is.na(SESSION_END_DT)&lag(interval60.end, 1) >= interval60 ~ lag(SESSION_END_DT, 1) ,
                                    TRUE ~ SESSION_END_DT),
         METER_EVENT_TYPE = case_when(is.na(METER_EVENT_TYPE)&lag(interval60.end, 1) >= interval60 ~ lag(METER_EVENT_TYPE, 1) ,
                                      TRUE ~ METER_EVENT_TYPE)
  ) %>% 
  mutate(interval60.end = case_when(is.na(interval60.end)&lag(interval60.end, 2) >= interval60 ~ lag(interval60.end, 2) ,
                                    TRUE ~ interval60.end),
         interval60.start = case_when(is.na(interval60.start)&lag(interval60.end, 2) >= interval60 ~ lag(interval60.start, 2) ,
                                      TRUE ~ interval60.start),
         SESSION_START_DT = case_when(is.na(SESSION_START_DT)&lag(interval60.end, 2) >= interval60 ~ lag(SESSION_START_DT, 2) ,
                                      TRUE ~ SESSION_START_DT),
         SESSION_END_DT = case_when(is.na(SESSION_END_DT)&lag(interval60.end, 2) >= interval60 ~ lag(SESSION_END_DT, 2) ,
                                    TRUE ~ SESSION_END_DT),
         METER_EVENT_TYPE = case_when(is.na(METER_EVENT_TYPE)&lag(interval60.end, 2) >= interval60 ~ lag(METER_EVENT_TYPE, 2) ,
                                      TRUE ~ METER_EVENT_TYPE)
  ) %>%
  mutate(interval60.end = case_when(is.na(interval60.end)&lag(interval60.end, 3) >= interval60 ~ lag(interval60.end, 3) ,
                                    TRUE ~ interval60.end),
         interval60.start = case_when(is.na(interval60.start)&lag(interval60.end, 3) >= interval60 ~ lag(interval60.start, 3) ,
                                      TRUE ~ interval60.start),
         SESSION_START_DT = case_when(is.na(SESSION_START_DT)&lag(interval60.end, 3) >= interval60 ~ lag(SESSION_START_DT, 3) ,
                                      TRUE ~ SESSION_START_DT),
         SESSION_END_DT = case_when(is.na(SESSION_END_DT)&lag(interval60.end, 3) >= interval60 ~ lag(SESSION_END_DT, 3) ,
                                    TRUE ~ SESSION_END_DT),
         METER_EVENT_TYPE = case_when(is.na(METER_EVENT_TYPE)&lag(interval60.end, 3) >= interval60 ~ lag(METER_EVENT_TYPE, 3) ,
                                      TRUE ~ METER_EVENT_TYPE)
  ) %>%
  mutate(interval60.end = case_when(is.na(interval60.end)&lag(interval60.end, 4) >= interval60 ~ lag(interval60.end, 4) ,
                                    TRUE ~ interval60.end),
         interval60.start = case_when(is.na(interval60.start)&lag(interval60.end, 4) >= interval60 ~ lag(interval60.start, 4) ,
                                      TRUE ~ interval60.start),
         SESSION_START_DT = case_when(is.na(SESSION_START_DT)&lag(interval60.end, 4) >= interval60 ~ lag(SESSION_START_DT, 4) ,
                                      TRUE ~ SESSION_START_DT),
         SESSION_END_DT = case_when(is.na(SESSION_END_DT)&lag(interval60.end, 4) >= interval60 ~ lag(SESSION_END_DT, 4) ,
                                    TRUE ~ SESSION_END_DT),
         METER_EVENT_TYPE = case_when(is.na(METER_EVENT_TYPE)&lag(interval60.end, 4) >= interval60 ~ lag(METER_EVENT_TYPE, 4) ,
                                      TRUE ~ METER_EVENT_TYPE)
  ) %>%
  mutate(interval60.end = case_when(is.na(interval60.end)&lag(interval60.end, 5) >= interval60 ~ lag(interval60.end, 5) ,
                                    TRUE ~ interval60.end),
         interval60.start = case_when(is.na(interval60.start)&lag(interval60.end, 5) >= interval60 ~ lag(interval60.start, 5) ,
                                      TRUE ~ interval60.start),
         SESSION_START_DT = case_when(is.na(SESSION_START_DT)&lag(interval60.end, 5) >= interval60 ~ lag(SESSION_START_DT, 5) ,
                                      TRUE ~ SESSION_START_DT),
         SESSION_END_DT = case_when(is.na(SESSION_END_DT)&lag(interval60.end, 5) >= interval60 ~ lag(SESSION_END_DT, 5) ,
                                    TRUE ~ SESSION_END_DT),
         METER_EVENT_TYPE = case_when(is.na(METER_EVENT_TYPE)&lag(interval60.end, 5) >= interval60 ~ lag(METER_EVENT_TYPE, 5) ,
                                      TRUE ~ METER_EVENT_TYPE)
  ) %>%
  mutate(interval60.end = case_when(is.na(interval60.end)&lag(interval60.end, 6) >= interval60 ~ lag(interval60.end, 6) ,
                                    TRUE ~ interval60.end),
         interval60.start = case_when(is.na(interval60.start)&lag(interval60.end, 6) >= interval60 ~ lag(interval60.start, 6) ,
                                      TRUE ~ interval60.start),
         SESSION_START_DT = case_when(is.na(SESSION_START_DT)&lag(interval60.end, 6) >= interval60 ~ lag(SESSION_START_DT, 6) ,
                                      TRUE ~ SESSION_START_DT),
         SESSION_END_DT = case_when(is.na(SESSION_END_DT)&lag(interval60.end, 6) >= interval60 ~ lag(SESSION_END_DT, 6) ,
                                    TRUE ~ SESSION_END_DT),
         METER_EVENT_TYPE = case_when(is.na(METER_EVENT_TYPE)&lag(interval60.end, 6) >= interval60 ~ lag(METER_EVENT_TYPE, 6) ,
                                      TRUE ~ METER_EVENT_TYPE)
  ) %>%
  mutate(interval60.end = case_when(is.na(interval60.end)&lag(interval60.end, 7) >= interval60 ~ lag(interval60.end, 7) ,
                                    TRUE ~ interval60.end),
         interval60.start = case_when(is.na(interval60.start)&lag(interval60.end, 7) >= interval60 ~ lag(interval60.start, 7) ,
                                      TRUE ~ interval60.start),
         SESSION_START_DT = case_when(is.na(SESSION_START_DT)&lag(interval60.end, 7) >= interval60 ~ lag(SESSION_START_DT, 7) ,
                                      TRUE ~ SESSION_START_DT),
         SESSION_END_DT = case_when(is.na(SESSION_END_DT)&lag(interval60.end, 7) >= interval60 ~ lag(SESSION_END_DT, 7) ,
                                    TRUE ~ SESSION_END_DT),
         METER_EVENT_TYPE = case_when(is.na(METER_EVENT_TYPE)&lag(interval60.end, 7) >= interval60 ~ lag(METER_EVENT_TYPE, 7) ,
                                      TRUE ~ METER_EVENT_TYPE)
  )

temp.occupied.calibrated_1 <- temp.occupied %>%
  mutate(
    Ratio_Time_Occupied = case_when(
      # session start and ends in the same hour
      interval60.start == interval60 &
        interval60.end == interval60 ~ as.numeric(ymd_hms(SESSION_END_DT) - ymd_hms(SESSION_START_DT))/3600,  # precision: sec
      # session ends in hour(s) later (not inherited)
      interval60.start == interval60 &
        interval60.end > interval60 ~ 1.0 - as.numeric(ymd_hms(SESSION_START_DT) - ymd_hms(interval60))/3600, # precision: min
      # session inherited from previous and ends in the same hour
      interval60.start < interval60 &
        interval60.end == interval60 ~ as.numeric(ymd_hms(SESSION_END_DT) - ymd_hms(interval60))/3600, # precision: 
      # session inherited from previous and ends in hour(s) later
      interval60.start < interval60 &
        interval60.end > interval60 ~ 1.0),
    Ratio_Time_Occupied = replace_na(Ratio_Time_Occupied, 0)
  ) %>%
  mutate(Ratio_Time_Occupied = case_when(
    # interval60 with both inherited sessions and initial sessions
    interval60.start == interval60 &
      lag(interval60.end, 1) == interval60 &
      lag(interval60.start, 1) < interval60 &
      ymd_hms(lag(SESSION_END_DT, 1)) < ymd_hms(SESSION_START_DT) ~ Ratio_Time_Occupied + as.numeric(ymd_hms(lag(SESSION_END_DT, 1)) - ymd_hms(interval60))/3600,  
    TRUE ~ as.numeric(Ratio_Time_Occupied))) 

temp.occupied.calibrated_2 <- temp.occupied.calibrated_1 %>%
  mutate(Ratio_Time_Occupied = case_when(# AT with starttime before the end of last session
    METER_EVENT_TYPE == 'AT' & lag(METER_EVENT_TYPE, 1) == 'NS' &
      ymd_hms(lag(SESSION_END_DT, 1)) >= ymd_hms(SESSION_START_DT) &
      lag(interval60.end, 1) > interval60.start &
      interval60.end > interval60.start ~ 0,
    METER_EVENT_TYPE == 'AT' & lag(METER_EVENT_TYPE, 1) == 'NS' &
      ymd_hms(lag(SESSION_END_DT, 1)) >= ymd_hms(SESSION_START_DT) &
      lag(interval60.end, 1) == interval60.start &
      interval60.end > interval60.start ~ as.numeric(ymd_hms(interval60.end) - ymd_hms(lag(SESSION_END_DT, 1)))/3600,
    TRUE ~ as.numeric(Ratio_Time_Occupied)))

group.occupied <- temp.occupied.calibrated_2 %>%
  group_by(interval60, POST_ID) %>% summarize(Ratio_Time_Occupied = round(sum(Ratio_Time_Occupied), 3))

#create parking_meter.panel
parking_meter.panel <-
  temp.parking.panel %>%select(interval60,POST_ID,payment_per_hour)%>%
  left_join(group.occupied, by = c("POST_ID" = "POST_ID","interval60" = "interval60")) %>%
  left_join(weather.Panel, by = "interval60") %>%
  left_join(meters.sub2, by = c("POST_ID" = "post_id")) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label = TRUE))

# create time lags
parking_meter.panel <- 
  parking_meter.panel %>% 
  arrange(POST_ID, interval60) %>% 
  group_by(POST_ID) %>%
  mutate(lagHour = dplyr::lag(Ratio_Time_Occupied,1),
         lag2Hours = dplyr::lag(Ratio_Time_Occupied,2),
         lag12Hours = dplyr::lag(Ratio_Time_Occupied,3),
         lag1day = dplyr::lag(Ratio_Time_Occupied,24)) %>%
  ungroup() %>%
  mutate(day = yday(interval60)) %>%
  # holiday: 1st day (NewYear), 15th day (MLK)
  mutate(holiday = case_when(day == 1 ~ 1,
                             day == 15 ~ 1),
         holiday = replace_na(holiday, 0))

# holiday lag (this is a strange method...but I haven't come up with better ones)
temp.holiday.lag <- st_drop_geometry(parking_meter.panel) %>%
  select(day, holiday) %>% distinct() %>% 
  mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
                                dplyr::lag(holiday, 2) == 1 ~ "PlusTwoDays",
                                dplyr::lag(holiday, 3) == 1 ~ "PlusThreeDays",
                                dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",
                                dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays",
                                dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays"),
         holidayLag = replace_na(holidayLag, 0))

parking_meter.panel <- parking_meter.panel %>% left_join(., temp.holiday.lag)


parking_meter.panel <- subset(parking_meter.panel,select= -c(legend))

#fill NA
parking_meter.panel[is.na(parking_meter.panel)] <- 0

#add neighborhood name
parking_meter.panel <- parking_meter.panel%>%st_sf()%>%st_transform(crs=4326)%>%
                      st_join(neighbor_sub%>%st_transform(crs=4326),
                      join=st_intersects,left = TRUE)%>%
  mutate(longitude = unlist(map(geometry, 1)),
         latitude = unlist(map(geometry, 2)))%>%
  na.omit()%>%
  st_sf()

parking_meter.panel <- subset(parking_meter.panel,select= -c(legend))

# split training and test set

parking_meter.Train <- parking_meter.panel %>% filter( week == 2 | week == 3 | week == 4)
parking_meter.Test <- parking_meter.panel %>% filter(week == 1| week == 5) 
length(parking_meter.Train$interval60) + length(parking_meter.Test$interval60)

2.2. Dependent Variables

Dependent variable, hourly occupancy of each parking space, is the what we want to predict with regression model. We split the panel into Training and Testing sets for modeling and validation. To ensure both of the training and testing sets contain a holiday, week1 and week5 are selected as the test data. Week2, week3, week4 are utilized for training.

Note that the actual parking time occupancy could be smaller than the recorded time period. Since the parking end time is recorded when drivers make a payment. In San Francisco, there are two payment methods. One is paying for the bill when drivers are leaving. The other is to estimate your parking time and pay for the parking fee in advance. Usually, drivers leave the parking space before the recorded time to avoid parking tickets when they choose the second payment method.

# parking_count serial autocorrelation
NewYear <- as.POSIXct("2018-01-01 01:00:00 UTC")
MLK <- as.POSIXct("2018-01-15 01:00:00 UTC")

st_drop_geometry(rbind(
  mutate(parking_meter.Train, Legend = "Training"), 
  mutate(parking_meter.Test, Legend = "Testing"))) %>%
  group_by(Legend, interval60) %>% 
  summarize(Ratio_Time_Occupied = sum(Ratio_Time_Occupied)) %>%
  ungroup() %>% 
  ggplot(aes(interval60, Ratio_Time_Occupied, colour = Legend)) + geom_line() +
  scale_colour_manual(values = palette2) +
  geom_vline(xintercept = NewYear, linetype = "dotted", colour = 'red',size=1) +
  geom_vline(xintercept = MLK, linetype = "dotted", colour = 'red',size=1) +
  labs(title="Parking volumes, the first five weeks of 2018",
       subtitle="Red Dotted lines for New Year's Day and Martin Luther King's Day", 
       x="Day", y="Parking Count") +
  plotTheme() + theme(panel.grid.major = element_blank())

2.3. Potential Independent Varibales

Independent variables are some features that may effects the demand of parking spaces. We divide these characteristics into five categories, including Parking Characteristics, Natural Features, Built Environment Features, Socioeconomics Features, Time Features and Spatial Features. We call these features as potential independent variables, since not all of the features listed in the table below are included in our final regression model.

Parking Characteristics: Characteristics included in the parking records, such as payment type and parking fee.
Natural Features: Weather data including percipitation, weed speed and temperature.
Built Environment Features: Distance to the nearest facilities and amenities.
Socioeconomics Features: Census data including total population, percent of poverty by tracts. Note that we also includes crime data in this part.
Time Features: One is time lag, it counts the mean parking occupancy in a recent time period. The other is holiday lag, it marks days before and after certain holiday.
Spatial Features: Neighborhood name is considered as a spatial feature.

3. Exploratory analysis

3.1. Parking Characteristics

Parking payment is decided by both of parking location and occupancy hours. The parking records is divided into 2 parts based on the median value of 1.5 dollars. As the charts demonstrate, drivers who payed less are more likely to park for additional time. It is obvious that divers prefer to pay cash when parking payment is less than 1.5 dollars. When the payment is above 1.5 dollars, count of credit card and cell phone payments boom.

meter.template %>%
    dplyr::select(Parking_Counter, PAYMENT_TYPE, METER_EVENT_TYPE) %>%na.omit()%>%
    gather(Variable, value, -Parking_Counter) %>%
    count(Variable, value, Parking_Counter) %>%
      ggplot(., aes(value, n, fill = Parking_Counter)) +   
        geom_bar(position = "dodge", stat="identity") +
        facet_wrap(~Variable, scales="free") +
        scale_fill_manual(values = palette2) +
        labs(x="y", y="Value",
             title = "Feature associations with the likelihood of taking the credit",
             subtitle = "Categorical features") +
        plotTheme()+
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

3.2. Natrual Features

Weather data of San Francisco from January to February, 2018 is downloaded from Iowa Environment Mesonet using the riem package.

Intuitively, weather condition seems not to influence drivers sitting in a car. However, for the residents who are used to take a bike or public transits, they may choose to drive at a rainy or windy day, which increase the potential parking demand.

weatherDataViz <- grid.arrange(
  ggplot(weather.Panel, aes(interval60,Precipitation)) + geom_line(colour = "#ffc200",size = 0.8) + 
    labs(title="Precipitation", x="Hour", y="Precipitation") + plotTheme(),
  ggplot(weather.Panel, aes(interval60,Wind_Speed)) + geom_line(colour = "#1f948c",size = 0.8) + 
    labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme(),
  ggplot(weather.Panel, aes(interval60,Temperature)) + geom_line(colour = "#375b8d",size = 0.8) + 
    labs(title="Temperature", x="Hour", y="Temperature") + plotTheme(),
  top="Weather Data - San Francisco - SFO - Jan 01 - Feb 04, 2018")

3.3. Built Environmental Features

We assume that crime may influence parking choice, because people would change their parking location once their car was stolen. Theft from buildings are selected from the San Francisco Open crime data, 2018. As the map illustrates, The parking spaces in our research area is relatively safer than parking spaces within the city center. This could be a reason to encourage drivers to park outside the downtown and financial district.

ggplot() + 
  geom_sf(data = neighbor, fill = "white") +
  geom_sf(data = sf.base, fill = "transparent", size=.75) +
  geom_sf(data = meters.sub2, aes(colour = q5(theft_nn5)), show.legend = "point", size = .75, alpha=.5) + 
  scale_colour_manual(values = palette5, 
                      labels = qBr(meters.sub2, "theft_nn5"),
                      name = "Quantile\nBreaks") + 
  geom_sf(data = roads_line, size = .75, color='#666666',alpha=0.1) +
  geom_sf(data = neighbor_center,color = '#ee8138', fill = "transparent",size=1,)+
  ylim(min(dat_neighbour$latitude+0.001), max(dat_neighbour$latitude+0.001))+
  xlim(min(dat_neighbour$longitude+0.001), max(dat_neighbour$longitude+0.001))+
  labs(title = "Average Distance to the 5 Closest Theft from Vehicles, SF,2018",
       caption = "Data resource: SF OpenData") + 
  mapTheme()+
  theme(panel.background = element_rect(fill = "#d5eff5"),panel.grid.major = element_blank())

Below is a map showing the distance from each parking spaces to the closest Muni Bus Station. As we can observe, majority of parking spaces in our research area are close to a bus station. It’s practical to park at the surrounding neighborhoods and take a bus to the downtown with only one or two stops. It’s much better than cruising for a parking space in downtown at A.M. rush.

ggplot() + 
  geom_sf(data = neighbor, fill = "white") +
  geom_sf(data = sf.base, fill = "transparent", size=.75) +
  geom_sf(data = meters.sub2, aes(colour = q5(stops_nn1)), show.legend = "point", size = .75, alpha=.5) + 
  scale_colour_manual(values = palette5, 
                      labels = qBr(meters.sub2, "stops_nn1"),
                      name = "Quantile\nBreaks") + 
  geom_sf(data = roads_line, size = .75, color='#666666',alpha=0.1) +
  geom_sf(data = neighbor_center,color = '#ee8138', fill = "transparent",size=1,)+
  ylim(min(dat_neighbour$latitude+0.001), max(dat_neighbour$latitude+0.001))+
  xlim(min(dat_neighbour$longitude+0.001), max(dat_neighbour$longitude+0.001))+
  labs(title = "Distance to the Closest Bus Stops, SF,2018",
       caption = "Data resource: SF OpenData") + 
  mapTheme()+
  theme(panel.background = element_rect(fill = "#d5eff5"),panel.grid.major = element_blank())

Seeing from the series of scatter plots, the correlation between the hourly parking time occupancy and distance to the urban amenities and crimes is not significantly strong. We speculate reason is that urban facilities are relatively abundant within the scope of our study, so the distance discrepancy is not obvious among parking spaces.

# parking volumne as a function of census variables
# note: for week 4 here (faster loading)
# plotData_nn.census <- 
#   as.data.frame(parking_meter.Test2) %>%
#   filter(week == 1) %>%
#   select(POST_ID, Ratio_Time_Occupied) %>%
#   group_by(POST_ID) %>% 
#   summarize(Ratio_Time_Occupied = sum(Ratio_Time_Occupied))  %>%
#   left_join(meters.sub2_template, by=c("POST_ID" = "post_id")) %>%
#   gather(Variable, Value, -POST_ID, -Ratio_Time_Occupied) %>%
#   na.omit() %>%
#   filter(Variable =="recreation_nn1"| Variable == "theft_nn5"|Variable == "stops_nn1"|
#            Variable == "RoadDist"|Variable == "hos_nn1"|Variable == "faci_nn3") 

# correlation calculation
correlation_nn.census <-
  plotData_nn.census %>%
  group_by(Variable) %>%  
  summarize(correlation = round(cor(Value, Ratio_Time_Occupied),2))
# plot
ggplot(plotData_nn.census, aes(Value,log(Ratio_Time_Occupied))) + 
  geom_point(color="grey",alpha = 0.1) + 
  geom_smooth(method="lm", se = F,color='#ffc200',size=1) +
  facet_wrap(~Variable, scales="free", ncol=3) +
  geom_text(data=correlation_nn.census, aes(label=paste("R =", correlation)),
            colour = "black", x=-Inf, y=Inf, hjust=-0.1, vjust=1.2) +
  plotTheme() +
  labs(title="Parking Time Occupancy as a Function of Distance to Amenities and Crimes",
       subtitle= "Week 1, 2018")

3.4. Socialeconomics Features

Form the perspective of socialeconomics features, percent of people taking public transportation, total population and travel time have a strong correlation with accumulated parking time occupancy. As the scatter plots demonstrates, in census tracts where people prefer taking public transportation, the parking demand is smaller than other places. Besides, accumulated parking time occupancy is positively correlated with total population and travel time.

# parking volumne as a function of census variables
# note: for week 1 here (faster loading)
# plotData.census <- 
#   as.data.frame(parking_meter.Test2) %>%
#   filter(week == 1) %>%
#   select(GEOID, Occupancy) %>%
#   group_by(GEOID) %>% 
#   summarize(Occupancy = sum(Occupancy))  %>%
#   left_join(SFCensus18, by="GEOID") %>%
#   select(Occupancy,GEOID,Med_Inc,Total_Pop,Percent_White,Travel_Time,
#          Num_Commuters,Pct_Bachelors,Pct_Poverty,Percent_Taking_Public_Trans,
#          Med_Age, Total_Public_Trans) %>%
#   gather(Variable, Value, -GEOID, -Occupancy) %>%
#   na.omit() %>%
#   filter(Variable =="Med_Inc"| Variable == "Total_Pop"|Variable == "Percent_White"|
#            Variable == "Travel_Time"|Variable == "Percent_Taking_Public_Trans"|Variable == "Med_Age"|
#            Variable == "Pct_Bachelors"|Variable == "Pct_Poverty"|Variable == "Total_Public_Trans") 

# correlation calculation
correlation.census <-
  plotData.census %>%
  group_by(Variable) %>%  
  summarize(correlation = round(cor(Value, Occupancy),2))
# plot
ggplot(plotData.census, aes(Value,log(Occupancy))) + 
  geom_point(color="grey",alpha = 0.4) + 
  geom_smooth(method="lm", se = F,color='#ffc200') +
  facet_wrap(~Variable, scales="free", ncol=3) +
  geom_text(data=correlation.census, aes(label=paste("R =", correlation)),
            colour = "black", x=-Inf, y=Inf, hjust=-0.1, vjust=1.2) +
  plotTheme() +
  labs(title="Parking Occupancy as a function of census variables",
       subtitle= "Week 1, 2018")

3.5. Time and SpatialFeatures

Below is an animated map showing the fluctuation of hourly parking time occupancy during a weekday. There are two main parking peak periods. One is between 9 a.m. to 12 p.m. and the other is between 2 p.m. and 5 p.m.. These two periods exactly correspond to the working hours of most companies. It indicates the parking spaces near downtown San Francisco mainly meets the parking demand of office workers.

week1 <-
  filter(parking_meter.panel , week == 1 & dotw == "Tue")%>%
  select(interval60, geometry, Ratio_Time_Occupied)


ride.animation.data <-
  week1%>%
  mutate(Occupancy = case_when(Ratio_Time_Occupied == 0 ~ "0%",
                               Ratio_Time_Occupied > 0 & Ratio_Time_Occupied <= 0.5 ~ "< 50%",
                               Ratio_Time_Occupied > 0.5 & Ratio_Time_Occupied <= 0.75 ~ "50-75%",
                               Ratio_Time_Occupied > 0.75 & Ratio_Time_Occupied <=0.95 ~ "75-95%",
                               Ratio_Time_Occupied > 0.95 ~ ">95%")) %>%
  mutate(Occupancy  = factor(Occupancy, levels=c("0%", "< 50%","50-75%","75-95%",">95%")))


rideshare_animation <-
  ggplot() +
    geom_sf(data = neighbor, fill = "white") +
    geom_sf(data = sf.base, fill = "transparent", size=.75) +
    geom_sf(data = roads_line, size = .75, color='#666666',alpha=0.1) +
    geom_sf(data = ride.animation.data, aes(colour = Occupancy)) +
    scale_colour_manual(values = palette5) +
    labs(title = "Parking Time Occupancy by Meters for one day in January 2018",
         subtitle = "60 minute intervals: {current_frame}") +
    geom_sf(data = neighbor_center,color = '#ee8138', fill = "transparent",size=1,)+
    ylim(min(dat_neighbour$latitude+0.001), max(dat_neighbour$latitude+0.001))+
    xlim(min(dat_neighbour$longitude+0.001), max(dat_neighbour$longitude+0.001))+
    transition_manual(interval60) +
    mapTheme()+
    theme(panel.background = element_rect(fill = "#d5eff5"),panel.grid.major = element_blank())

animate(rideshare_animation, duration=20, renderer = gifski_renderer())

Following is a series of maps demonstrate the close relationship between parking demand and office work schedule near downtown San Francisco. Parking volume is high during the mid-day of weekdays. The volume declined around PM rush, since people are off work and drive away from the downtown.MOst parking spaces are available during weekends.

ggplot()+
  geom_sf(data = neighbor
            %>%
            st_transform(crs = 4326),color = "grey", fill = "transparent")+
  geom_point(data = dat_neighbour %>% 
               mutate(hour = hour(SESSION_START_DT),
                      weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
                      time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                              hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                              hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                              hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
               group_by(POST_ID, latitude, longitude, weekend, time_of_day) %>%
               tally(),
             aes(x=longitude, y = latitude, color = n), 
             fill = "transparent", alpha = 1, size = 1.0)+
  geom_sf(data = neighbor_center
          %>%
            st_transform(crs = 4326),color = '#666666', fill = "transparent",size=1)+
  scale_colour_viridis(direction = -1,
                       discrete = FALSE, option = "D")+
  ylim(min(dat_neighbour$latitude+0.001), max(dat_neighbour$latitude+0.001))+
  xlim(min(dat_neighbour$longitude+0.001), max(dat_neighbour$longitude+0.001))+
  facet_grid(weekend ~ time_of_day)+
  labs(title="Parking Occupancy per hour by meters. SF, 2018")+
  mapTheme

4. Modeling

4.1 Model Building

parking_meter.panel is split into a training (week2, week3, week4) and a test set week1, week5) to ensure both of the sets including holidays. To compare and find out the best regreession model, three linear models are built using the ‘lm’ function:

reg1 includes weather and time lag features.
reg2 adds the distance to urban facilities and theft as urban context effects.
reg3 adds hourly parking price and neighborhood as spatial effects.

# regression
reg1.m <- 
  lm(Ratio_Time_Occupied ~  + hour(interval60) + dotw + Temperature + Precipitation +
       lagHour + lag2Hours +lag12Hours + holidayLag + holiday, 
     data=parking_meter.Train)
reg2.m <- 
  lm(Ratio_Time_Occupied ~  + hour(interval60) + dotw + Temperature + Precipitation +
       lagHour + lag2Hours + +lag12Hours + holidayLag + holiday+
       recreation_nn1 + theft_nn5 + stops_nn1 + RoadDist + edu_nn3 + hos_nn1 + faci_nn3, 
     data=parking_meter.Train)
reg3.m <- 
  lm(Ratio_Time_Occupied ~  + hour(interval60) + dotw + Temperature + Precipitation +
       lagHour + lag2Hours + +lag12Hours + holidayLag + holiday+
        theft_nn5 + stops_nn1 + RoadDist + edu_nn3 + hos_nn1 + faci_nn3 + 
       payment_per_hour + neighborho, 
     data=parking_meter.Train)

# predict for test data
parking_meter.Test.weekNest <- 
  parking_meter.Test %>%
  nest(-week) 

week_predictions <- 
  parking_meter.Test.weekNest %>% 
  mutate(ATime_Space_FE_timeLags_holidayLags = map(.x = data, fit = reg1.m, .f = model_pred),
         BTime_Space_FE_timeLags_holidayLags_nn = map(.x = data, fit = reg2.m, .f = model_pred),
         CTime_Space_FE_timeLags_holidayLags_nn_price = map(.x = data, fit = reg3.m, .f = model_pred)) %>% 
  gather(Regression, Prediction, -data, -week) %>%
  mutate(Observed = map(data, pull, Ratio_Time_Occupied),
         Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
         MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
         sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))

4.2 Result Summary

Comparing the improved regression with the baseline regression, the added features are all significant and improve the R-square from 0.531 to 0.584. The closer R-square is to 1, the stronger the model’s ability to explain the independent variables. Clearly, 0.584 is not a great number and we will evaluate the reason in the discussion part.

# Regression results output
regression_table <-tab_model(reg1.m, reg3.m, auto.label = TRUE, show.ci = FALSE, show.intercept = FALSE, 
          title = 'Summary of Training Results',
          dv.labels = c("Baseline Regression","Improved Regression"),
          string.pred = "Coefficient")

Summary of Training Results
	Baseline Regression		Improved Regression
Coefficient	Estimates	p	Estimates	p
hour(interval60)	0.00	<0.001	0.00	<0.001
dotw [linear]	0.01	<0.001	0.01	<0.001
dotw [quadratic]	-0.03	<0.001	-0.01	<0.001
dotw [cubic]	0.01	<0.001	0.01	<0.001
dotw [4th degree]	-0.01	<0.001	-0.00	<0.001
dotw [5th degree]	0.01	<0.001	0.00	<0.001
dotw [6th degree]	0.00	<0.001	0.00	<0.001
Temperature	-0.01	<0.001	-0.00	<0.001
Precipitation	0.00	<0.001	0.00	<0.001
lagHour	0.72	<0.001	0.68	<0.001
lag2Hours	-0.05	<0.001	-0.05	<0.001
lag12Hours	0.05	<0.001	0.04	<0.001
holidayLag [MinusOneDay]	0.00	<0.001	0.00	<0.001
holidayLag [MinusThreeDays]	0.01	<0.001	0.01	<0.001
holidayLag [MinusTwoDays]	0.02	<0.001	0.01	<0.001
holidayLag [PlusOneDay]	0.01	<0.001	0.01	<0.001
holidayLag [PlusThreeDays]	0.00	<0.001	0.00	0.015
holidayLag [PlusTwoDays]	0.02	<0.001	0.01	<0.001
holiday	-0.00	<0.001	-0.00	0.016
theft_nn5			1.16	<0.001
stops_nn1			2.32	<0.001
RoadDist			0.71	0.010
edu_nn3			-0.19	0.003
hos_nn1			-0.12	<0.001
faci_nn3			1.76	<0.001
payment_per_hour			0.10	<0.001
neighborho [Nob Hill]			0.01	<0.001
neighborho [North Beach]			0.01	<0.001
neighborho [South of Market]			0.00	0.572
neighborho [Western Addition]			0.00	<0.001
Observations	5087470		5087470
R² / R² adjusted	0.531 / 0.531		0.584 / 0.584

Table 3.1

5. Validation

5.1. Examine Error Metrics for Accuracy

Mean Absolute Error(MAE) is absolute difference between observed and predicted value. Comparing the MAE results of three regression models, distance to the urban facilities doesn’t improve much of the accuracy within our research area. In the third regression model, hourly parking price and neighborhood features significantly improved the accuracy. This bar chart also indicate that prediction accuracy of week1 is better than that of week5. The potential reasons will be discussed following.

# MAE
week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
  geom_bar(aes(fill = Regression), position = "dodge", stat="identity", width=2) +
  scale_fill_manual(values = palette3) +
  labs(title = "Mean Absolute Errors by model specification and week") +
  plotTheme()

Below is a line chart comparing the observed and predicted parking value by time series. It indicates that our final model performs great when predicting the low point and under-predicts the peak moments. The observed parking volume on Saturday, week5 is higher than normal situation. We look up San Francisco events and find an Independent Film Festival on that day. This also explains why the MAE of the week5 is higher than week1.

# plot predict/observed time series
week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60),
         POST_ID = map(data, pull, POST_ID)) %>%
  dplyr::select(interval60, POST_ID, Observed, Prediction, Regression) %>%
  unnest() %>%
  gather(Variable, Value, -Regression, -interval60, -POST_ID) %>%
  group_by(Regression, Variable, interval60) %>%
  summarize(Value = sum(Value,na.rm=TRUE)) %>%
  ggplot(aes(interval60, Value, colour=Variable)) + 
  scale_colour_manual("",values = c("Observed" = "darkgray","Prediction" = "#ffc200"))+
  geom_line(size = 1.1) + 
  facet_wrap(~Regression, ncol=1) +
  labs(title = "Predicted/Observed parking volume time series", 
       subtitle = "San Francisco; A test set of 1 week",  x = "Hour", y= "Parking Volume") +
  plotTheme()

From a spatial scope, our model performs better accuracy on parking spaces close to the Downtown and Financial District.

# validate test set by space
# error_by_week_space <- week_predictions %>% 
#   mutate(interval60 = map(data, pull, interval60),
#          POST_ID = map(data, pull, POST_ID), 
#          latitude = map(data, pull, latitude), 
#          longitude = map(data, pull, longitude)) %>%
#   select(interval60, POST_ID, latitude, longitude, Observed, Prediction, Regression) %>%
#   unnest() %>%
#   filter(Regression == "CTime_Space_FE_timeLags_holidayLags_nn_price") %>%
#   group_by(POST_ID, latitude, longitude) %>%
#   summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))

error_by_week_space %>%
  ggplot(.)+
  geom_sf(data = neighbor, fill = "white") +
    geom_sf(data = sf.base, fill = "transparent", size=.75) +
    geom_sf(data = roads_line, size = .75, color='#666666',alpha=0.1) +
  geom_point(aes(x = longitude, y = latitude, color = MAE), 
             fill = "transparent", alpha = 1, size=1)+
  scale_colour_viridis(direction = -1,
                       discrete = FALSE, option = "D")+
  geom_sf(data = neighbor_center,color = '#ee8138', fill = "transparent",size=1,)+
  labs(title="Mean Abs Error, Test Set")+
  ylim(min(dat_neighbour$latitude+0.001), max(dat_neighbour$latitude+0.001))+
  xlim(min(dat_neighbour$longitude+0.001), max(dat_neighbour$longitude+0.001))+
  mapTheme()+
  theme(panel.background = element_rect(fill = "#d5eff5"),panel.grid.major = element_blank())

5.2. K-Fold Cross Validation

Limited by the computer performance, we randomly select 800 thousand of rows from the study panel and ran 10 Fold regression. Cross-validation is a resampling process, which shuffles data randomly. The dataset will be firstly split into 100 groups, and one group will be taken as a holdout, and the remaining will become a training dataset. Then, the model is fitted in the training set and will be tested against the test set. As we can see, the CV MAE is close to MAE of previous prediction result, which indicate the accuracy is generalizable.

parking_sample <- parking_meter.panel[sample(nrow(parking_meter.panel), 800000), ]
fitControl <- trainControl(method = "cv", number = 10)
set.seed(825)

reg.cv <- train(Ratio_Time_Occupied ~  + hour(interval60) + dotw + Temperature + Precipitation +
                lagHour + lag2Hours + +lag12Hours + holidayLag + holiday + theft_nn5 + stops_nn1 +
                  RoadDist + edu_nn3 + hos_nn1 + faci_nn3 + payment_per_hour, 
                  data=parking_sample,
                 method = "lm", trControl = fitControl, na.action = na.pass)

reg.cv.results <-reg.cv$resample

The Root Mean Squared Error (RMSE) is 0.2066965 which is the standard deviation of the absolute error. The distribution of the standard deviation of MAE is shown in the histogram below. The range of MAE is from 0.0765 to 0.0804, which is a small interval.

ggplot(as.data.frame(reg.cv.results), aes(MAE)) + 
  geom_histogram(bins = 50, colour="white", fill = "#1f948c") +
  labs(title="Regression C", subtitle = "k-fold cross validation; k = 10",
       x="Mean Absolute Error (MAE)", y="Count") +
  plotTheme()

5.3. Generalizability

Generalizability is tested based on the scope of time and space. From the perspective of time, as the series of scatter plots demonstrates, under-predicting problem consistently exists in different time interval .

# space time error evaluation
error_space_time <- week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60),
         POST_ID = map(data, pull, POST_ID), 
         latitude = map(data, pull, latitude), 
         longitude = map(data, pull, longitude),
         dotw = map(data, pull, dotw)) %>%
  select(interval60, POST_ID, longitude, 
         latitude, Observed, Prediction, Regression,
         dotw) %>%
  unnest() %>%
  filter(Regression == "CTime_Space_FE_timeLags_holidayLags_nn_price")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))
error_space_time %>%
  ggplot()+
  geom_point(aes(x= Observed, y = Prediction), alpha=0.5, colour = "grey")+
  geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "#ffc200")+
  geom_abline(slope = 1, intercept = 0)+
  facet_grid(time_of_day~weekend)+
  labs(title="Observed vs Predicted",
       x="Observed trips", 
       y="Predicted trips")+
  ylim(0,25)+
  plotTheme()

Generally, the model performs great when predicting AM Rush and Overnight, which are also the valley of parking volume. For peak hours such as Mid-day and PM Rush, the model underpredicts the peak value and MAE is relatively greater. From a spatial perspective, Our model has better prediction ability for parking spaces closeer to the center area.

error_space_time %>%
  group_by(POST_ID,time_of_day, latitude, longitude) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = neighbor, fill = "white") +
  geom_sf(data = sf.base, fill = "transparent", size=.75) +
  geom_sf(data = roads_line, size = .75, color='#666666',alpha=0.1) +
  geom_point(aes(x = longitude, y = latitude, color = MAE), 
             fill = "transparent", alpha = 1, size=1)+
  scale_colour_viridis(direction = -1,
                       discrete = FALSE, option = "D")+
  geom_sf(data = neighbor_center,color = '#ee8138', fill = "transparent",size=1,)+
  ylim(min(dat_neighbour$latitude+0.001), max(dat_neighbour$latitude+0.001))+
  xlim(min(dat_neighbour$longitude+0.001), max(dat_neighbour$longitude+0.001))+
  facet_grid(~time_of_day)+
  labs(title="Mean Absolute Errors by Time Period, Test Set")+
  mapTheme()+
  theme(panel.background = element_rect(fill = "#d5eff5"),panel.grid.major = element_blank())

6. Discussion

6.1. Advantages and limitations of Prediction Model

The advantages of our model are good scale precision and accuracy. In our study, we predict the hourly occupancy of every single parking spaces. The high scale precision brings great commercial application value. It could predict potential available parking space during a certain interval and inform the drivers. Besides the accruracy of model is also reliable. The MAE of cross validation is from 0.0765 to 0.0804, which demonstrate generalizable and high accuracy.

However, the limitation of the regression model is also obvious. First is the low R-Square (0.58).It means that only 58% of the variance for hourly parking time occupancy is explained by independent variables. One potential reason is the skewed dependent variable. There are too much zero values, since the parking spaces are occupied during work time and are empty in most cases.To deal with this problem, we will try other regression models, such as Zero-Inflated Poisson in future exploration.

Second is under-pediction. We’ll try to add new features, such as the distance to the office and Daily Events in San Francisco, to help the model better capture the peak.

Last is time precision. Limited by computer performance, the time precision of our model is one hour. It’s not enough for the real application. Therefore, we consider improve the precision to 15min and run a larger data on a cloud server.

6.2. Applied Scenarios

Aggregate history/input data & model outputs at different levels

General - tract/neighborhood level
- Adjustment regional price and planning - not just dynamic pricing For managers it helps to control the parking prices across tracts, some hints on integrated planning and the dispatch of enforcement power .
Individual meter level
- We predict for the Occupancy Time% of individual parking meters and also offer concurrent searching demand per user’s request. For drivers and homeowners, the supply and demand for each parking meter could help them to decide their travel and loan plans.
Absorption and analysis of user search requests (Idea - not integrated into model yet)
- We know there may be an unoccupied parking space. But what about 100 users are all look at it? We need to take the information from user requests into account.

Model Errors are also precious

Individual decision making: Deal? Cruising? Waiting in line?
- predicted occupancy per meter +/- mode errors at the meter, with reference to the mean absolute error, can provide the user a clear vision of chances that there will be a space at the parking meter by the time of arrival
  - E.g., a parking meter with predicted occupancy time% of [0.2 +/- 0.13] may offer the user a better chance of finding an occupied space than one showing [0.8 +/- 0.21]