Geospatial Risk Modeling - Chicago Public Safety: Theft from Building

Many cities are faced with similar strategic planning issues recently with the advanced techiniques to do with big data analytics and machine learning: paying money for predictions to reduce losses from crime incidents in the future? This project built a geospatial risk modeling for the public safety policy of Chicago, Illinois using the crime type Theft from Building data and available 311 request records dataset retrieved from Chicago City Data Portal: https://data.cityofchicago.org/.

Libraries

See the hiden code block for libraries and functions used in this project.

library(tidyverse)
library(sf)
library(RSocrata)
library(viridis)
library(spatstat)
library(raster)
library(spdep)
library(FNN)
library(grid)
library(gridExtra)
library(knitr)
library(kableExtra)
library(tidycensus)

#root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/";
#source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

Data Wrangling

We focused on crime incidents classified as Theft from Building in 2018 for model building. Data access and cleaning are enabled by sf and Socrata.

policeDistricts <- 
  st_read("https://data.cityofchicago.org/api/geospatial/fthy-xz3r?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(District = dist_num)
  
policeBeats <- 
  st_read("https://data.cityofchicago.org/api/geospatial/aerh-rz74?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(District = beat_num)

bothPoliceUnits <- rbind(mutate(policeDistricts, Legend = "Police Districts"), 
                         mutate(policeBeats, Legend = "Police Beats"))

theftBldg18 <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2018/3i3m-jwuy") %>% 
    filter(Primary.Type == "THEFT" & Description == "FROM BUILDING")%>%
    mutate(x = gsub("[()]", "", Location)) %>%
    separate(x,into= c("Y","X"), sep=",") %>%
    mutate(X = as.numeric(X),Y = as.numeric(Y))%>% 
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")%>%
    st_transform('ESRI:102271') %>% 
    distinct()

chicagoBoundary <- 
  st_read(file.path(root.dir,"/Chapter5/chicagoBoundary.geojson")) %>%
  st_transform('ESRI:102271')

Plotting Theft from Building 2018 data and density

# uses grid.arrange to organize indpendent plots
grid.arrange(ncol=2,
ggplot() + 
  geom_sf(data = chicagoBoundary) +
  geom_sf(data = theftBldg18, colour="red", size=0.1, show.legend = "point") +
  labs(title= "Theft from Building, Chicago - 2018") +
  mapTheme(title_size = 14),

ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "grey40") +
  stat_density2d(data = data.frame(st_coordinates(theftBldg18)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_viridis(option = "C")  + #plasma
  scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
  labs(title = "Density of Thefts from Building") +
  mapTheme(title_size = 14) + theme(legend.position = "none"))

Figure 1 | Point and density patterns of theft from building data, Chicago, 2018

It seemed that in 2018, Theft from Building incidents were highly concentrated on the "Loop" neighborhoods of chicago along the lake shore, and gradually downscaled in density radiating to the other regions within the Chicago city boundary. But from the point data patterns we could say that the problem of theft from building was common across the city area. We did not quite know if there existed difference in incidents reporting processes and efficiency between the more populous regions and the more sparse parts. Is this a ratio-and-scale problem? In other words, was the uneven distribution of theft from building incidents caused by the differences in the density of residents and business activities? We also might refer to the "broken window theory" that the extent of local disorderance was connected with the density of local crime incidents.

In this project, a geospatial risk predictive model of theft from building is created. We began by wrangling theft from building and risk factor data into geospatial features, correlating their exposure, and estimating models to predict theft from building latent risk. These models were then validated in part, by comparing predictions to a kernel density measure of geospatial crime risk.

Creating a fishnet grid

The crime_net includes counts of the dependent variable, theft from building. Next, a set of risk factor features are downloaded and wrangled to the fishnet.

## using {sf} to create the grid
## Note the `.[chicagoBoundary] %>% ` line. This is needed to clip the grid to our data
fishnet <- 
  st_make_grid(chicagoBoundary,
               cellsize = 500, 
               square = TRUE) %>%
  .[chicagoBoundary] %>% 
  st_sf() %>%
  mutate(uniqueID = rownames(.))

Aggregate points to the fishnet

## add a value of 1 to each crime, sum them with aggregate
crime_net <- 
  dplyr::select(theftBldg18) %>% 
  mutate(countTheftBldg18 = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(countTheftBldg18 = replace_na(countTheftBldg18, 0),
         uniqueID = rownames(.),
         cvID = sample(round(nrow(fishnet) / 24), 
                       size=nrow(fishnet), replace = TRUE))

ggplot() +
  geom_sf(data = crime_net, aes(fill = countTheftBldg18), color = NA) +
  scale_fill_viridis(option = "C")  + #plasma
  labs(title = "Count of Theft from Buildings for the fishnet") +
  mapTheme()

$Figure 2 | Theft from building incidents aggregated to the fishnet$

Figure 2 | Theft from building incidents aggregated to the fishnet

The figure above mapped the count of thefts from building by grid cell, and the clustered spatial process of theft from building began to take shape. Still, the loop region, or the central business disctrict of Chicago observed a pattern of clustering in terms of theft from building incidents.

Modeling Spatial Features

This chunk of codes read in risk factors using RSocrata package from Chicago data portal.

Totally twelve risk factors are downloaded, including 311 reports of abandoned cars, sanitation complaints, rodent baiting, street lights out, tree trim, tree debris, garbage carts, pot holes, graffiti remediation, along with a neighborhood polygon layer, a police station point layer, and the location of retail stores that sell liquor as well as retail stores that sell tobacco to go.

abandonCars <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Abandoned-Vehicles/3c9v-pnva") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2018") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Abandoned_Cars")

sanitationComplaints <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Sanitation-Code-Complaints-Hi/me59-5fac") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2018") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Sanitation_Complaints")

rodentBaiting <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Rodent-Baiting-Historical/97t6-zrhs") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2018") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Rodent_Baiting")

stLightsOut <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Street-Lights-All-Out-Histori/zuxi-7xem") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2018") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Steet_Lights_All_Out")

treeTrim <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Tree-Trims-Historical/uxic-zsuj") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2018") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Tree_Trim")

treeDebris <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Tree-Debris-Historical/mab8-y9h3") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2018") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Tree_Debris")

garbageCarts <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Garbage-Carts-Historical/9ksk-na4q") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2018") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Garbage_Carts")

potHoles <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Pot-Holes-Reported-Historical/7as2-ds3y") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2018") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Pot_Holes")

tobaccoRetail <- 
  read.socrata("https://data.cityofchicago.org/Community-Economic-Development/Tobacco/98qj-ah7k") %>%
  filter(BUSINESS.ACTIVITY == "Retail Sale of Tobacco") %>%
  dplyr::select(Y = LATITUDE, X = LONGITUDE) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Tobacco_Retail")

liquorRetail <- 
  read.socrata("https://data.cityofchicago.org/Community-Economic-Development/Liquor-Retail/4py5-yxxu") %>%
  filter(BUSINESS.ACTIVITY == "Retail Sales of Packaged Liquor") %>%
  dplyr::select(Y = LATITUDE, X = LONGITUDE) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Liquor_Retail")

graffiti <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Graffiti-Removal-Historical/hec5-y4x5") %>%
  mutate(year = substr(creation_date,1,4)) %>% filter(year == "2018") %>%
  filter(where_is_the_graffiti_located_ == "Front" |
           where_is_the_graffiti_located_ == "Rear" | where_is_the_graffiti_located_ == "Side") %>%
  dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Graffiti")

policeStations <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Police-Stations/z8bn-74gv/data") %>%
  dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Police_Stations")

## Neighborhoods to use in LOOCV in a bit
neighborhoods <- 
  st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>%
  st_transform(st_crs(fishnet))

Aggregate features to the fishnet

This code chunk aggregated the retrieved features to the existing fishnet.

vars_net <- abandonCars %>%
  st_join(., fishnet, join=st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID, Legend) %>%
  summarize(count = n()) %>%
  full_join(fishnet, by = "uniqueID") %>%
  spread(Legend, count, fill=0) %>%
  st_sf() %>%
  dplyr::select(-`<NA>`) %>%
  na.omit() %>%
  ungroup()

vars_net <- garbageCarts %>% st_join(., fishnet, join=st_within) %>% st_drop_geometry() %>% group_by(uniqueID, Legend) %>%
  summarize(count = n()) %>% full_join(vars_net, by = "uniqueID") %>% spread(Legend, count, fill=0) %>% st_sf() %>%
  dplyr::select(-`<NA>`) %>% na.omit() %>% ungroup()

vars_net <- graffiti %>% st_join(., fishnet, join=st_within) %>% st_drop_geometry() %>% group_by(uniqueID, Legend) %>%
  summarize(count = n()) %>% full_join(vars_net, by = "uniqueID") %>% spread(Legend, count, fill=0) %>% st_sf() %>%
  dplyr::select(-`<NA>`) %>% na.omit() %>% ungroup()

vars_net <- liquorRetail %>% st_join(., fishnet, join=st_within) %>% st_drop_geometry() %>% group_by(uniqueID, Legend) %>%
  summarize(count = n()) %>% full_join(vars_net, by = "uniqueID") %>% spread(Legend, count, fill=0) %>% st_sf() %>%
  dplyr::select(-`<NA>`) %>% na.omit() %>% ungroup()

vars_net <- policeStations %>% st_join(., fishnet, join=st_within) %>% st_drop_geometry() %>% group_by(uniqueID, Legend) %>%
  summarize(count = n()) %>% full_join(vars_net, by = "uniqueID") %>% spread(Legend, count, fill=0) %>% st_sf() %>%
  dplyr::select(-`<NA>`) %>% na.omit() %>% ungroup()

vars_net <- potHoles %>% st_join(., fishnet, join=st_within) %>% st_drop_geometry() %>% group_by(uniqueID, Legend) %>%
  summarize(count = n()) %>% full_join(vars_net, by = "uniqueID") %>% spread(Legend, count, fill=0) %>% st_sf() %>%
  dplyr::select(-`<NA>`) %>% na.omit() %>% ungroup()

vars_net <- rodentBaiting %>% st_join(., fishnet, join=st_within) %>% st_drop_geometry() %>% group_by(uniqueID, Legend) %>%
  summarize(count = n()) %>% full_join(vars_net, by = "uniqueID") %>% spread(Legend, count, fill=0) %>% st_sf() %>%
  dplyr::select(-`<NA>`) %>% na.omit() %>% ungroup()

vars_net <- sanitationComplaints %>% st_join(., fishnet, join=st_within) %>% st_drop_geometry() %>% group_by(uniqueID, Legend) %>%
  summarize(count = n()) %>% full_join(vars_net, by = "uniqueID") %>% spread(Legend, count, fill=0) %>% st_sf() %>%
  dplyr::select(-`<NA>`) %>% na.omit() %>% ungroup()

vars_net <- stLightsOut %>% st_join(., fishnet, join=st_within) %>% st_drop_geometry() %>% group_by(uniqueID, Legend) %>%
  summarize(count = n()) %>% full_join(vars_net, by = "uniqueID") %>% spread(Legend, count, fill=0) %>% st_sf() %>%
  dplyr::select(-`<NA>`) %>% na.omit() %>% ungroup()

vars_net <-tobaccoRetail %>% st_join(., fishnet, join=st_within) %>% st_drop_geometry() %>% group_by(uniqueID, Legend) %>%
  summarize(count = n()) %>% full_join(vars_net, by = "uniqueID") %>% spread(Legend, count, fill=0) %>% st_sf() %>%
  dplyr::select(-`<NA>`) %>% na.omit() %>% ungroup()

vars_net <-treeDebris %>% st_join(., fishnet, join=st_within) %>% st_drop_geometry() %>% group_by(uniqueID, Legend) %>%
  summarize(count = n()) %>% full_join(vars_net, by = "uniqueID") %>% spread(Legend, count, fill=0) %>% st_sf() %>%
  dplyr::select(-`<NA>`) %>% na.omit() %>% ungroup()

vars_net <-treeTrim %>% st_join(., fishnet, join=st_within) %>% st_drop_geometry() %>% group_by(uniqueID, Legend) %>%
  summarize(count = n()) %>% full_join(vars_net, by = "uniqueID") %>% spread(Legend, count, fill=0) %>% st_sf() %>%
  dplyr::select(-`<NA>`) %>% na.omit() %>% ungroup()

Risk Factors by Grid

vars_net.long <- 
  gather(vars_net, Variable, value, -geometry, -uniqueID)

vars <- unique(vars_net.long$Variable)
mapList <- list()

for(i in vars){
  mapList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(vars_net.long, Variable == i), aes(fill=value), colour=NA) +
      scale_fill_viridis(name="", option = 'C') +
      labs(title=i) +
      mapTheme()}

# do.call(grid.arrange,c(mapList, ncol = 3))

Figure 3.1 | Risk factors by fishnet

These risk factors illustrate various spatial processes.

Abandoned_Cars were mainly clustered in North and West Chicago, and similar patterns were seen in 311 reports for Pot_Holes and Rodent_Baiting. 311 complaints for Graffiti tended to cluster along major thoroughfares and both Tobacco_Retail and Liquor_Retail were heavily clustered in "The Loop", Chicago's downtown and main business region. Factors like Garbage_Carts, Sanitation_Complaints, Steet_Lights_All_Out, Tree_Debris permeated into the city. Police_Stations sparsely distributed into the neighborhoods of Chicago city.

Nearest Neighbor Features

# convinience to reduce length of function names.
st_c <- st_coordinates
st_coid <- st_centroid
k_num = 3

## create NN from risk factors
vars_net <- vars_net %>%
    mutate(Abandoned_Cars.nn = nn_function(st_c(st_coid(vars_net)), 
                                           st_c(abandonCars),
                                           k = k_num)) %>%
  mutate(Garbage_Carts.nn = nn_function(st_c(st_coid(vars_net)),
                                          st_c(garbageCarts),
                                          k = k_num)) %>%
  mutate(Graffiti.nn = nn_function(st_c(st_coid(vars_net)),
                                          st_c(graffiti),
                                          k = k_num)) %>%
  mutate(Liquor_Retail.nn = nn_function(st_c(st_coid(vars_net)),
                                          st_c(liquorRetail),
                                          k = k_num)) %>%
  mutate(Police_Stations.nn = nn_function(st_c(st_coid(vars_net)),
                                          st_c(policeStations),
                                          k = k_num)) %>%
  mutate(Pot_Holes.nn = nn_function(st_c(st_coid(vars_net)),
                                          st_c(potHoles),
                                          k = k_num)) %>%
  mutate(Rodent_Baiting.nn = nn_function(st_c(st_coid(vars_net)),
                                          st_c(rodentBaiting),
                                          k = k_num)) %>%
  mutate(Sanitation_Complaints.nn = nn_function(st_c(st_coid(vars_net)),
                                          st_c(sanitationComplaints),
                                          k = k_num)) %>%
  mutate(Street_Lights_All_Out.nn = nn_function(st_c(st_coid(vars_net)),
                                          st_c(stLightsOut),
                                          k = k_num)) %>%
  mutate(Tobacco_Retail.nn = nn_function(st_c(st_coid(vars_net)),
                                          st_c(tobaccoRetail),
                                          k = k_num)) %>%
    mutate(Tree_Debris.nn = nn_function(st_c(st_coid(vars_net)),
                                          st_c(treeDebris),
                                          k = k_num)) %>%
    mutate(Tree_Trim.nn = nn_function(st_c(st_coid(vars_net)),
                                          st_c(treeTrim),
                                          k = k_num))
## Visualize the NN feature
vars_net.long.nn <- 
  dplyr::select(vars_net, ends_with(".nn")) %>%
    gather(Variable, value, -geometry)

vars <- unique(vars_net.long.nn$Variable)
mapList <- list()

for(i in vars){
  mapList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(vars_net.long.nn, Variable == i), aes(fill = value), colour=NA) +
      scale_fill_viridis(name="", option = 'C') +
      labs(title=i) +
      mapTheme()}
# do.call(grid.arrange,c(mapList,  ncol = 3))

Figure 3.2 | Nearest Neighbor Risk factors by fishnet

Average nearest neighbor features were created by converting vars_net grid cells to centroid points then measuring to 3 risk factor points. This helped to understand the adjacency and spatial relationships between every grid and each risk factor.

Join NN feature to our fishnet

Since the counts were aggregated to each cell by uniqueID we can use that to join the counts to the fishnet.

# Measure distance to one point - Loop - Chicago's Central Business District
loopPoint <- filter(neighborhoods, name == "Loop") %>%
  st_centroid()

vars_net$loopDistance = 
  st_distance(st_centroid(vars_net), loopPoint) %>%
  as.numeric()

## important to drop the geometry from joining features
final_net <-
  left_join(crime_net, st_drop_geometry(vars_net), by="uniqueID")

Join in areal data

Using spatial joins to join centroids of fishnets to polygon for neighborhoods and districts.

final_net <-
  st_centroid(final_net) %>%
    st_join(dplyr::select(neighborhoods, name), by = "uniqueID") %>%
    st_join(dplyr::select(policeDistricts, District), by = "uniqueID") %>%
      st_drop_geometry() %>%
      left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
      st_sf() %>%
  na.omit()

Exploring the spatial process of theft from building

Local Moran's I for fishnet grid cells

## generates warnings from PROJ issues
## {spdep} to make polygon to neighborhoods... 
final_net.nb <- poly2nb(as_Spatial(final_net), queen=TRUE)
## ... and neighborhoods to list of weights
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)

## see ?localmoran
local_morans <- localmoran(final_net$countTheftBldg18, final_net.weights) %>%
  as.data.frame()

# join local Moran's I results to fishnet
final_net.localMorans <- 
  cbind(local_morans, 
        as.data.frame(final_net)) %>% 
  st_sf() %>%
  dplyr::select(Theft_From_Bldg_Count = countTheftBldg18, 
                Local_Morans_I = Ii, 
                P_Value = `Pr(z > 0)`) %>%
  mutate(Significant_Hotspots = ifelse(P_Value <= 0.001, 1, 0)) %>%
  gather(Variable, Value, -geometry)

vars <- unique(final_net.localMorans$Variable)
varList <- list()

for(i in vars){
  varList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(final_net.localMorans, Variable == i), 
              aes(fill = Value), colour=NA) +
      scale_fill_viridis(name="", option = 'C') +
      labs(title=i) +
      mapTheme(title_size = 14) + theme(legend.position="bottom")}

do.call(grid.arrange,c(varList, ncol = 4, top = "Local Morans I statistics, Theft From Building"))

$Figure 4 | Local Morans I statistics - theft from building$

Figure 4 | Local Morans I statistics - theft from building

Figure 4 described the local spatial process of theft from building, in which relatively high values of Local_Morans_I represented strong and statistically significant evidence of local clustering. Chicago downtown region featured high counts of theft from building incidents, relatively high I values, low p-values and a series of significant hotspots. The outcomes indicated that there might existed local clustring in that region.

Correlations

# significant cluster
final_net <-
  final_net %>%
  mutate(theftBldg18.isSig = 
           ifelse(local_morans[,5] <=0.0000001, 1, 0)) %>%
  mutate(theftBldg18.isSig.dist =
           nn_function(st_c(st_coid(final_net)),
                       st_c(st_coid(
                         filter(final_net, theftBldg18.isSig == 1))), 1))
correlation.long <-
  st_drop_geometry(final_net) %>%
    dplyr::select(-uniqueID, -cvID, -loopDistance, -name, -District) %>%
    gather(Variable, Value, -countTheftBldg18)

correlation.cor <-
  correlation.long %>%
    group_by(Variable) %>%
    summarize(correlation = cor(Value, countTheftBldg18, use = "complete.obs"))

ggplot(correlation.long, aes(Value, countTheftBldg18)) +
  geom_point(size = 0.1) +
  geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
            x=-Inf, y=Inf, vjust = 1.5, hjust = -.1) +
  geom_smooth(method = "lm", se = FALSE, colour = "#ee8138") +
  facet_wrap(~Variable, ncol = 2, scales = "free") +
  plotTheme()

Figure 5 | Counts of theft and risk factors correlations

Figure 5 organized count and nearest neighbor (nn) correlations side-by-side. Liquor_Retail, Tobacco_Retail, and countTheftBldg18.isSig had comparatively higher correlations with countTheftBldg18 (r > 0.6) out of all features. Features with correlation coefficient of about 0 might be removed from the model, for example, Garbage_Carts.

Accuracy & Generalizability

Histogram - distributions of theft from building

final_net %>%
  ggplot(aes(countTheftBldg18)) +
    geom_histogram(bins = 50, colour = "black", fill = "#FED725FF") +
    labs(title = 'Distribution of theft from building', subtitle = "crime incident counts",
         x = "Theft from Building", y = "Count")

Figure 6 | Histogram - Theft from Building

The right-skewed shape of the distribution of countTheftBldg18 indicated that a Poisson regression model could better fit the data instead of an OLS model. After all, it was not unreasonable for most cells to contain quite small or zero counts of theft.

Cross Validation

This chunk of code went over the cross validation process for both datasets with or without spatial consideration (theftBldg18.isSig).

reg.vars <- c("Abandoned_Cars.nn", "Garbage_Carts.nn", "Graffiti.nn", "Liquor_Retail.nn",
              "Police_Stations.nn", "Pot_Holes.nn", "Rodent_Baiting.nn", "Sanitation_Complaints.nn",
              "Street_Lights_All_Out.nn", "Tobacco_Retail.nn", "Tree_Debris.nn", "Tree_Trim.nn", 
              "loopDistance")

reg.ss.vars <- c("Abandoned_Cars.nn", "Garbage_Carts.nn", "Graffiti.nn", "Liquor_Retail.nn",
              "Police_Stations.nn", "Pot_Holes.nn", "Rodent_Baiting.nn", "Sanitation_Complaints.nn",
              "Street_Lights_All_Out.nn", "Tobacco_Retail.nn", "Tree_Debris.nn", "Tree_Trim.nn", 
              "loopDistance", "theftBldg18.isSig", "theftBldg18.isSig.dist")

crossValidate <- function(dataset, id, dependentVariable, indVariables) {
  
  allPredictions <- data.frame()
  cvID_list <- unique(dataset[[id]])
  
  for (i in cvID_list) {
    
    thisFold <- i
    cat("This hold out fold is", thisFold, "\n")
    
    fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>% 
      dplyr::select(id, geometry, indVariables, dependentVariable)
    fold.test  <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>% 
      dplyr::select(id, geometry, indVariables, dependentVariable)
    
    regression <-
      glm(countTheftBldg18 ~ ., family = "poisson", 
          data = fold.train %>% 
            dplyr::select(-geometry, -id))
    
    thisPrediction <- 
      mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
    
    allPredictions <-
      rbind(allPredictions, thisPrediction)
    
  }
  return(st_sf(allPredictions))
}  

reg.cv <- crossValidate(
  dataset = final_net,
  id = "cvID",
  dependentVariable = "countTheftBldg18",
  indVariables = reg.vars) %>%
    dplyr::select(cvID = cvID, countTheftBldg18, Prediction, geometry)

reg.ss.cv <- crossValidate(
  dataset = final_net,
  id = "cvID",
  dependentVariable = "countTheftBldg18",
  indVariables = reg.ss.vars) %>%
    dplyr::select(cvID = cvID, countTheftBldg18, Prediction, geometry)

reg.spatialCV <- crossValidate(
  dataset = final_net,
  id = "name",
  dependentVariable = "countTheftBldg18",
  indVariables = reg.vars) %>%
    dplyr::select(cvID = name, countTheftBldg18, Prediction, geometry)

reg.ss.spatialCV <- crossValidate(
  dataset = final_net,
  id = "name",
  dependentVariable = "countTheftBldg18",
  indVariables = reg.ss.vars) %>%
    dplyr::select(cvID = name, countTheftBldg18, Prediction, geometry)

reg.summary <-
  rbind(
  mutate(reg.cv, 
         Error = Prediction - countTheftBldg18, 
         Regression = "Random k-fold CV: Just Risk Factors"),
    mutate(reg.ss.cv, 
         Error = Prediction - countTheftBldg18, 
         Regression = "Random k-fold CV: Spatial Process"),
      mutate(reg.spatialCV, 
         Error = Prediction - countTheftBldg18, 
         Regression = "Spatial LOGO-CV: Just Risk Factors"),
        mutate(reg.ss.spatialCV, 
         Error = Prediction - countTheftBldg18, 
         Regression = "Spatial LOGO-CV: Spatial Process")
  ) %>% 
  st_sf()

error_by_reg_and_fold <-
  reg.summary %>%
    group_by(Regression, cvID) %>%
    summarize(Mean_Error = mean(Prediction - countTheftBldg18, na.rm = T),
              MAE = mean(abs(Mean_Error), na.rm = T),
              SD_MAE = mean(abs(Mean_Error), na.rm = T)) %>%
  ungroup()

error_by_reg_and_fold %>%
  ggplot(aes(MAE)) +
    geom_histogram(bins = 30, colour = "black", fill = "#FED725FF") +
    facet_wrap(~Regression) +
    geom_vline(xintercept = 0) + scale_x_continuous(breaks = seq(0, 8, by = 1)) +
    labs(title = 'Distribution of MAE', subtitle = "k-fold cross validation vs. LOGO-CV",
         x = "Mean Absoulte Error", y = "Count")

Figure 7.1 | Model errors distribution

For random k-fold cross-validation, the model taking spatial process into account outperformed the one with just risk factors as observed from a higher count of grids with MAE in [0,1]. For spatial LOGO-cross-validation, some grids appeared to have higher MAEs. We should map out the errors to see where those errors stood out.

st_drop_geometry(error_by_reg_and_fold) %>%
  group_by(Regression) %>%
    summarize(Mean_MAE = round(mean(MAE), 2),
              SD_MAE = round(sd(MAE), 2)) %>%
  kable(caption = "MAE by regression") %>%
    kable_styling("striped", full_width = F) %>%
            row_spec(2, color = "white", background = '#7303c0') %>%
            row_spec(4, color = "white", background = '#7303c0')

MAE by regression
Regression	Mean_MAE	SD_MAE
Randon k-fold CV: Just Risk Factors	0.60	0.53
Randon k-fold CV: Spatial Process	0.50	0.44
Spatial LOGO-CV: Just Risk Factors	1.88	3.20
Spatial LOGO-CV: Spatial Process	1.30	1.97

Table 1 | MAE & standard deviation MAE by Regression

The result confirmed our view that the Spatial Process features improved the model, since the mean errors were lower for spatial process model that those for the original model.

error_by_reg_and_fold %>%
  filter(str_detect(Regression, "LOGO")) %>%
  ggplot() +
    geom_sf(aes(fill = MAE)) +
    facet_wrap(~Regression) +
    scale_fill_viridis() +
    labs(title = "Theft from Building errors by LOGO-CV Regression") +
    mapTheme() + theme(legend.position="bottom")

Figure 7.2 | Spatial visualization of errors for theft from building

Figure 7.2 visualized the LOGO-CV errors spatially. The largest errors were situated in the hotspot locations, where the local spatial process was not accounted for.

Generalizability by neighborhood context

We retrieved 2018 census data for Chicago from American Census Bureau.

tracts18 <- 
  get_acs(geography = "tract", variables = c("B01001_001E","B01001A_001E"), 
          year = 2018, state=17, county=031, geometry=T) %>%
  st_transform('ESRI:102271')  %>% 
  dplyr::select(variable, estimate, GEOID) %>%
  spread(variable, estimate) %>%
  rename(TotalPop = B01001_001,
         NumberWhites = B01001A_001) %>%
  mutate(percentWhite = NumberWhites / TotalPop,
         raceContext = ifelse(percentWhite > .5, "Majority_White", "Majority_Non_White")) %>%
  .[neighborhoods,]

reg.summary %>% 
  filter(str_detect(Regression, "LOGO")) %>%
    st_centroid() %>%
    st_join(tracts18) %>%
    na.omit() %>%
      st_drop_geometry() %>%
      group_by(Regression, raceContext) %>%
      summarize(mean.Error = mean(Error, na.rm = T)) %>%
      spread(raceContext, mean.Error) %>%
      kable(caption = "Mean Error by neighborhood racial context") %>%
        kable_styling("striped", full_width = F) %>%
        row_spec(2, color = "white", background = '#7303c0')

Mean Error by neighborhood racial context
Regression	Majority_Non_White	Majority_White
Spatial LOGO-CV: Just Risk Factors	0.0466187	-0.0258419
Spatial LOGO-CV: Spatial Process	-0.1465534	0.1583902

Table 2 | Model error by neighborhood racial context

The original model on average, over-predicted in Majority_Non_White neighborhoods and under-predicted in Majority_White neighborhoods. On the contrary, the Spatial Process model under-predicted the Majority_Non_White neighborhoods and over-predicted in Majority_White neighborhoods. One more thing worth mentioning was that the Spatial Process model reported higher errors overall, but a smaller difference in errors across neighborhood context. The cause of the difference between the performance of the two models might exist in the distributions of local racial groups, especially in the hotspot locations.

Kernel density Model

This chunk of code prepare the common model based on kernel density.

theftBldg_ppp <- as.ppp(st_coordinates(theftBldg18), W = st_bbox(final_net))
theftBldg_KD <- spatstat::density.ppp(theftBldg_ppp, 1000)

as.data.frame(theftBldg_KD) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
  aggregate(., final_net, mean) %>%
   ggplot() +
     geom_sf(aes(fill=value)) +
     geom_sf(data = sample_n(theftBldg18, 1500), size = .25) +
     scale_fill_viridis(name = "Density", option = 'C') +
     labs(title = "Kernel density of 2018 thefts from building") +
     mapTheme()

theftBldg19 <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2019/w98m-zvie") %>% 
  filter(Primary.Type == "THEFT" & Description == "FROM BUILDING") %>%
  mutate(x = gsub("[()]", "", Location)) %>%
  separate(x,into= c("Y","X"), sep=",") %>%
  mutate(X = as.numeric(X), Y = as.numeric(Y)) %>% 
  na.omit %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform('ESRI:102271') %>% 
  distinct() %>% .[fishnet,]

theftBldg_KDE_sf <- as.data.frame(theftBldg_KD) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
  aggregate(., final_net, mean) %>%
  mutate(label = "Kernel Density",
         Risk_Category = ntile(value, 100),
         Risk_Category = case_when(
           Risk_Category >= 90 ~ "90% to 100%",
           Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
           Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
           Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
           Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
  cbind(
    aggregate(
      dplyr::select(theftBldg19) %>% mutate(theftBldgCount = 1), ., sum) %>%
    mutate(theftBldgCount = replace_na(theftBldgCount, 0))) %>%
  dplyr::select(label, Risk_Category, theftBldgCount)

Risk predictions

This chunk of code prepare the risk prediction model.

theftBldg_risk_sf <-
  filter(reg.summary, Regression == "Spatial LOGO-CV: Spatial Process") %>%
  mutate(label = "Risk Predictions",
         Risk_Category = ntile(Prediction, 100),
         Risk_Category = case_when(
           Risk_Category >= 90 ~ "90% to 100%",
           Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
           Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
           Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
           Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
  cbind(
    aggregate(
      dplyr::select(theftBldg19) %>% mutate(theftBldgCount = 1), ., sum) %>%
      mutate(theftBldgCount = replace_na(theftBldgCount, 0))) %>%
  dplyr::select(label,Risk_Category, theftBldgCount)

Integration & Comparison

rbind(theftBldg_KDE_sf, theftBldg_risk_sf) %>%
  na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
  ggplot() +
    geom_sf(aes(fill = Risk_Category), colour = NA) +
    geom_sf(data = sample_n(theftBldg18, 3000), size = .25, colour = "black") +
    facet_wrap(~label, ) +
    scale_fill_viridis(discrete = TRUE, option = 'C') +
    labs(title="Comparison of Kernel Density and Risk Predictions",
         subtitle="2018 thefts from building risk predictions; 2019 thefts from building") +
    mapTheme()

Figure 7 | Comparison of Kernel Density and Risk Predictions

For each grid cell and model type (kernel density vs. risk prediction), there was now an associated risk category and 2019 theft from building count. A strongly fit model should show that the highest risk category is uniquely targeted to places with a high density of theft from building points, and as shown in Figure 8, both models seemed to capture the theft from bulidings incidents well.

rbind(theftBldg_KDE_sf, theftBldg_risk_sf) %>%
  st_set_geometry(NULL) %>% na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category) %>%
  group_by(label, Risk_Category) %>%
  summarize(countTheftBldg18 = sum(Value)) %>%
  ungroup() %>%
  group_by(label) %>%
  mutate(Rate_of_test_set_crimes = countTheftBldg18 / sum(countTheftBldg18)) %>%
    ggplot(aes(Risk_Category,Rate_of_test_set_crimes)) +
      geom_bar(aes(fill=label), position="dodge", stat="identity") +
      scale_fill_viridis(discrete = TRUE) +
      labs(title = "Risk prediction vs. Kernel density, 2019 thefts from building") +
      theme(axis.text.x = element_text(angle = 45, vjust = 0.5))

Figure 8 | Risk prediction vs. Kernel density, 2019 thefts from building

As quoted from the textbook, "a well fit model should show that the risk predictions capture a greater share of 2018 burglaries in the highest risk category relative to the Kernel density:" the risk prediction model narrowly edged out the Kernel Density in the highest risk category, as well as the three categoris counted from the lowest although in the category of 70% to 89%, the kernel density model outperformed the risk prediction model.

Discussion

In the project, we built a geospatial risk prediction model that took in the observed/reported theft from building incidents from the past, and tested its accuracy and generalizability. In random K-fold and the more preservative spatial LOGO cross-validation models, the mean errors remained low, with the lowest about 0.5 for spatial process (k-fold cv). Also, the difference in accuracy among majority-non-white and majority-white neighborhoods were low (about 0.07 for risk factors only and about 0.3 for spatial process).

We would recommend the use of this risk prediction model to assist the guidance and planning of public safety power. The investment in such a predictive tool would help reduce future loss from crime incidents, espeically theft from building - advanced identification system, regular inspection on equipments, better zonal planning and site planning for locations with large population flow daily. The accuracy and generalizability of this model were acceptable, and outweighed the hotspot model to some extent by its ability to predict crime incidents before they really took place.

Data sources

Chicago City Data Portal: https://data.cityofchicago.org/

American Census Bureau - ACS: https://www.census.gov/programs-surveys/acs

Public Policy Analytics: Code & Context for Data Science in Government: https://urbanspatial.github.io/PublicPolicyAnalytics/