1. Introduction

In the market of real estates, the fluctuation of home prices often tangled with tons of possible factors related to public policy, environment, economic development, etc. A well-tuned prediction model for housing market prices can equip the stakeholders like Zillow with competency and reliability in market planning. This project aimed at providing a prediction model for housing prices taking interesting predictors of "local intelligence" - possible variables correlated with home prices based on existing sales folios from 2018 to 2020 in Miami and Miami Beach, Florida as our training data set with Ordinary Least Square (OLS) regression done in RStudio. The OLS regression mainly focuses on the linear relationship between our predictors and the sale price - the set of values we would like to predict, and there exist a great many advanced models with lower errors than ours indeed. Data sources for this project include Miami-Dade county Open Data platform, Open Street Map, and, the US Census Bureau. We developed and tested two OLS regression models, "baseline" and "neighborhood effects", with all the same predictors but one, neighborhood name, taken into account by the latter model.

The beginning of the journey is to search for "good" predictors - which might includes geographic/spatial structures, demographic statistics, internal characteristics of the housing units, etc. The difficulties lie in the variables and the model themselves, since we trained the models based on limited known features and relied on the model to predict the unknown. Is the model generablizable? That is to say, can it be applied to features outside the training data?

The final chosen model for housing price prediction followed the baseline model with lower prediction errors and better generalizability. Please visit: for comprehensive codes for this project (including the data wrangling part).

library(tidyverse)
library(tidycensus)
library(sf)
library(spdep)
library(caret)
library(ckanr)
library(FNN)
library(grid)
library(gridExtra)
library(ggcorrplot)
library(jtools)
library(osmdata)
library(geosphere)
library(data.table)
library(kableExtra) 
library(stargazer)
options(scipen=999)
options(tigris_class = "sf")
options(tigris_use_cache = TRUE)
census_api_key("61f5998ae3ee7a7ab62b29f55769a3b864515ac4", overwrite = TRUE, install = TRUE)
readRenviron("~/.Renviron")

# functions
mapTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle=element_text(face="italic"),
    plot.caption=element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2)
  )
}

plotTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle = element_text(face="italic"),
    plot.caption = element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_line("grey80", size = 0.1),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2),
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(size=12),
    axis.title = element_text(size=12),
    axis.text = element_text(size=10),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "black", face = "italic"),
    legend.text = element_text(colour = "black", face = "italic"),
    strip.text.x = element_text(size = 14)
  )
}

palette5 <- c("#25CB10", "#5AB60C", "#8FA108",   "#C48C04", "#FA7800")

qBr <- function(df, variable, rnd) {
  if (missing(rnd)) {
    as.character(quantile(round(df[[variable]],0),
                          c(.01,.2,.4,.6,.8), na.rm=T))
  } else if (rnd == FALSE | rnd == F) {
    as.character(formatC(quantile(df[[variable]]), digits = 3),
                          c(.01,.2,.4,.6,.8), na.rm=T)
  }
}

q5 <- function(variable) {as.factor(ntile(variable, 5))}

nn_function <- function(measureFrom,measureTo,k) {
  measureFrom_Matrix <- as.matrix(measureFrom)
  measureTo_Matrix <- as.matrix(measureTo)
  nn <-   
    get.knnx(measureTo, measureFrom, k)$nn.dist
    output <-
      as.data.frame(nn) %>%
      rownames_to_column(var = "thisPoint") %>%
      gather(points, point_distance, V1:ncol(.)) %>%
      arrange(as.numeric(thisPoint)) %>%
      group_by(thisPoint) %>%
      summarize(pointDistance = mean(point_distance)) %>%
      arrange(as.numeric(thisPoint)) %>% 
      dplyr::select(-thisPoint) %>%
      pull()
  
  return(output)  
}

load('beforeNeighbor.RData')

2. Data Wrangling

The data in our training model consists of two parts, dependent variables and independent variables. In this case, the dependent variable is house price in Miami, which is what we want to predict. The independent variables are some predictors may influence the fluctuation of house price.

2.1 Dependent Variables

Miami home sale price data is from Zillow, including 2627 items. The data is across the district of Miami and Miami Beach. The following map below illustrates the home sale price in Miami. It indicates that house value at Miami Beach is higher than Miami on the whole. And high value houses cluster nearby by coast.

ggplot() + 
  geom_sf(data = nhood_miami, fill = "azure") +
  geom_sf(data = nhood_miami_beach, fill = "azure") +
  geom_sf(data = miami.train, aes(colour = q5(SalePrice)), show.legend = "point", size = .75) + 
  scale_colour_manual(values = palette5, 
                      labels = qBr(miami.train, "SalePrice"),
                      name = "Quantile\nBreaks") + 
  labs(title = "Home Sale Price, Miami and Miami Beach",
        subtitle = "Real Dollars",
       caption = "Data resource: Zillow") + 
  mapTheme()

The distribution of home price per square feet maintain a high degree of consistency with home sale price.

ggplot() + 
  geom_sf(data = nhood_miami, fill = "azure") +
  geom_sf(data = nhood_miami_beach, fill = "azure") +
  geom_sf(data = miami.train, aes(colour = q5(PricePerSqr)), show.legend = "point", size = .75) + 
  scale_colour_manual(values = palette5, 
                      labels = qBr(miami.train, "PricePerSqr"),
                      name = "Quantile\nBreaks") + 
  labs(title = "Home Price Per Square Feet, Miami and Miami Beach",
        subtitle = "Real Dollars/Square Feet",
       caption = "Data resource: Zillow") + 
  mapTheme()

2.2 Independent Variables

We selected over 30 kinds of independent variables divided into four parts, internal characteristics, public services and safety, demographic characteristics and spatial structure. Below, we made summary statistics of variables (category variables are excluded).

Internal characteristics majority base on Zillow dataset, including descriptive information about house units, such as number of bedrooms and stories and whether having a pool. These aspects are essential for buyer to judge the condition and value of the house unit.

  • Sale year
  • City of the property
  • Mailing city
  • Zoning
  • Lot Size
  • Number of bedrooms
  • Number of bath
  • Stories
  • Pool
  • Elevator
  • Living square feet
  • House age
miami.summary<-
  miami.train%>%
st_set_geometry(NULL)

avatar Public services and safety are distance from home to amenities and assaults, wrangled from OpenStreetMap POI and Miami OpenData. The value contributed by amenities and public services varies based on family's preferences. In a family with kids, school is very important consideration. Workers without private vehicle are more likely to choose a house near metro or bus stations. 'amen_edu_nn3' means the average distance from the house to the 3 nearest educational place. Several variables are named after this methods.

  • Education (school, college, University)
  • Shop (mall, supermarket, department_store)
  • Office (financial, government)
  • Urban parks
  • Hospital
  • Parking facilities
  • Library
  • Fire station
  • Metro station
  • Distance to the coast
  • Sexual offender and predator
avatar

avatar

Demographic characteristics are wrangled from ACS 2018. We link the house unit with the tracts they spatially located in to have a general understanding of of the social environment. In some segregated urban area, house value are influenced by groups, social class and education background.

  • Medium gross rent
  • Medium household income
  • Percent of White
  • Percent of African Americans
  • Percent of Native Americans
  • Percent of Asian
  • Percent of vacant
  • Percent of Bachelors 
  • Percent of Poverty
  • Population density (total population/area of tract)
avatar

avatar

Everything is related to everything else, but near things are more related to each other(Tobler's First Law of Geography??.We can never neglect spatial correlation when doing urban exploration. Lag price and neighborhood as considered as spatial structure.

  • Lag price (Average price per feet of 5 nearest homes)
  • Neighborhood
avatar

avatar

2.3 Correlation Matrix of Continuous Variables

Correlation matrix aims to find out the relationship between every pair of variables. If the correlation of two variables is close to 1 or -1, we think there can be a collinearity phenomenon, which would influence the behavior of our model. Judging from the matrix, number of bath may have collinearity with living square feet and number of bedrooms. It makes sense with our common knowledge of house layout. Besides, collinearity also exists between the distance to the hospital and sex offender, medium house income and medium gross rent, percent of white and percent of African Americans. When deciding the final predictors in our model, we consider to drop some variables with collinearity or cast some to category variables.

miami.cor <- subset(miami.train, select = c('LotSize', 'Bed', 'Bath','Stories','LivingSqFt','Age',
                                            'amen_edu_nn5','shop_nn5', 'office_nn1', 'DistHospital', 
                                            'DistParking', 'DistFireSta', 'DistLibrary', 'CoastDist','DistSexOP',
                                            'MedGrossRent', 'MedHHInc.inf', 'pctWhite','pctAfrAm','pctAmNa','Density',
                                            'pctAsian', 'pctVacantLiveElwh', 'pctBachelors','pctPoverty','lagPrice'))
                                            
miami.cor <- 
  transform(miami.cor, Density = as.numeric(Density))

miami.cor <- 
  miami.cor %>% 
  st_set_geometry(NULL)

cormatrix <- cor(miami.cor, use = "pairwise.complete.obs")

ggcorrplot(cor(miami.cor, use = "pairwise.complete.obs"),
           colors = c("#25CB10", "white", "#FA7800"), 
           title="Correlation Matrix of Continuous Variables",
           ggtheme=mapTheme)

2.4 Quantitative And Spatially Correlation Exploration

Series of scatter plots are made to illustrate price as a function of continuous variables. Below, we choose four charts showing interesting and quite clear correlation with house value. The result of price and percent of vacant house is not that intuitive. One explanation is that people who can afford a luxury house usually possess several properties. So they may not always live at the same place.

st_drop_geometry(miami.train) %>% 
  filter(SalePrice <= 2000000) %>%
  dplyr::select('SalePrice','DistParking','CoastDist','Density','pctVacantLiveElwh') %>%
  gather(Variable, Value, -SalePrice) %>% 
  ggplot(aes(Value, SalePrice)) +
  geom_point(size = .5,alpha=0.2) + geom_smooth(method = "lm", se=F, colour = "#FA7800") +
  facet_wrap(~Variable, ncol = 2, scales = "free") +
  labs(title = "Price as a function of continuous variables") +
  plotTheme()

Three maps were set to compare with house value distribution. Through this way, we want to find out some inner spatial correlation between urban context and house price.

First map illustrates the distribution of education facilities. In general, Miami citizens have good accessibility to schools. Most houses are within 500m distance to school. The farthest distance from house to school is at around 2km. Considering most school supply with bus service and quite a lot high school students drive to school, it's not a very long trip. Consequently, comparing with the map on the right, the house value are not significantly influenced by the distance to school, benefiting from quite scattered educational resources.

grid.arrange(  
ggplot() + 
    geom_sf(data = nhood_miami, fill = "azure") +
    geom_sf(data = nhood_miami_beach, fill = "azure") +
    stat_density2d(data = data.frame(st_coordinates(amen_edu)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
    scale_fill_gradient(low = "#C6FFDD", high = "#f7797d", name = "Density") +
    scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
    geom_sf(data = miami.train, aes(colour = q5(amen_edu_nn5)), show.legend = "point", size = .75) + 
    scale_colour_manual(values = palette5, 
                      labels = qBr(miami.train, "amen_edu_nn5"),
                      name = "Quantile\nBreaks") + 
    labs(title = "Distance From Home to 5 Nearest Educational Place",
        subtitle = "Meters",
       caption = "Data resource: OpenStreetMAp") + 
    mapTheme(),

ggplot() + 
  geom_sf(data = nhood_miami, fill = "azure") +
  geom_sf(data = nhood_miami_beach, fill = "azure") +
  geom_sf(data = miami.train, aes(colour = q5(SalePrice)), show.legend = "point", size = .75) + 
  scale_colour_manual(values = palette5, 
                      labels = qBr(miami.train, "SalePrice"),
                      name = "Quantile\nBreaks") + 
  labs(title = "Home Sale Price",
        subtitle = "Real Dollars",
       caption = "Data resource: Zillow") + 
  mapTheme(),
ncol=2)

As a city by the sea, Miami is a well-known tourism destination for its golden coast. Maps below demonstrates a strong negative correlation between the distance to coast and home sale price. Houses close to the coast are almost all over 1 million dollars.

grid.arrange( 
ggplot() + 
    geom_sf(data = nhood_miami, fill = "azure") +
    geom_sf(data = nhood_miami_beach, fill = "azure") +
    geom_sf(data = miami.train, aes(colour = q5(CoastDist)), show.legend = "point", size = .75) + 
    scale_colour_manual(values = palette5, 
                      labels = qBr(miami.train, "CoastDist"),
                      name = "Quantile\nBreaks") + 
    labs(title = "Distance From Home to Coast",
        subtitle = "Meters",
       caption = "Data resource: OpenStreetMAp") + 
    mapTheme(),

ggplot() + 
  geom_sf(data = nhood_miami, fill = "azure") +
  geom_sf(data = nhood_miami_beach, fill = "azure") +
  geom_sf(data = miami.train, aes(colour = q5(SalePrice)), show.legend = "point", size = .75) + 
  scale_colour_manual(values = palette5, 
                      labels = qBr(miami.train, "SalePrice"),
                      name = "Quantile\nBreaks") + 
  labs(title = "Home Sale Price",
        subtitle = "Real Dollars",
       caption = "Data resource: Zillow") + 
  mapTheme(),
ncol=2)

From the map of house age distribution, we found old houses are well blended with new houses in Miami. Dominated by the influence of the coast, it seems that both of the old and new houses were sold at a relatively higher price. It's hard to tell the correlation with a glance of eye.

grid.arrange(
ggplot() + 
    geom_sf(data = nhood_miami, fill = "azure") +
    geom_sf(data = nhood_miami_beach, fill = "azure") +
    geom_sf(data = miami.train, aes(colour = q5(Age)), show.legend = "point", size = .75) + 
    scale_colour_manual(values = palette5, 
                      labels = qBr(miami.train, "Age"),
                      name = "Quantile\nBreaks") + 
    labs(title = "House Age Distribition",
        subtitle = "Year",
       caption = "Data resource: Zillow") + 
    mapTheme(),

ggplot() + 
  geom_sf(data = nhood_miami, fill = "azure") +
  geom_sf(data = nhood_miami_beach, fill = "azure") +
  geom_sf(data = miami.train, aes(colour = q5(SalePrice)), show.legend = "point", size = .75) + 
  scale_colour_manual(values = palette5, 
                      labels = qBr(miami.train, "SalePrice"),
                      name = "Quantile\nBreaks") + 
  labs(title = "Home Sale Price",
        subtitle = "Real Dollars",
       caption = "Data resource: Zillow") + 
  mapTheme(),
ncol=2)

3. Methods

Datasets downloaded were tested for co-linearity by a series of correlation matrices generated by RStudio, as described above in the data wrangling and organization process.

After cleaning the data we obtained from both the provided dataset of home sales and from the open data platforms, we have a wholesome dataset of housing units with selected predictors for training. Note: the housing units left for prediction competition were filtered out before training.

Training

We randomly partitioned the integrated dataset into two parts: 60% training and 40% testing and generated two kinds of regressions, the baseline regression without neighborhood data and the one used to explore the effects of neighborhood.

Validation

The validation process included single test (with the randomly assigned data for testing) and cross validation test. The workflow of the single test was just applying what the regression model was "taught" - by fiting the training data. We would like to see if the trained models work as expected on the "unfamiliar" dataset left for testing. For cross validation, the data partitioning process was more complex: we separated the whole dataset into 100 folds and extracted several amount of data from each fold for testing, and used the remaining parts for training. We also tested the models on their generalizability - applying to different groups (by race, income, etc.)

What specific statistics we would like to see for the validation process?

Accuracy. Of course. Generally we would like to minimize the errors, or the deviation from the correct value. The following statistics were emphasized in this project:

MAE: Mean Absolute Error (MAE) is calculated by taking the mean of the absolute values of subtracting each predicted housing prices from the corresponding given prices in the testing dataset.

MAPE: Mean Absolute Percent Error (MAPE) is the mean of Absolute Percent Error of each validation pairs - the difference between predicted and observed prices on a percentage basis.

MSE: Mean Squared Error (MSE) - the averaged deviation from the correct values for the whole dataset.

RMSE: Root Mean Squared Error (RMSE), the square root of MSE, at the same scale as the predicted variable.

R-squared - how much of the change in the housing price can be explained by changes in the predictors, i.e., is the model meaningful?

4. Results

We splited our integrated dataset based on a pre-defined column "toPredict" to distinct the data for modeling and for prediction purposes. - housing units with recorded sale prices for training and validation ("toPredict = 0") - housing units without recorded sale prices, for predictions ("toPredict = 1")

miami.organize.n <- st_join(miami.organize, nhoods) 
# split training and predicting set (pre-defined)
miami.train <- filter(miami.organize, toPredict == 0)
miami.predict <- filter(miami.organize, toPredict == 1)
miami.train.n <- filter(miami.organize.n, toPredict == 0)
miami.predict.n <- filter(miami.organize.n, toPredict == 1)

# set random seed
set.seed(12345)

# get index for training sample
inTrain <- createDataPartition(
  y = paste(miami.train$Elevator, miami.train$CoastDist.cat, miami.train.n$NAME, miami.train$Property.City,
            miami.train$Stories.cat, miami.train$Pool, miami.train$Age.cat, miami.train$Zoning.cat),
  p = .60, list = FALSE)

# split data into training and test
miami.training <- miami.train[inTrain,] 
miami.test     <- miami.train[-inTrain,]  
miami.training.n <- miami.train.n[inTrain,] 
miami.test.n     <- miami.train.n[-inTrain,]  

# Regression - baseline
reg.bl <- lm(SalePrice ~ ., data = st_drop_geometry(miami.training) %>% 
             dplyr::select (SalePrice, Density, pctAfrAm, pctAsian,DistSexOP, 
                           pctBachelors, office_nn5,
                           saleYear,Stories.cat,
                           lagPrice,  DistLibrary, 
                           CoastDist, Zoning.cat, Mailing, TOD, Pool,Elevator,Property.City))
# Regression - neighborhood 
reg.nhoods <- lm(SalePrice ~ ., data = st_drop_geometry(miami.training.n) %>% 
             dplyr::select (NAME, SalePrice, Density, pctAfrAm, pctAsian,DistSexOP, 
                           pctBachelors, office_nn5,
                           saleYear,Stories.cat,
                           lagPrice,  DistLibrary, 
                           CoastDist, Zoning.cat, Mailing, TOD, Pool,Elevator,Property.City))

The following table collected the regression outputs - coefficients, p-values, and the R-square values for both regression models. This table might be a bit long than as required in professional report quality but I have not figured out a way to make the table foldable or draggable in viewing.

# Regression results output
library(sjPlot)
library(sjmisc)
library(sjlabelled)

tab_model(reg.bl, reg.nhoods, auto.label = TRUE, show.ci = FALSE, show.intercept = FALSE, 
          title = 'Summary of Training Results',
          dv.labels = c("Baseline Regression", "Neighborhood Effects"),
          string.pred = "Coefficient")
Summary of Training Results
  Baseline Regression Neighborhood Effects
Coefficient Estimates p Estimates p
Density -216.17 0.910 -4355.53 0.171
pctAfrAm 236050.28 0.100 988730.07 0.059
pctAsian 3055341.97 0.114 -2172864.02 0.517
DistSexOP -19.51 0.400 -12.89 0.673
pctBachelors 1706744.90 0.011 2668868.34 0.030
office_nn5 -6.44 0.453 -32.53 0.030
saleYear -39074.44 0.259 -32716.99 0.354
Stories.cat [4+ Floors] 4237472.68 <0.001 3730317.71 <0.001
Stories.cat [Up to 2
Floors]
-1705226.13 <0.001 -1596652.35 <0.001
lagPrice 2101.71 <0.001 1895.52 <0.001
DistLibrary 15.65 0.700 25.04 0.673
CoastDist 14.55 0.042 40.75 0.071
Zoning.cat [ESTATES -
25000 SQFT LOT]
3570536.59 <0.001 3652126.42 <0.001
Zoning.cat [Others] -6474772.03 <0.001 -6545702.54 <0.001
Zoning.cat
[RESIDENTIAL-LIBERAL
RETAI]
-7093802.42 <0.001 -7346981.35 <0.001
Zoning.cat
[RESIDENTIAL-MEDIUM
RETAIL]
-6401379.65 <0.001 -6534783.06 <0.001
Mailing [MIAMI BEACH] -295119.71 0.503 -375266.31 0.399
Mailing [Other City] -105186.64 0.809 -169671.29 0.699
Mailing [WILMINGTON] 15693706.41 <0.001 15522752.91 <0.001
TOD [TOD] 136743.93 0.645 133322.99 0.669
Pool [non-Pool] -5359088.47 <0.001 -5369075.50 <0.001
Pool [Pool] -4702972.15 <0.001 -4742958.21 <0.001
Elevatornon-Elevator -3211470.99 <0.001 -3177977.91 <0.001
Property.City [Miami
Beach]
1044561.32 <0.001
NAME [Allapattah
Industrial District]
-583313.35 0.656
NAME [Auburndale] -550451.49 0.637
NAME [Bay Heights] -777799.63 0.505
NAME [Baypoint] 38062.63 0.973
NAME [Bayside] -666560.62 0.549
NAME [Belle Island] 534687.69 0.658
NAME [Belle Meade] -744428.61 0.501
NAME [Belle Meade West] -731070.98 0.518
NAME [Bird Grove East] -417836.51 0.720
NAME [Bird Grove West] -192090.31 0.866
NAME [Biscayne Island] 2276629.42 0.095
NAME [Biscayne Plaza] -384738.83 0.798
NAME [Brentwood] -677928.72 0.582
NAME [Buena Vista
Heights]
-867347.72 0.435
NAME [Buena Vista West] -937148.36 0.389
NAME [Citrus Grove] -235020.84 0.842
NAME [Civic Center] -490048.22 0.716
NAME [Coral Gate] -447202.89 0.698
NAME [Curtis Park] -458915.11 0.707
NAME [Douglas Park] -49108.15 0.965
NAME [East Grove] -13890.07 0.990
NAME [East Little Havana] -301750.89 0.803
NAME [Edgewater] -386439.19 0.760
NAME [Edison] -770213.36 0.473
NAME [Fair Isle] -554091.66 0.631
NAME [Flagami] -320075.32 0.785
NAME [Flora Park] -1041390.26 0.347
NAME [Grove Center] -678367.32 0.557
NAME [Hadley Park] -849625.08 0.437
NAME [Highland Park] -548960.13 0.719
NAME [Historic Buena
Vista East]
-759117.66 0.505
NAME [King Heights] -897965.25 0.415
NAME [La Pastorita] -608317.16 0.620
NAME [Latin Quarter] 122026.43 0.927
NAME [Le Jeune Gardens] -484431.26 0.757
NAME [Lemon City/Little
Haiti]
-682089.01 0.532
NAME [Liberty Square] -752798.54 0.490
NAME [Little River
Central]
-566927.97 0.606
NAME [Little River
Gardens]
-535133.93 0.674
NAME [Melrose] -540616.14 0.651
NAME [Miami Avenue] 193756.85 0.868
NAME [MIAMI BEACH] 437493.94 0.698
NAME [Morningside] -737979.39 0.511
NAME [North Grapeland
Heights]
-757215.62 0.531
NAME [North Grove] -472690.16 0.677
NAME [North Sewell Park] -642392.17 0.597
NAME [Northeast Overtown] -1257493.35 0.300
NAME [Northwestern
Estates]
-812926.56 0.455
NAME [Oakland Grove] -553001.52 0.707
NAME [Old San Juan] -673389.79 0.556
NAME [Orange Bowl] -138398.85 0.923
NAME [Orchard Villa] -956686.13 0.386
NAME [Palm Grove] -759414.31 0.545
NAME [Parkdale North] -481007.37 0.681
NAME [Parkdale South] -355666.66 0.760
NAME [Roads] -487052.80 0.667
NAME [San Marco Island] -1115216.80 0.391
NAME [Santa Clara] -553817.45 0.627
NAME [Shenandoah North] -319987.99 0.779
NAME [Shenandoah South] -342331.07 0.761
NAME [Shorecrest] -468654.19 0.664
NAME [Silver Bluff] -265593.35 0.813
NAME [South Grapeland
Heights]
-501550.91 0.680
NAME [South Grove] 159087.90 0.889
NAME [South Grove
Bayside]
1134342.69 0.329
NAME [South Sewell Park] -753769.10 0.528
NAME [Spring Garden] -1256818.34 0.326
NAME [West Grapeland
Heights]
-681900.10 0.580
NAME [West Grove] -411071.59 0.703
Observations 1711 1710
R2 / R2 adjusted 0.805 / 0.802 0.813 / 0.803

Table 4.1

Both regression models returned a R-squared value higher than 0.8; roughly 80% variations in housing prices might be explained by the changes in those predictors as a whole. Noticing that only one neighborhood in Miami/Miami Beach as a predictor, Biscayne Island, was considered statistically significant at a significant level of 0.05. We might anticipate that the two regression models might not differ so much. Predictors with p-values lower than 0.05 were marked in dark in the table, like pctBachelors, Zoning.cat, etc.

Single Test

Test Set Prediction Results - Baseline Regression
Test.bl.MAE 273051.1276805
Test.bl.MAPE 0.4891427
Test Set Prediction Results - Neighborhood Effects
Test.nhoods.MAE 274115.7410136
Test.nhoods.MAPE 0.5089556

Table 4.2

Both models returned a MAE around $280,000. The error is not trivial given the mean SalePrice for miami.test is $785,475.2. The MAPE suggested that both models erred by about 50% or more.

Cross Validation

Cross Validation - Baseline Regression Prediction Results
cv.bl.RMSE 908218.3018511
cv.bl.MAE 435678.6465503
cv.bl.Rsquared 0.7087511
Cross Validation - Neighborhood Effects Regression Prediction Results
cv.nhoods.RMSE 972989.3794234
cv.nhoods.MAE 448013.6326744
cv.nhoods.Rsquared 0.7085264

Table 4.3

The 100-k-fold cross validation tests indicated higher errors in our regression models. The models' ability in predicting prices in data sets their had never seen before were not satisfying. Generalizable? Not yet.

grid.arrange(nrow = 2, 
  cv.bl.results %>% 
    pivot_longer(-Resample) %>% 
    mutate(name = as.factor(name)) %>%
    ggplot(., aes(x = name, y = value, color = name)) +
    geom_jitter(width = 0.1) +
    facet_wrap(~name, ncol = 3, scales = "free") +
    labs(x = "Summary of OLS Baseline Regression - Cross Validation") + plotTheme(),
  cv.nhoods.results %>% 
    pivot_longer(-Resample) %>% 
    mutate(name = as.factor(name)) %>%
    ggplot(., aes(x = name, y = value, color = name)) +
    geom_jitter(width = 0.1) +
    facet_wrap(~name, ncol = 3, scales = "free") +
    labs(x = "Summary of OLS Neighborhood Effects Regression - Cross Validation", 
         caption = 'Figure 4.1') + plotTheme()
)

grid.arrange(nrow = 2,
             ggplot(as.data.frame(cv.bl.results), aes(MAE)) + 
  geom_histogram(bins = 50, colour="white", fill = "#FFD365") +
  labs(title="Baseline Regression", subtitle = "k-fold cross validation; k = 100",
       x="Mean Absolute Error (MAE)", y="Count") +
  plotTheme(),
  ggplot(as.data.frame(cv.nhoods.results), aes(MAE)) + 
  geom_histogram(bins = 50, colour="white", fill = "#FFD365") +
  labs(title="Neighborhood Effects", subtitle = "k-fold cross validation; k = 100",
       x="Mean Absolute Error (MAE)", y="Count", caption = 'Figure 4.2') +
  plotTheme()
)

# extract predictions from CV object
cv_preds.bl <- reg.cv.bl$pred
cv_preds.nhoods <- reg.cv.nhoods$pred

## Create dataset with "out of fold" predictions and original data
map_preds.bl <- miami.train %>% 
  rowid_to_column(var = "rowIndex") %>% 
  left_join(cv_preds.bl, by = "rowIndex") %>% 
  mutate(SalePrice.AbsError = abs(pred - SalePrice)) %>% 
  cbind(st_coordinates(.))
map_preds.nhoods <- miami.train.n %>% 
  rowid_to_column(var = "rowIndex") %>% 
  left_join(cv_preds.nhoods, by = "rowIndex") %>% 
  mutate(SalePrice.AbsError = abs(pred - SalePrice)) %>% 
  cbind(st_coordinates(.))

# CRS fix
st_crs(map_preds.bl) <- st_crs(miami.train)
st_crs(map_preds.nhoods) <- st_crs(miami.train.n)
# plot errors on a map
ggplot() + 
  geom_sf(data = nhoods, fill = "azure") +
  geom_sf(data = map_preds.bl, aes(colour = q5(SalePrice.AbsError)),
          show.legend = "point", size = 0.75) +
  scale_colour_manual(values = palette5,
                      labels=qBr(map_preds.bl,"SalePrice.AbsError"),
                      name="Quintile\nBreaks") +
  labs(title="Absolute sale price errors on the OOF set",
       subtitle = "OOF = 'Out Of Fold; Baseline Regression'",
       caption = 'Figure 4.3.1 | Data source: Zillow, OpenStreetMap, Miami-Dade County Open Data, US Census Bureau') +
  mapTheme()

ggplot() + 
  geom_sf(data = nhoods, fill = "azure") +
  geom_sf(data = map_preds.nhoods, aes(colour = q5(SalePrice.AbsError)),
          show.legend = "point", size = 0.75) +
  scale_colour_manual(values = palette5,
                      labels=qBr(map_preds.nhoods,"SalePrice.AbsError"),
                      name="Quintile\nBreaks") +
  labs(title="Absolute sale price errors on the OOF set",
       subtitle = "OOF = 'Out Of Fold'; Neighborhood Effects",
       caption = 'Figure 4.3.2 | Data source: Zillow, OpenStreetMap, Miami-Dade County Open Data, US Census Bureau') +
  mapTheme()

Predicted VS. Observed Home Prices

grid.arrange(nrow = 2,
ggplot(map_preds.bl, aes(pred, obs)) +
  geom_point(size = 0.75, colour = "black") +
  geom_line(data = map_preds.bl, aes(obs, obs),
              method = "lm", se = FALSE, size = 1, colour="orange") +
  stat_smooth(data = map_preds.bl, aes(obs, pred),
              method = "lm", se = FALSE, size = 1, colour='blue') +
  labs(title="Predicted sale price as a function of observed price",
       subtitle="Orange line: observation; Blue line: prediction", 
       x = 'Predicted Sale Price', y = 'Observed Sale Price') +
  plotTheme(),
ggplot(map_preds.nhoods, aes(pred, obs)) +
  geom_point(size = 0.75, colour = "black") +
  geom_line(data = map_preds.nhoods, aes(obs, obs),
              method = "lm", se = FALSE, size = 1, colour="orange") +
  stat_smooth(data = map_preds.nhoods, aes(obs, pred),
              method = "lm", se = FALSE, size = 1, colour='blue') +
  labs(title="Predicted sale price as a function of observed price",
       subtitle="Orange line: observation; Blue line: prediction", 
       x = 'Predicted Sale Price', y = 'Observed Sale Price',
       caption = 'Figure 4.4') +
  plotTheme()  
)

Spatial Correlation

k_nearest_neighbors = 5
#errors
coords.test.bl <-  st_coordinates(miami.test.bl) 
neighborList.test.bl <- knn2nb(knearneigh(coords.test.bl, k_nearest_neighbors))
spatialWeights.test.bl <- nb2listw(neighborList.test.bl, style="W")
miami.test.bl$lagPriceError <- lag.listw(spatialWeights.test.bl, miami.test.bl$SalePrice.AbsError)

coords.test.nhoods <-  st_coordinates(miami.test.nhoods) 
neighborList.test.nhoods <- knn2nb(knearneigh(coords.test.nhoods, k_nearest_neighbors))
spatialWeights.test.nhoods <- nb2listw(neighborList.test.nhoods, style="W")
miami.test.nhoods$lagPriceError <- lag.listw(spatialWeights.test.nhoods, miami.test.nhoods$SalePrice.AbsError)

# ggplot(miami.organize, aes(x=lagPrice, y=SalePrice)) +
#    geom_point(colour = "#FA7800") +
#    geom_smooth(method = "lm", se = FALSE, colour = "#25CB10") +
#    labs(title = "Price as a function of the spatial lag of price",
#         x = "Spatial lag of price (Mean price of 5 nearest neighbors)",
#         y = "Sale Price") +
#    plotTheme(),

grid.arrange(nrow = 2,
  ggplot(miami.test.bl, aes(x=lagPriceError, y=SalePrice)) +
      geom_point(colour = "#FA7800") +
      geom_smooth(method = "lm", se = FALSE, colour = "#25CB10") +
      labs(title = "Error as a function of the spatial lag of price",
           subtitle = "Baseline Regression",
           x = "Spatial lag of errors (Mean error of 5 nearest neighbors)",
           y = "Sale Price") +
      plotTheme(),
    ggplot(miami.test.nhoods, aes(x=lagPriceError, y=SalePrice)) +
      geom_point(colour = "#FA7800") +
      geom_smooth(method = "lm", se = FALSE, colour = "#25CB10") +
      labs(title = "Error as a function of the spatial lag of price",
           subtitle = "Neighborhood Effects",
           x = "Spatial lag of errors (Mean error of 5 nearest neighbors)",
           y = "Sale Price", caption = 'Figure 4.6') +
      plotTheme()
)

moranTest.bl <- moran.mc(miami.test.bl$SalePrice.AbsError, 
                      spatialWeights.test.bl, nsim = 999)
moranTest.nhoods <- moran.mc(miami.test.nhoods$SalePrice.AbsError, 
                      spatialWeights.test.nhoods, nsim = 999)

grid.arrange(nrow = 2,
ggplot(as.data.frame(moranTest.bl$res[c(1:999)]), aes(moranTest.bl$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest.bl$statistic), colour = "#FA7800",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Observed and permuted Moran's I - Baseline Regression",
       subtitle= "Observed Moran's I in orange",
       x="Moran's I",
       y="Count") + 
  plotTheme(),
ggplot(as.data.frame(moranTest.nhoods$res[c(1:999)]), aes(moranTest.nhoods$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest.nhoods$statistic), colour = "#FA7800",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Observed and permuted Moran's I - Neighborhood Effects",
       subtitle= "Observed Moran's I in orange",
       x="Moran's I",
       y="Count", caption = 'Figure 4.7') + 
  plotTheme()
)

rbind(c(" ", "Moran's I", "P-value"), 
      c("Baseline Regression", moranTest.bl$statistic, moranTest.bl$p.value),
      c("Neighborhood Effects", moranTest.nhoods$statistic, moranTest.nhoods$p.value)) %>%
        kbl(caption = 'Cross Validation - Neighborhood Effects Regression Prediction Results') %>%
  kable_classic_2(html_font = 'Cambria') %>%
  footnote(general_title = "\n",
           general = "Table 4.4")
Cross Validation - Neighborhood Effects Regression Prediction Results
statistic
Moran's I P-value
Baseline Regression 0.07417081242862 0.004
Neighborhood Effects 0.0812021365829354 0.001

Table 4.4

Neighborhood

# Predict Home Prices for all 
miami.preds.all.bl <- miami.organize %>%
  mutate(SalePrice.Predict = predict(reg.bl, miami.organize))

ggplot() +
  geom_sf(data = nhoods, fill = "azure") +
  geom_sf(data = miami.preds.all.bl, aes(colour = q5(SalePrice.Predict)),
          show.legend = "point", size = 0.75) +
  scale_colour_manual(values = palette5,
                      labels=qBr(miami.preds.all.bl,"SalePrice.Predict"),
                      name="Quintile\nBreaks") +
  labs(title="Predicted Home Prices",
       subtitle = "Both datasets for training and predicting - Baseline Regression",
       caption = 'Figure 4.8.1 | Data source: Zillow, OpenStreetMap, Miami-Dade County Open Data, US Census Bureau') +
  mapTheme()

# Predict Home Prices for all 
miami.preds.all.nhoods <- miami.organize.n %>%
  mutate(SalePrice.Predict = predict(reg.nhoods, miami.organize.n))

ggplot() +
  geom_sf(data = nhoods, fill = "azure") +
  geom_sf(data = miami.preds.all.nhoods, aes(colour = q5(SalePrice.Predict)),
          show.legend = "point", size = 0.75) +
  scale_colour_manual(values = palette5,
                      labels=qBr(miami.preds.all.nhoods,"SalePrice.Predict"),
                      name="Quintile\nBreaks") +
  labs(title="Predicted Home Prices - Neighborhood Effects",
       subtitle = "Both datasets for training and predicting - Baseline Regression",
       caption = 'Figure 4.8.2 | Data source: Zillow, OpenStreetMap, Miami-Dade County Open Data, US Census Bureau') +
  mapTheme()

miami.test.bl <- st_join(miami.test.bl, nhoods)
nhood_sum.bl <- miami.test.bl%>% 
    group_by(NAME) %>%
    summarize(meanPrice = mean(SalePrice, na.rm = T),
              meanPrediction = mean(SalePrice.Predict, na.rm = T),
              meanMAE = mean(SalePrice.AbsError, na.rm = T))

nhoods_MAPE.bl<-
  miami.test.bl %>%
  group_by(NAME) %>%
  summarize(mean.MAPE = mean(SalePrice.APE, na.rm = T),
            mean.SalePrice = mean(SalePrice, na.rm = T)) %>%
  st_set_geometry(NULL) %>%
  left_join(nhoods) %>%
  st_sf()

ggplot() + 
  geom_sf(data = nhoods, fill = "azure", size = 0.4) +
  geom_sf(data = nhoods_MAPE.bl, aes(fill = q5(mean.MAPE))) +
  geom_sf(data = miami.test, colour = "black", size = .5) +
  scale_fill_manual(values = palette5,
                    labels = qBr(nhoods_MAPE.bl, "mean.MAPE", rnd = F),
                    name="Quintile\nBreaks") + 
  labs(title = "Mean Test Set MAPE by Neighborhoods",
       subtitle = "Baseline Regression",
       caption = 'Figure 4.9.1 | Data source: Zillow, OpenStreetMap, Miami-Dade County Open Data, US Census Bureau') +
  mapTheme()

nhood_sum.nhoods <- miami.test.nhoods%>% 
    group_by(NAME) %>%
    summarize(meanPrice = mean(SalePrice, na.rm = T),
              meanPrediction = mean(SalePrice.Predict, na.rm = T),
              meanMAE = mean(SalePrice.AbsError, na.rm = T))

nhoods_MAPE.nhoods<-
  miami.test.nhoods %>%
  group_by(NAME) %>%
  summarize(mean.MAPE = mean(SalePrice.APE, na.rm = T),
            mean.SalePrice = mean(SalePrice, na.rm = T)) %>%
  st_set_geometry(NULL) %>%
  left_join(nhoods) %>%
  st_sf()

ggplot() + 
  geom_sf(data = nhoods, fill = "azure", size = 0.4) +
  geom_sf(data = nhoods_MAPE.nhoods, aes(fill = q5(mean.MAPE))) +
  geom_sf(data = miami.test, colour = "black", size = .5) +
  scale_fill_manual(values = palette5,
                    labels = qBr(nhoods_MAPE.nhoods, "mean.MAPE", rnd = F),
                    name="Quintile\nBreaks") + 
  labs(title = "Mean Test Set MAPE by Neighborhoods",
       subtitle = "Neighborhood Effects",
       caption = 'Figure 4.9.2 | Data source: Zillow, OpenStreetMap, Miami-Dade County Open Data, US Census Bureau') +
  mapTheme()

grid.arrange(nrow = 2,
ggplot(nhoods_MAPE.bl, aes(mean.MAPE, mean.SalePrice)) +
  geom_point(size = 2, colour = "red") +
  labs(title="MAPE by Neighborhood as a Function of Mean Price by Neighborhoods",
       subtitle = "Baseline Regression") +
  xlim(0, 1.5)+
  plotTheme(),
ggplot(nhoods_MAPE.nhoods, aes(mean.MAPE, mean.SalePrice)) +
  geom_point(size = 2, colour = "red") +
  labs(title="MAPE by Neighborhood as a Function of Mean Price by Neighborhoods",
       subtitle = "Neighborhood Effects", caption = 'Figure 4.10') +
  xlim(0, 1.5)+
  plotTheme()
)

For both regression models, the mean MAPE across Neighborhood groups seemed to spread out the space, though most points situated under $500,000 in price and resided in the MAPE range of 0.2~0.7.

bothRegressions <- 
  rbind(
    dplyr::select(miami.test.bl, starts_with("SalePrice"), Regression, NAME) %>%
      mutate(lagPriceError = lag.listw(spatialWeights.test.bl, SalePrice.Error)),
    dplyr::select(miami.test.nhoods, starts_with("SalePrice"), Regression, NAME) %>%
      mutate(lagPriceError = lag.listw(spatialWeights.test.nhoods, SalePrice.Error)))    

st_drop_geometry(bothRegressions) %>%
  gather(Variable, Value, -Regression, -NAME) %>%
  filter(Variable == "SalePrice.AbsError" | Variable == "SalePrice.APE") %>%
  group_by(Regression, Variable) %>%
    summarize(meanValue = mean(Value, na.rm = T)) %>%
    spread(Variable, meanValue) %>%
    kable(caption = 'Neighborhood Group Test') %>%
      kable_classic_2(html_font = 'Cambria') %>% 
  footnote(general_title = "\n",
           general = "Table 4.5")
Neighborhood Group Test
Regression SalePrice.AbsError SalePrice.APE
Baseline Regression 273051.1 0.4891427
Neighborhood Effects 274115.7 0.5089556

Table 4.5

From the perspective of neighborhood groups, the overall errors and percentages errors were shown in the table. We might consider the Baseline Regression model better catch the neighborhood patterns, only a little bit.

Generalizability Test

Is the model generalizable? Some models often returns high-accuracy results but can only be applied to a few selected training regions without the potential to fit the data outside of the training groups. To answer the question, we split the Miami/Miami Beach into groups:

  • Groups 1: Majority White VS Majority Non-White

  • Groups 2: High Income VS Low Income

tract18_gen_test <- tracts18_MDCounty %>%
  mutate(raceContext = ifelse(pctWhite > .5, "Majority White", "Majority Non-White"),
         incomeContext = ifelse(MedHHInc.inf > 50000, "High Income", "Low Income"))

grid.arrange(ncol = 2,
             ggplot() + geom_sf(data = na.omit(tract18_gen_test), aes(fill = raceContext)) +
               scale_fill_manual(values = c("#25CB10", "#FA7800"), name="Race Context") + 
               labs(title = "Race Context",
                    caption = 'Data source: US Census Bureau') + 
               mapTheme() + theme(legend.position='bottom'),
             ggplot() + geom_sf(data = na.omit(tract18_gen_test), aes(fill = incomeContext)) +
               scale_fill_manual(values = c("#25CB10", "#FA7800"), name="Income Context") + 
               labs(title = "Income Context",
                    caption = 'Data source: US Census Bureau') + 
               mapTheme() + theme(legend.position='bottom'))

Test set MAPE by neighborhood income context
Regression High Income Low Income <NA>
Baseline Regression 47% 50% 28%
Neighborhood Effects 47% 54% 25%
Test set MAPE by neighborhood racial context
Regression Majority Non-White Majority White
Baseline Regression 57% 47%
Neighborhood Effects 64% 47%

Table 4.6

For all income (except for N/A from which we could not extract valid data) and racial groups we explored here, the Baseline Regression model presented lower Mean Absolute Percent Error than the Neighborhood Effects model. And it seemed that both models performed better in the high-income group and the majority-white group. It might worthy to take a deep look into the difference between income and racial groups regarding the spatial patterns in housing and purchasing choices. However, MAPEvalues of more than 40% indicated that we were still far from satisfying models for Zillow's usage in predicting the marketing prices of housing units.

Export Prediction Products

Finally we could agree on a regression model to produce our prediction productions.

Note: the final regression model takes all training data into account without splitting - like we did previously.

reg.final <- lm(SalePrice ~ ., data = st_drop_geometry(miami.train) %>% 
             dplyr::select (SalePrice, Density, pctAfrAm, pctAsian,DistSexOP, 
                           pctBachelors, office_nn5,
                           saleYear,Stories.cat,
                           lagPrice,  DistLibrary, 
                           CoastDist, Zoning.cat, Mailing, TOD, Pool,Elevator,Property.City))
# Regression without training/testing split 
secret_data <- filter(miami.organize, toPredict == 1)
secret_preds <- predict(reg.final, newdata = secret_data)

output_preds <- data.frame(prediction = secret_preds, Folio = secret_data$Folio, team_name = 'MinimallyReproducible')
write.csv(output_preds, "MinimallyReproducible.csv")

5. Discussion

  • This model effectively predicted most prices (with a few 'negative' exceptions possibly due to extremes and the randomness in model).

  • We explored some variables attempting to understand the inner spatial correlation between urban context and house pricesthe distance to 5 neareast education facilities, etc.. we found that the prices were significantly influenced by the distance to school, benefiting from quite scattered educational resources; we also noticed that as a city by the sea, Miami is a well-known tourism destination for its golden coast and we found a strong negative correlation between the distance to coast and home sale price. Houses close to the coast are almost all over 1 million dollars; in addition, we found old houses are well blended with new houses in Miami. Dominated by the influence of the coast, it seems that both of the old and new houses were sold at a relatively higher price. It's hard to tell the correlation with a glance of eye, although the model seemed to catch the "close-to-coast" tendency of higher housing prices.

  • Generally, the predictors have the potential to account for about 78~79% variations in prices with R-squared values from 0.775 to 0.792. And the spatial clustring of high-price homes along the shore line is evident (including almost the whole city of Miami Beach, which can be seen from both its name and our maps.) Note: in cross validation process the R-squared values were about 70%. The model might face a hardship when dealing with brand new data sets.

  • We have a housing unit with its predicted price of about 90K dollars...which might be considered as the failure of modeling fitting or an extreme value in the data set. Both the baseline regression and the neighborhood effects models have a relatively high MAPE (about 50%) through all.

  • Besides, the neighborhoods in Miami & Miami Beach are kinds of dispersed, and deviated a lot from the assigned census tracts, which might cause the inconsistency in the models when dealing with variables from both neighborhoods effects and census tracts. Also the Miami Beach as a whole neighborhood might generate more errors. Resources are scarce, and the socio-economic conditions, development could differ a lot within a city. More sophisiticated modelling power and data sets can improve the performance of predictions.

6. Conclusion

  • I’ll not recommend this model to Zillow because of two defects. Firstly, there is still a possibility for this model to generate several negative predictions. The early edition of our model output over half of negative predictions. We tried several functions to solve it, such as dropping predictors with collinearity or a large negative coefficient. But this model is still not able to guarantee one hundred percent of positive predictions. We assume this can be the huge price differences in Miami. We tried to add some features to promote the prediction value of luxurious houses, while it also severely drags down the value of houses without such a feature.

  • Besides, we primarily want to add the average price of the nearby houses as a spatial predictor. However, our current function will count in the house got zero in price, which may decrease the average number. And we didn’t find a solution to this bug for now.