Motivation

The woefully inadequate uptake of the home repair tax credit program in Emil City has driven the city's Department of Housing and Community Development (HCD) to seek a better understanding of which eligible homeowners are willing to participate, and to target its resources more effectively. Based on client-level data collected from previous campaigns, we developed a decision-making analytic.

Our analytics rely on the following criteria/assumptions:

1. For each household predicted to take the tax credit, HCD is willing to allocate $2,850 per homeowner, covering staff and resources for mailers, phone calls, and information/counseling sessions at the HCD offices.

2. We assume 25% of contacted eligible homeowners (i.e., those who received the marketing allocation) will take the tax credit, while the remaining 75% will not.

3. The credit costs $5,000 per homeowner and can be used toward home improvement. Academic researchers in Philadelphia evaluated the program and found that houses that transacted after taking the credit sold with a $10,000 premium, on average.

Setup

Libraries used in this project: tidyverse, kableExtra, caret, knitr, pscl, plotROC, pROC, lubridate, gridExtra, and (for the regression summary table, loaded later) sjPlot, sjmisc, sjlabelled. Please see the hidden code chunk for more info.

Functions/Data Source: Public Policy Analytics by Ken Steif

options(scipen=9999)

library(tidyverse)
library(kableExtra)
library(caret)
library(knitr) 
library(pscl)
library(plotROC)
library(pROC)
library(lubridate)
library(gridExtra)

palette5 <- c("#981FAC","#CB0F8B","#FF006A","#FE4C35","#FE9900")
palette4 <- c("#981FAC","#FF006A","#FE4C35","#FE9900")
palette2 <- c("#981FAC","#FF006A")

root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"

source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

housingSubsidy <- read.csv(file.path(root.dir, "Chapter6", "housingSubsidy.csv"))

Exploratory Analysis

housingSubsidy %>%
  dplyr::select(y, age, previous, unemploy_rate, cons.price.idx, 
                cons.conf.idx, inflation_rate, spent_on_repairs) %>%
  gather(Variable, value, -y) %>%
  ggplot(aes(y, value, fill=y)) + 
    geom_bar(position = "dodge", stat = "summary", fun = "mean") + 
    facet_wrap(~Variable, scales = "free") +
    scale_fill_manual(values = palette2) +
    labs(x="Tax credit uptake", y="Mean", 
         title = "Feature associations with the likelihood of tax credit uptake",
         subtitle = "Continous Outcomes)") +
    plotTheme() + theme(legend.position = "none")
Figure 1.1(A)

From Figure 1.1(A) above, we can see that inflation_rate (a daily inflation indicator on the day of contact), previous (the number of times the household has been contacted before), and unemploy_rate (an unemployment indicator on the day of contact) are the variables most likely to correlate with a homeowner's decision to take up the home repair tax credit: their mean values differ substantially between the "yes" and "no" groups. The other variables have similar mean values across "yes" and "no".

housingSubsidy %>%
  dplyr::select(y,cons.conf.idx, inflation_rate, spent_on_repairs, cons.price.idx, age, campaign, unemploy_rate) %>%
  gather(Variable, value, -y) %>%
  ggplot() + 
  geom_density(aes(value, color=y), fill = "transparent") + 
  facet_wrap(~Variable, scales = "free") +
  scale_color_manual(values = palette2) +
  labs(title = "Feature distributions, credit uptake vs. no credit uptake",
       subtitle = "(continuous outcomes)") + plotTheme()

Figure 1.1(B)

Figure 1.1(B) shows the distributions of the continuous variables for the "yes" and "no" groups of tax credit uptake. The peaks and valleys of the two groups nearly match, so at this point we cannot tell whether these continuous variables are significantly correlated with the decision to take up the tax credit.

housingSubsidy %>%
  dplyr::select(y, taxbill_in_phl, mortgage, taxLien) %>%
  gather(Variable, value, -y) %>%
  count(Variable, value, y) %>%
  ggplot(aes(value, n, fill = y)) +   
    geom_bar(position = "dodge", stat="identity") +
    facet_wrap(~Variable, scales="free") +
    scale_fill_manual(values = palette2) +
    labs(x="Tax credit uptake", y="Count",
         title = "Feature associations with the likelihood of tax credit uptake",
         subtitle = "YES/NO/Unknown features; Purple - NO Credit Uptake, Red - Credit Uptake") +
    plotTheme() + theme(legend.position = 'none', axis.text.x = element_text(angle = 45, hjust = 1))

Figure 1.2

The three features shown in Figure 1.2, mortgage, taxbill_in_phl, and taxLien, are binary features (with a few "unknown" answers). Most households in the housingSubsidy dataset did not take up the tax credit, and within each feature the non-uptake group outnumbers the uptake group. Note that taxLien has no "yes" answers (whether the household has a lien against its property), so that feature is unlikely to help predict whether a household takes up the tax credit.

housingSubsidy %>%
  dplyr::select(y, job, marital,
                education, contact, month, day_of_week, campaign,
                pdays, poutcome) %>%
  gather(Variable, value, -y) %>%
  count(Variable, value, y) %>%
    ggplot(aes(value, n, fill = y)) +   
      geom_bar(position = "dodge", stat="identity") +
      facet_wrap(~Variable, scales = "free", ncol=3) +
      scale_fill_manual(values = palette2) +
      labs(x="Tax credit uptake", y="Count",
           title = "Feature associations with the likelihood of tax credit uptake",
           subtitle = "Multiple Category Features; Purple - NO Credit Uptake, Red - Credit Uptake") +
      plotTheme() + theme(legend.position = "none", axis.text.x = element_text(angle = 30, hjust = 1))

Figure 1.3

The multiple-category features produce somewhat messy patterns. "Yes" and "no" on tax credit uptake are distributed evenly across day_of_week, which is not surprising, since there is unlikely to be a "magic day" for marketing. poutcome is linked with pdays: most homeowners show "999" for pdays and "nonexistent" for poutcome, both indicating that the client has not been contacted previously. We will rely on the logistic regression model to understand the remaining variables' correlation with tax credit uptake.

Logistic Regression Model

A logistic regression model predicts a binary outcome - a 1 or a 0, Take Credit or Not Take Credit - and associates each feature with a coefficient describing the change in the probability of the outcome given a change in that independent variable.
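
As a minimal illustration of how coefficients map to probabilities, here is a small sketch with hypothetical coefficient values (not from our model): the model estimates log-odds, which plogis() converts to probabilities.

# Hypothetical log-odds: intercept -2, coefficient 0.8 on a feature x
b0 <- -2; b1 <- 0.8
plogis(b0 + b1 * 0)  # predicted probability of uptake at x = 0: ~0.12
plogis(b0 + b1 * 1)  # at x = 1: ~0.23; each unit of x multiplies the odds by exp(0.8)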

We first partition our data into a 65/35 training and test split (p = .65). We call these sets creditTrain1 and creditTest1.

set.seed(34567)
# stratify the split on taxLien and education so sparse categories
# appear in both the training and test sets
trainIndex1 <- createDataPartition(y = paste(housingSubsidy$taxLien, housingSubsidy$education),
                                   p = .65,
                                   list = FALSE,
                                   times = 1)
creditTrain1 <- housingSubsidy[ trainIndex1,]
creditTest1  <- housingSubsidy[-trainIndex1,]

Baseline Model ("Kitchen Sink")

We run our baseline model with the dependent variable y_numeric, using most of the available variables as predictors (see the dplyr::select call).

creditModel1 <- glm(y_numeric ~ .,
                    data = creditTrain1 %>% 
                      dplyr::select(-X, -y),
                    family = binomial(link = "logit"))

Feature-Engineered Model

We recoded age and education into more general categories to improve the model's performance. Instead of treating homeowner age as continuous, we categorize it into four groups: "YOUNG" for homeowners younger than 30; "MIDDLE_YOUNG" for those 30 to 44; "MIDDLE" for those 45 to 64; and "SENIOR" for those 65 and older. Similarly, the education attribute is classified into "BASIC" (basic.4y, basic.6y, basic.9y), "HIGHSCHOOL" (high.school), "HIGHER" (professional.course and university.degree), and "OTHER". With grouped attributes we may better capture patterns in whether homeowners take up the tax credit.

features_en <- housingSubsidy %>%
  mutate(ageGroup = case_when(age < 30 ~ "YOUNG",
                              age >= 30 & age < 45 ~ "MIDDLE_YOUNG",
                              age >= 45 & age < 65 ~ "MIDDLE",
                              age >= 65 ~ "SENIOR"),
         eduGroup = case_when(education == 'basic.4y' | education == 'basic.6y' | education == 'basic.9y' ~ "BASIC",
                              education == 'high.school' ~ "HIGHSCHOOL",
                              education == 'professional.course' | education == 'university.degree' ~ "HIGHER",
                              TRUE ~ 'OTHER')
         # unemploy_rate.cat = case_when(unemploy_rate >= 0 ~ 'UP', TRUE ~ "DOWN")
         # inflationGroup = case_when(inflation_rate >= 3 ~ 'HIGH_INF', 
         #                            TRUE ~ "LOW_INF")
         # pdays.cat = case_when(pdays < 999 ~ "CONTACTED",
         #                       pdays == 999 ~ "NEVER")
         )
features_en %>%
  dplyr::select(y, ageGroup, eduGroup) %>%
  gather(Variable, value, -y) %>%
  count(Variable, value, y) %>%
    ggplot(aes(value, n, fill = y)) +   
      geom_bar(position = "dodge", stat="identity") +
      facet_wrap(~Variable, scales = "free", ncol=3) +
      scale_fill_manual(values = palette2) +
      labs(x="Tax credit uptake", y="Count",
           title = "Feature associations with the likelihood of tax credit uptake",
           subtitle = "Engineered new features; Purple - NO Credit Uptake, Red - Credit Uptake") +
      plotTheme() + theme(legend.position = "None", axis.text.x = element_text(angle = 30, hjust = 1))
Figure 2.1

We also removed some variables that appeared to have little influence on a household's decision to take up the tax credit, as seen in the figures in the Exploratory Analysis section.

Variables removed: poutcome, age, education, taxLien, previous, mortgage, unemploy_rate (age and education are replaced by their grouped versions).

set.seed(345678)
trainIndex2 <- createDataPartition(y = paste(features_en$eduGroup,
                                             features_en$ageGroup),
                                   p = .65,
                                   list = FALSE,
                                   times = 1)

creditTrain2 <- features_en[ trainIndex2,]
creditTest2  <- features_en[-trainIndex2,]

creditModel2 <- glm(y_numeric ~ .,
                    data = creditTrain2 %>% 
                      dplyr::select(-X, -y, -poutcome, -age, -education, -taxLien, 
                                    -previous, -mortgage, -unemploy_rate),
                    family = binomial(link = "logit"))

We create dataframes of predictions for the observations in our test sets, called testProbs1 and testProbs2.

These predictions are the estimated probabilities of tax credit uptake for these out-of-sample subjects.

testProbs1 <- data.frame(Outcome = as.factor(creditTest1$y_numeric),
                        Probs = predict(creditModel1, creditTest1, type= "response"))
testProbs2 <- data.frame(Outcome = as.factor(creditTest2$y_numeric),
                        Probs = predict(creditModel2, creditTest2, type= "response"))
testProbs1 <- 
  testProbs1 %>%
  mutate(predOutcome = as.factor(ifelse(Probs > 0.5, 1, 0)))
testProbs2 <- 
  testProbs2 %>%
  mutate(predOutcome = as.factor(ifelse(Probs > 0.5, 1, 0)))

For the "kitchen sink" model:

caret::confusionMatrix(testProbs1$predOutcome, testProbs1$Outcome, 
                       positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1242  133
##          1   24   33
##                                              
##                Accuracy : 0.8904             
##                  95% CI : (0.873, 0.9061)    
##     No Information Rate : 0.8841             
##     P-Value [Acc > NIR] : 0.2431             
##                                              
##                   Kappa : 0.2516             
##                                              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.19880            
##             Specificity : 0.98104            
##          Pos Pred Value : 0.57895            
##          Neg Pred Value : 0.90327            
##              Prevalence : 0.11592            
##          Detection Rate : 0.02304            
##    Detection Prevalence : 0.03980            
##       Balanced Accuracy : 0.58992            
##                                              
##        'Positive' Class : 1                  
## 

For the feature-engineered model:

caret::confusionMatrix(testProbs2$predOutcome, testProbs2$Outcome, 
                       positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1252  118
##          1   17   48
##                                               
##                Accuracy : 0.9059              
##                  95% CI : (0.8896, 0.9205)    
##     No Information Rate : 0.8843              
##     P-Value [Acc > NIR] : 0.00495             
##                                               
##                   Kappa : 0.3749              
##                                               
##  Mcnemar's Test P-Value : < 0.0000000000000002
##                                               
##             Sensitivity : 0.28916             
##             Specificity : 0.98660             
##          Pos Pred Value : 0.73846             
##          Neg Pred Value : 0.91387             
##              Prevalence : 0.11568             
##          Detection Rate : 0.03345             
##    Detection Prevalence : 0.04530             
##       Balanced Accuracy : 0.63788             
##                                               
##        'Positive' Class : 1                   
## 

The sensitivity increases by about 0.09 (a 45.4% relative gain, from 0.199 to 0.289) and the specificity increases by about 0.006 (0.6%, from 0.981 to 0.987). Note that a model's sensitivity is the proportion of actual positives (1's) that were predicted to be positive, and its specificity is the proportion of actual negatives (0's) that were predicted to be negative. We will study these outcomes further later in this project.
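
To make these definitions concrete, here is a minimal sketch that recomputes both metrics directly from testProbs2; it should reproduce the caret figures above.

testProbs2 %>%
  summarize(sensitivity = sum(predOutcome == 1 & Outcome == 1) / sum(Outcome == 1),
            specificity = sum(predOutcome == 0 & Outcome == 0) / sum(Outcome == 0))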

# Regression results output
library(sjPlot)
library(sjmisc)
library(sjlabelled)

tab_model(creditModel1, creditModel2, auto.label = TRUE, show.ci = FALSE, show.intercept = FALSE, 
          title = 'Summary of Logistic Regression Results',
          dv.labels = c("Kitchen Sink", "Feature Engineering"),
          string.pred = "Coefficient")

We exponentiate each coefficient to produce an 'odds ratio', interpreted as: all else equal, a feature with an odds ratio below 1 reduces the odds of taking up the tax credit, while one above 1 increases them. For continuous variables, a one-unit increase in the independent variable multiplies the odds of tax credit uptake by the odds ratio.
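
As a sketch, these odds ratios (and profile-likelihood 95% confidence intervals, which may take a moment to compute) can be derived directly from the fitted model:

# Exponentiate the log-odds coefficients to obtain odds ratios
exp(cbind(OddsRatio = coef(creditModel2), confint(creditModel2)))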

Cross Validation

ctrl <- trainControl(method = "cv", number = 100, 
                     classProbs=TRUE, summaryFunction=twoClassSummary)

cvFit1 <- train(y~., data = housingSubsidy %>% 
                                   dplyr::select(-X, -y_numeric), 
                method="glm", family="binomial",
                metric="ROC", trControl = ctrl)

For the "kitchen sink" all-in model:

## Generalized Linear Model 
## 
## 4119 samples
##   19 predictor
##    2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 4078, 4077, 4079, 4079, 4077, 4077, ... 
## Resampling results:
## 
##   ROC        Sens       Spec 
##   0.7674159  0.9806006  0.221
cvFit2 <- train(y ~ ., data = features_en %>% 
                  dplyr::select(-X, -y_numeric, -poutcome, -age, -education, -taxLien, 
                                -previous, -mortgage, -unemploy_rate), 
                method = "glm", family = "binomial",
                metric = "ROC", trControl = ctrl)

For the feature engineering model:

## Generalized Linear Model 
## 
## 4119 samples
##   14 predictor
##    2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 4078, 4079, 4079, 4079, 4079, 4078, ... 
## Resampling results:
## 
##   ROC        Sens       Spec  
##   0.7651269  0.9828529  0.2285

The faceted plots of ROC, Sensitivity, and Specificity for both models are shown below.

dplyr::select(cvFit1$resample, -Resample) %>%
  gather(metric, value) %>%
  left_join(gather(cvFit1$results[2:4], metric, mean)) %>%
  ggplot(aes(value)) + 
    geom_histogram(bins=35, fill = "#FF006A") +
    facet_wrap(~metric) +
    geom_vline(aes(xintercept = mean), colour = "#981FAC", linetype = 3, size = 1.5) +
    scale_x_continuous(limits = c(0, 1)) +
    ylim(low = 0, high = 60) +
    labs(x="Goodness of Fit", y="Count", title="CV Goodness of Fit Metrics - Kitchen Sink",
         subtitle = "Across-fold mean reprented as dotted lines")

Figure 2.2(A)

The 100-fold cross-validation results of the two models look similar in the shapes of their ROC, Sensitivity, and Specificity distributions. For ROC, the second model (Figure 2.2(B)) is less dispersed around its peak value than the first. For Sensitivity, the first model's peak is higher, though the two models have very close outcomes. For Specificity, the first model's peak sits above its mean while the second model's sits below its mean.

dplyr::select(cvFit2$resample, -Resample) %>%
  gather(metric, value) %>%
  left_join(gather(cvFit2$results[2:4], metric, mean)) %>%
  ggplot(aes(value)) + 
    geom_histogram(bins=35, fill = "#FF006A") +
    facet_wrap(~metric) +
    geom_vline(aes(xintercept = mean), colour = "#981FAC", linetype = 3, size = 1.5) +
    scale_x_continuous(limits = c(0, 1)) +
    ylim(low = 0, high = 60) +
    labs(x="Goodness of Fit", y="Count", title="CV Goodness of Fit Metrics - Feature Engineering",
         subtitle = "Across-fold mean reprented as dotted lines")

Figure 2.2(B)

Receiver Operating Characteristic (ROC) Curve

ggplot(testProbs2, aes(d = as.numeric(as.character(Outcome)), m = Probs)) +
  geom_roc(n.cuts = 50, labels = FALSE, colour = "#FE9900") +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
  labs(title = "ROC Curve - Tax Credit Uptake Model")

auc(testProbs2$Outcome, testProbs2$Probs)
## Area under the curve: 0.7926

Figure 2.3

The Receiver Operating Characteristic (ROC) curve illustrates the trade-off between the true positive rate and the false positive rate across classification thresholds, and offers a visual measure of goodness of fit. The curve sits well above the diagonal 'coin flip' line (along which every true positive rate comes with an equal false positive rate). Reading off the curve, a threshold that correctly identifies about 70% of actual uptakes comes at the cost of falsely flagging roughly 25% of non-uptakes.
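
To read a specific operating point off the curve, pROC::coords can be used; a short sketch, where the 70% sensitivity target is an illustrative choice rather than part of the original analysis:

rocObj <- pROC::roc(testProbs2$Outcome, testProbs2$Probs)
# threshold and specificity at the point where sensitivity is ~0.70
pROC::coords(rocObj, x = 0.70, input = "sensitivity",
             ret = c("threshold", "sensitivity", "specificity"))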

The 'Area Under the Curve' (AUC) metric for the feature-engineered model is 0.7926. This is a reasonable fit: an AUC of 0.5 is no better than a coin flip and 1.0 is a perfect classifier, so a useful model should fall within (0.5, 1).

Cost-Benefit Analysis

Here we define the cost/benefit equations for the four confusion matrix outcomes as follows:

True_Negative: Count * 0 (no campaign cost and no benefit).

True_Positive: Count * 0.25 * (10000 (premium when sold) + 56000 (surrounding aggregate premium) - 2850 (campaign cost) - 5000 (credit)).

False_Negative: Count * 0 (zeroed out manually; these uptakes are not related to the campaign).

False_Positive: Count * (-2850) (campaign cost, with no credit allocated).

cost_benefit_table <-
   testProbs2 %>%
      count(predOutcome, Outcome) %>%
      summarize(True_Negative = sum(n[predOutcome==0 & Outcome==0]),
                True_Positive = sum(n[predOutcome==1 & Outcome==1]),
                False_Negative = sum(n[predOutcome==0 & Outcome==1]),
                False_Positive = sum(n[predOutcome==1 & Outcome==0])) %>%
       gather(Variable, Count) %>%
       mutate(Revenue =
               ifelse(Variable == "True_Negative", Count * 0,
               ifelse(Variable == "True_Positive", (10000+56000-5000-2850) * Count * 0.25,
               ifelse(Variable == "False_Negative", Count * 0,
               ifelse(Variable == "False_Positive", (-2850) * Count, 0))))) %>%
    bind_cols(data.frame(Description = c(
              "Predicted correctly homeowner would not take the credit, no marketing resources were allocated, and no credit was allocated", 
              "Predicted correctly homeowner would take the credit; allocated the marketing resources, and 25% took the credit",
              "We predicted that a homeowner would not take the credit but they did - probably unrelated to the marketing campaign",
              "Predicted incorrectly homeowner would take the credit; allocated marketing resources; no credit allocated")))
kable(cost_benefit_table,
      caption = "Cost/Benefit Table") %>% 
  kable_styling(font_size = 12, full_width = F,
                bootstrap_options = c("striped", "hover", "condensed"))
Cost/Benefit Table
Variable Count Revenue Description
True_Negative 1252 0 Predicted correctly homeowner would not take the credit, no marketing resources were allocated, and no credit was allocated
True_Positive 48 697800 Predicted correctly homeowner would take the credit; allocated the marketing resources, and 25% took the credit
False_Negative 118 0 We predicted that a homeowner would not take the credit but they did - probably unrelated to the marketing campaign
False_Positive 17 -48450 Predicted incorrectly homeowner would take the credit; allocated marketing resources; no credit allocated
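
As a quick sanity check, the two non-zero rows can be re-derived from the assumptions above:

0.25 * 48 * (10000 + 56000 - 2850 - 5000)  # True_Positive revenue: 697,800
-2850 * 17                                 # False_Positive revenue: -48,450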

Here is a function to iterate through the thresholds.

iterateThresholds <- function(data) {
  # sweep classification thresholds from 0.01 to 1 in 0.01 steps,
  # recomputing confusion matrix counts and revenue at each threshold
  x = .01
  all_prediction <- data.frame()
  while (x <= 1) {
    this_prediction <-
        data %>%
        mutate(predOutcome = ifelse(Probs > x, 1, 0)) %>%
        count(predOutcome, Outcome) %>%
        summarize(True_Negative = sum(n[predOutcome==0 & Outcome==0]),
                  True_Positive = sum(n[predOutcome==1 & Outcome==1]),
                  False_Negative = sum(n[predOutcome==0 & Outcome==1]),
                  False_Positive = sum(n[predOutcome==1 & Outcome==0])) %>%
       gather(Variable, Count) %>%
       mutate(Revenue =
               ifelse(Variable == "True_Negative", Count * 0,
               ifelse(Variable == "True_Positive", (10000+56000-5000-2850) * Count * 0.25,
               ifelse(Variable == "False_Negative", Count * 0,
               ifelse(Variable == "False_Positive", (-2850) * Count, 0)))),
              Threshold = x)
    
    all_prediction <- rbind(all_prediction, this_prediction)
    x <- x + .01
  }
return(all_prediction)
}
whichThreshold <- iterateThresholds(testProbs2)
whichThreshold %>%
  ggplot(.,aes(Threshold, Revenue, colour = Variable)) +
  geom_point() +
  scale_colour_manual(values = palette5[c(5, 1:3)]) +    
  labs(title = "Profit by confusion matrix type and threshold",
       y = "Profit") +
  plotTheme() +
  guides(colour=guide_legend(title = "Confusion Matrix")) 

Figure 3.1

We would like to create plots showing Total_Revenue and Total_Count_of_Credits, respectively, as functions of Threshold.

whichThreshold_sum <- 
  whichThreshold %>% 
    mutate(actualCredit = ifelse(Variable == "True_Positive", (Count * .25),
                          ifelse(Variable == "False_Negative", Count, 0) )) %>%  
   group_by(Threshold) %>% 
    summarize(Total_Revenue = sum(Revenue),
              Total_Count_of_Credits = sum(actualCredit)) 

optimalThreshold <- pull(arrange(whichThreshold_sum, -Total_Revenue)[1,1])
fiftyPercentThreshold <- 0.5

whichThreshold_table <- subset(whichThreshold_sum, Threshold == as.character(fiftyPercentThreshold) | 
                                 Threshold == as.character(optimalThreshold)) %>% kable() %>%
  kable_styling(font_size = 12, full_width = F,
                bootstrap_options = c("striped", "hover", "condensed")) 

whichThreshold_table
Threshold Total_Revenue Total_Count_of_Credits
0.13 1052475 89.5
0.50 649350 130.0

A threshold of 13% is optimal, yielding the greatest revenue at $1,052,475. Beyond that mark, forgone revenue from missed uptakes (False Negatives) begins to mount (see Figure 3.2 below), while the total count of credits keeps increasing.

grid.arrange(ncol = 2,
  ggplot(whichThreshold_sum)+
    geom_line(aes(x = Threshold, y = Total_Revenue))+
    geom_vline(xintercept =  optimalThreshold)+
    labs(title = "Model Revenues and Counts of Credit By Threshold - Test",
         subtitle = "Vertical Line Denotes Optimal Threshold") + plotTheme(),
  ggplot(whichThreshold_sum)+
    geom_line(aes(x = Threshold, y = Total_Count_of_Credits))+
    geom_vline(xintercept =  optimalThreshold)+
    labs(title = ' ', subtitle = "Vertical Line Denotes Optimal Threshold") + plotTheme())

Figure 3.2

Below the optimal threshold of 0.13, predicted revenue climbs from negative values to its maximum above $1 million; above it, revenue plummets toward zero. The total count of credits, by contrast, grows steadily across the entire range of thresholds.

Conclusion

We would not recommend this model be adopted into production as-is. At the optimal threshold, the model does produce more revenue ($1,052,475) than the default 50% threshold ($649,350), which could improve the performance of the housing subsidy tax credit program in its current mode. However, the model is far better at predicting True Negatives than True Positives: it helps reduce wasted marketing investment (campaigns, publicity, etc.), but given the aim of maximizing profits, correctly identifying True Positives should be the ultimate goal. Most importantly, the model's sensitivity remains low (about 0.29 on the single test sample), though improving it is difficult given the limited number of homeowners who actually took up the housing subsidy tax credit.