Skip to content

brendanoct/nyc-house-prices

Repository files navigation

Predicting House Sale Prices in NYC

Files

Predicting-House-Sale-Prices-in-NYC_files/figure-gfm/: Contains images to include in README.md.

NYC Property Data 2021 Cleaned.csv: Cleaned data set that was used throughout analysis

Predicting House Sale Prices in NYC.Rmd: Original .Rmd file where I created code

Predicting-House-Sale-Prices-in-NYC.md: GitHub friendly version

Introduction

New York City is not just the city that we know it all for based on movies - it is a very large city with a lot to offer for those who live there. With the 5 different boroughs in New York City (Manhattan, Bronx, Brooklyn, Queens and Staten Island), there are almost 9 million residents, all of whom need somehwere to reside.

Coming from Australia, due to the COVID-19 pandemic and work from home orders, there have been a large number of people choosing to move away from major cities in cramped apartments, and opt to live in a much more spacious home. Due to the rapidly increasing prices for inner-city residence, and also the desire to live a more spacious lifestyle, the Australian Bureau of Statistics confirmed that capital cities saw a net loss of 11,200 people in the September quarter of 2020 alone.

Our Data

Analysing data from the NYC Department of Finance, it would be interesting to explore if a related trend is occurring in New York City through the following question:

  1. How well does the size of a house explain or predict sale price across New York City?

NYC Department of Finance uploaded the data for each individual borough, so it was important to clean the data and merge it all so that we could perform a meaningful analysis. We will be focusing specifically on houses who fall under the “A1” category, which are two story homes that are not connected to any other residence, and considered small or medium, with or without an attic. Unfortunately, but also unsurprisingly, Manhattan only had 3 of these kinds of homes, therefore it has been excluded from our analysis.

#Loading our packages
set.seed(1)
library(tidyverse)
library(knitr)
library(janitor)
library(readxl)
library(tidymodels)

#Loading the data
#Manhattan 
manhattan <- read_excel("rollingsales_manhattan.xlsx",
                        #Using skip = 4 to skip the first 4 rows with unnecessary data
                        skip = 4)
#Bronx
bronx <- read_excel("rollingsales_bronx.xlsx",
                    skip = 4)
#Brooklyn
brooklyn <- read_excel("rollingsales_brooklyn.xlsx",
                       skip = 4)
#Queens
queens <- read_excel("rollingsales_queens.xlsx",
                     skip = 4)
#Staten Island
staten_island <- read_excel("rollingsales_statenisland.xlsx",
                            skip = 4)

#Creating our merged data
nyc_property_data <- bind_rows(manhattan, bronx, brooklyn, queens, staten_island) %>%
  #Ensuring all of our column names are lower case and separated by "_"
  clean_names() %>%
  mutate(
    #Our data has each borough labelled with a number. We will use case_when() to change each number to the
    #corresponding borough name
    borough = case_when(borough == 1 ~ "manhattan",
                        borough == 2 ~ "bronx",
                        borough == 3 ~ "brooklyn",
                        borough == 4 ~ "queens",
                        borough == 5 ~ "staten_island"),
    #Since the below columns are all capitalised, we will ensure they are in "Title Case" form
    neighborhood = str_to_title(neighborhood),
    building_class_category = str_to_title(building_class_category),
    address = str_to_title(address)) %>%
  #Removing columns that are not needed for our analysis
  select(-easement, -apartment_number) %>%
  #Removing any houses sold below $10,000. These are assumed to be transfers of ownership
  filter(sale_price > 10000) %>%
  #Removing any duplicates
  distinct() %>%
  #Removing any observations where we do not have information for our important columns
  drop_na(c("sale_price", "gross_square_feet")) %>%
  #Ordering data alphabetically by borough
  arrange(borough) %>%
  #Creating a new .csv file with our cleaned data
  write_csv("NYC Property Data 2021 Cleaned") 

#Creating a new dataframe containing just houses
nyc_houses <- nyc_property_data %>%
  #Under this column, the corresponding code for two story detached houses 
  #(small or moderate, with or without attic) is A1
  filter(building_class_at_time_of_sale == "A1",
         #Manhattan only has 3 houses so we will not be including them in our analysis
         borough != "manhattan")

nyc_houses %>%
  head() %>%
  kable
borough neighborhood building_class_category tax_class_at_present block lot building_class_at_present address zip_code residential_units commercial_units total_units land_square_feet gross_square_feet year_built tax_class_at_time_of_sale building_class_at_time_of_sale sale_price sale_date
bronx Bathgate 01 One Family Dwellings 1 3028 24 A1 410 East 179th Street 10457 1 0 1 1842 2048 1901 1 A1 600000 2021-01-15
bronx Bathgate 01 One Family Dwellings 1 3039 63 A1 469 East 185th Street 10458 1 0 1 1650 1296 1910 1 A1 455000 2021-12-23
bronx Bathgate 01 One Family Dwellings 1 3045 12 A1 2052 Bathgate Avenue 10457 1 0 1 2340 1516 1910 1 A1 580000 2021-07-22
bronx Baychester 01 One Family Dwellings 1 4714 32 A1 1409 Oakley Street 10469 1 0 1 2375 1395 1935 1 A1 540000 2021-07-20
bronx Baychester 01 One Family Dwellings 1 4729 10 A1 1278 East 222 Street 10469 1 0 1 3189 1408 1901 1 A1 265000 2021-08-20
bronx Baychester 01 One Family Dwellings 1 4767 18 A1 3025 Kingsland Avenue 10469 1 0 1 4750 4374 1920 1 A1 715000 2021-04-01

Methodology: Linear Regression and Cross Validation

Linear regression aims to use a linear approach to model the relationship between the dependent variable and one or more explanatory variable. The most common method for fitting a linear regression model is the least squares criterion, which is what the lm() function in R uses. By using this function, we are aiming to minimise the residual sum of squares (RSS) through the choices of our coefficients in our model. Intuitively, this makes sense, as we are aiming to model our relationship in such a way where our residuals (or in other words errors) are minimised, to get the most accurate explanation/prediction of our dependent variable.

After using linear regression to create our model, we can check how well our model is performing through the use of cross validation, or more specifically k-fold cross validation. Instead of doing a simple train/test split (where we split our data in such a way where we “train” our model on our training split, and then “test” the skill of model on our testing split), we are doing this multiple times. Through k-fold cross validation, we will be partitioning (fold) the data into k groups. Then, we will take one group as the test split, and the remainder as our training split. We will fit a model on this training split, then evaluate it on our test split. After retaining our metrics from this one train/test split, we will repeat the process, using our next group as the test split, then using the remainder again as our training split.

This is more ideal than one train/test split because it allows our model to train on multiple train/test splits, giving us a better indication of how well our model performs on unseen data, as well as a better estimate of our out-of-sample accuracy.

Model

Determining model

Firstly, after reviewing our data, and taking into account the question we posed earlier, we have decided to create plots to visualise the relationship between gross square feet and sale price, as well as land square feet and sale price.

As per the glossary provided by NYC Department of Finance, the gross square feet can be defined as “The total area of all the floors of a building as measured from the exterior surfaces of the outside walls of the building, including the land area and space within any building or structure on the property”, while the land square feet can be defined as “The land area of the property listed in square feet”.

gross_sqft_sale_price_plot <- nyc_houses %>% 
  mutate(Borough = factor(borough, 
                          labels = c("Bronx", "Brooklyn", "Queens", "Staten Island"))) %>%
  ggplot(aes(x = gross_square_feet,
             y = sale_price)) +
  geom_point(aes(colour = Borough)) +
  geom_smooth(method = "lm",
              se = FALSE) +
  labs(x = "Gross Square Feet",
       y = "Sale Price",
       title = "Gross Square Feet and Sale Price for each borough") +
  theme_bw() +
  scale_y_continuous(labels = scales::comma) +
  facet_wrap(~borough,
             scales = "free") 
land_sqrft_sale_price_plot <- nyc_houses %>% 
  mutate(Borough = factor(borough, 
                          labels = c("Bronx", "Brooklyn", "Queens", "Staten Island"))) %>% 
  ggplot(aes(x = land_square_feet,
             y = sale_price)) +
  geom_point(aes(colour = Borough)) +
  geom_smooth(method = "lm",
              se = FALSE) +
  labs(x = "Land Area in Square Feet",
       y = "Sale Price",
       title = "Land Area and Sale Price for each borough") +
  theme_bw() +
  scale_y_continuous(labels = scales::comma) +
  facet_wrap(~borough,
             scales = "free") 
gross_sqft_sale_price_plot

land_sqrft_sale_price_plot

tibble("Pearson's Correlation Coefficient for Gross Square Feet" = cor(nyc_houses$gross_square_feet, nyc_houses$sale_price)) %>%
  kable()
Pearson’s Correlation Coefficient for Gross Square Feet
0.5303782
tibble("Pearson's Correlation Coefficient for Land Square Feet" = cor(nyc_houses$land_square_feet, nyc_houses$sale_price)) %>%
  kable()
Pearson’s Correlation Coefficient for Land Square Feet
0.2842535

As we can see above, there is generally a moderate positive relationship between the gross square feet and sale price (with a correlation coefficient of 0.53), as well as a weak to moderate positive relationship between land square feet and sale price (with a correlation coefficient of 0.28). Besides the clear outliers, there is evidence here to suggest that there is a linear relationship between our variables.

As a result, we will be proceeding with a model that includes gross square feet, land square feet and borough as our explanatory variables.

Evaluating model

#Splitting the data
houses_split <- initial_split(nyc_houses)
#Using the split data to create our training set
houses_train <- training(houses_split)
#Using the split data to create our test set
houses_test <- testing(houses_split)

#Setting up our inference engine
spec <- linear_reg() %>%
  set_engine("lm")

#Fitting our model to the training data
lm_fit <- spec %>%
  fit(sale_price ~ gross_square_feet + land_square_feet + borough,
      data = houses_train)

lm_fit %>%
  tidy() %>%
  kable()
term estimate std.error statistic p.value
(Intercept) -68757.52620 27872.642262 -2.466846 0.0136753
gross_square_feet 355.23626 11.794547 30.118686 0.0000000
land_square_feet 39.28518 3.735677 10.516216 0.0000000
boroughbrooklyn 538455.15997 27289.883108 19.730944 0.0000000
boroughqueens 155469.51673 23415.042376 6.639728 0.0000000
boroughstaten_island -63864.08770 25649.993311 -2.489829 0.0128235

Looking at the p-values for all of our coefficients from our model, they are all below 0.05. As a result, if we were to complete a hypothesis test at a 5% level of significance, there is sufficient evidence to suggest that there is a relationship between gross square feet and sale price, land square feet and sale price and borough and sale price.

Therefore, our model moving forward is confirmed to be:

sale_price ~ gross_square_feet + land_square_feet + borough + year_built

Residuals of our model

After determining our model, and training our model on our training split, we will use it to test its skill on our test set. We will be looking at the residuals of our predictions. This is important to do so we can ensure our residuals are approximately normally distributed (appear as a bell curve shape). This is an assumption of a linear regression model, and if this is not the case, we are unable to make accurate inferences and predictions.

#We will use the predict function to create our predictions on the test set
#Then we will add this to our test set dataframe
houses_pred <- houses_test %>%
  bind_cols(predict(lm_fit, houses_test))

#Creating a new column containing our residuals
houses_pred <- houses_pred %>%
  mutate(residuals = sale_price - .pred)

#It is important to plot our residuals to see if they are normally distributed
residual_plot <- houses_pred %>%
  ggplot(aes(x = residuals)) +
  geom_histogram() +
  labs(x = "Residuals",
       y = "Count",
       title = "Residual Plot") +
  scale_x_continuous(labels = scales::comma) +
  theme_bw()

residual_plot

However, we can see that there is normality in the residuals (minus a few outliers), meaning that we can proceed with our analysis.

Evaluating Model

EDIT: I now understand when the RMSE or the MAE will be a more meaningful measure of error for models. RMSE penalises larger errors in our model since RMSE squares the errors first before taking the square root of them. Since our model aims to predict the sale prices of houses, higher errors are not as important as a scenario, for example, where patient’s lives are at risks, therefore the MAE is more useful in our situation.

We can also calculate certain metrics to tell us how well our model predicts the sale price of these houses.

houses_pred %>%
  metrics(truth = sale_price,
          estimate = .pred) %>%
  knitr::kable(digits = 2)
.metric .estimator .estimate
rmse standard 390327.6
rsq standard 0.4
mae standard 213907.5

The MAE tells us that on average, our model will predict the sale price of the house with a distance of $213,907.50, or in other words, we will predict the price with an average error of $213,907.50, either more or less.

Cross Validation

Finally, we will conduct cross validation to further check the skill of our model.

#Cross Validation

#Preparing our training dataframe that contains the cross validation partitions
folds <- vfold_cv(nyc_houses,
                  v = 10)

#Fitting the data on our cross validation folds
fit_cv <- fit_resamples(spec,
                        sale_price ~ gross_square_feet + land_square_feet + borough + year_built,
                        resamples = folds)

#Getting our average metrics across each fold
collect_metrics(fit_cv) %>%
    knitr::kable(digits = 2)
.metric .estimator mean n std_err .config
rmse standard 370127.03 10 22400.59 Preprocessor1_Model1
rsq standard 0.43 10 0.02 Preprocessor1_Model1

After performing cross validation, we have calculated the mean of each fold’s root mean squared error (RMSE) and R2 (RSQ). Firstly, our R2 tells us that approximately 43% of the variability in sale price can be explained by gross square feet, land square feet and which borough the house is in. While a number closer to 1 is more ideal, this doesn’t necessarily mean it is a bad model.

At the same time, looking at our average RMSE, it appears quite high. However, it is important to note that the sale prices of our houses range form $15,000 to $9,000,000, so to predict our house price wrong by approximately $370,000 isn’t terrible. EDIT: As stated above, while we have looked at the RMSE after cross-validation, our MAE is still a more meaningful measure for us.

Conclusion

After performing our analysis, we were able to gain an insight into variables that help explain, as well as predict house sale prices. Generally, as the gross area of the house, as well as the land area increases, the sale price tends to increase as well. This does vary by borough, and our model does a reasonable job exploring this.

With this analysis, we hope to create a model that can be useful for potential homebuyers in predicting house prices, particularly those looking to move into a home while still being considered within New York City. Our model could potentially be improved by including other variables - macroeconomic indicators like employment rates, interest rates and wage growth, number of bedrooms and bathrooms, proximity to shops and popular attractions, quality of schools and many more-, however by using the data we had available from NYC Department of Finance we were able to develop an adequate model that predicts sale prices, and minimises errors as best as possible.

About

Predicting House Sale Prices in NYC

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages