Skip to content

gaoqian310/Housing_Analysis_with_Machine_Learning

Repository files navigation

Housing_Analysis_with_Machine_Learning

Data Science Project build through python

1. Defining the Project

2. Preparing the data

3. Exploring and visualizing the data

4. Creating a machine learning model

5. Presenting your findings

# Importing libraries
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Loading dataset
from sklearn.datasets import load_boston
housing = load_boston()
print(housing.keys())
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
print (housing.DESCR)
.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**

    :Number of Instances: 506

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.

.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
# Creating dataframe with features
housing_df = pd.DataFrame(housing.data, columns = housing.feature_names)
# Adding target variable to the dataset
housing_df['MEDV'] = housing.target
housing_df.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2
housing_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    float64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    float64
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  MEDV     506 non-null    float64
dtypes: float64(14)
memory usage: 55.5 KB
housing_df.describe()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
count 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000
mean 3.613524 11.363636 11.136779 0.069170 0.554695 6.284634 68.574901 3.795043 9.549407 408.237154 18.455534 356.674032 12.653063 22.532806
std 8.601545 23.322453 6.860353 0.253994 0.115878 0.702617 28.148861 2.105710 8.707259 168.537116 2.164946 91.294864 7.141062 9.197104
min 0.006320 0.000000 0.460000 0.000000 0.385000 3.561000 2.900000 1.129600 1.000000 187.000000 12.600000 0.320000 1.730000 5.000000
25% 0.082045 0.000000 5.190000 0.000000 0.449000 5.885500 45.025000 2.100175 4.000000 279.000000 17.400000 375.377500 6.950000 17.025000
50% 0.256510 0.000000 9.690000 0.000000 0.538000 6.208500 77.500000 3.207450 5.000000 330.000000 19.050000 391.440000 11.360000 21.200000
75% 3.677083 12.500000 18.100000 0.000000 0.624000 6.623500 94.075000 5.188425 24.000000 666.000000 20.200000 396.225000 16.955000 25.000000
max 88.976200 100.000000 27.740000 1.000000 0.871000 8.780000 100.000000 12.126500 24.000000 711.000000 22.000000 396.900000 37.970000 50.000000

What's the price distribution of the housing?

sns.set_theme(style="darkgrid")
plt.figure (figsize=(10,6))
sns.distplot(housing_df['MEDV'], axlabel = 'Median value of owner-occupied homes in $1000')
/Users/gaoqian/opt/anaconda3/lib/python3.8/site-packages/seaborn/distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)





<AxesSubplot:xlabel='Median value of owner-occupied homes in $1000', ylabel='Density'>

png

FutureWarning: distplot is a deprecated function and will be removed in a future version. Please adapt your code to use either displot (a figure-level function with similar flexibility) or histplot (an axes-level function for histograms).

Reproduce distplot with kdeplot and histplot

sns.set_theme(style="whitegrid")
plt.figure (figsize=(15,8))

ax = sns.kdeplot(data=housing_df['MEDV'])
ax.set(ylabel='density', xlabel='Median value of owner-occupied homes in $1000')

ax2 = ax.twinx()
ax2 = sns.histplot(housing_df['MEDV'], ax=ax2)
ax2.set_ylabel('count')

plt.tight_layout()
plt.show()

png

Correlation matrix using heatmap

# Correlation matrix
housing_corr = housing_df.corr()
plt.figure (figsize=(20,12))
sns.heatmap(housing_corr,annot = True, vmin= -1 , cmap = 'YlGnBu')
plt.show()

png

Conclusion

Strong negative correlation (-0.74) with % lower status of the population (LSTAT)

Strong positive correlation (0.7) with average number of rooms per dwelling (RM)

# Jointplots for high correlations - lower status population
plt.figure (figsize=(10,10))
sns.jointplot(x = 'LSTAT', y = 'MEDV', data = housing_df, kind = 'reg', height = 10, color = 'green')
plt.show()
<Figure size 720x720 with 0 Axes>

png

Conclusion

House pricing is negitave corelated to the lower status ofthe population. likely dependent on the lower status of the population.

  • LSTAT % lower status of the population
  • MEDV Median value of owner-occupied homes in $1000's
# Jointplots for high correlations - number of rooms
plt.figure (figsize=(10,10))
sns.jointplot(x = 'RM', y = 'MEDV', data = housing_df, kind = 'hex', color = 'teal', height = 10)
plt.show()
<Figure size 720x720 with 0 Axes>

png

Conclusion

House pricing is postive corelated to the number of rooms per dwelling.

Lots of houses with 6 rooms are at a price around $20k.

  • RM average number of rooms per dwelling
  • MEDV Median value of owner-occupied homes in $1000's

Creating a Machine Learning Model

  • Prepare the dataset
  • Split the dataset into training (75%) and test (25%) sets
# Preparing the dataset
X = housing_df.drop(['MEDV'], axis = 1)
Y = housing_df['MEDV']
# Splitting into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25, random_state=100)

First Model: Linear Regression

# Training the Linear Regression model
from sklearn.linear_model import LinearRegression

lm = LinearRegression()
lm.fit(X_train, Y_train)
LinearRegression()
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
LinearRegression(n_jobs=1)
Using Root-mean-square error (RMSE) and R squared (r2_score) to evaluate the model
# Evaluating the Linear Regression model for the test set
from sklearn.metrics import mean_squared_error, r2_score
predictions = lm.predict(X_test)
RMSE_lm = np.sqrt(mean_squared_error(Y_test, predictions))
r2_lm = r2_score(Y_test, predictions)

print('RMSE_lm = {}'.format(RMSE_lm))
print('R2_lm = {}'.format(r2_lm))
RMSE_lm = 5.212786603443838
R2_lm = 0.7246154314616744
Conclusion

This model gives us an RMSE of about 5.2.

R squared value of 0.72 means that this linear model explains 72% of the total response variable variation.

Second Model: Random Forest

# Training the Random Forest model
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators = 10, random_state = 100)
rf.fit(X_train, Y_train)
RandomForestRegressor(n_estimators=10, random_state=100)
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
oob_score=False, random_state=100, verbose=0, warm_start=False)
RandomForestRegressor(n_estimators=10, n_jobs=1, random_state=100)
# Evaluating the Random Forest model for the test set
predictions_rf = rf.predict(X_test)
RMSE_rf = np.sqrt(mean_squared_error(Y_test, predictions_rf))
r2_rf = r2_score(Y_test, predictions_rf)

print('RMSE_rf = {}'.format(RMSE_rf))
print('R2_rf = {}'.format(r2_rf))
RMSE_rf = 3.5880185569969294
R2_rf = 0.8695304788426546

Conclusion

This random forest model is much better one, since teh error is lower, RMSE = 3.6, and the share of explained variation is significantly higher, R2 = 0.87.

Credit

Developing Data Science Projects in Python, find more

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published