This project provides an end-to-end machine learning workflow to predict house sale prices using a dataset from King County.

Anirudh-Unni/KingsCounty-HousePrices

Predicting House Prices in King County, WA

Project Overview

This project provides an end-to-end machine learning workflow to predict house sale prices using a dataset from King County, WA, spanning May 2014 to May 2015. The analysis moves from initial data exploration to feature engineering, model comparison, and hyperparameter tuning to identify the most accurate prediction model and the key drivers of house value.

Key Questions Investigated

This analysis seeks to answer several key questions about house pricing in King County:

  1. What are the primary drivers of house prices in King County?

  2. How much can we improve model accuracy through feature engineering?

  3. Which machine learning model provides the most accurate predictions for this dataset?

Dataset

The dataset contains 21,613 records of house sales in King County, WA. It includes 21 features for each house, such as square footage, number of bedrooms/bathrooms, location coordinates, and grade.

Target Variable: price

Key Features: sqft_living, grade, lat, long, zipcode, yr_built.

Analytical Workflow

The project is structured into a clear, multi-step process:

Exploratory Data Analysis (EDA): We began by visualizing the data to understand its distribution and identify key relationships. A major finding was the significant right-skew of the price variable and strong correlations between price and features like sqft_living and grade.
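The skew and correlation checks from the EDA step can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's actual code: the generated values, the random seed, and the variable names are assumptions, and only the column names price, sqft_living, and grade come from the dataset description.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (assumption: NOT the real dataset).
rng = np.random.default_rng(0)
n = 1000
sqft_living = rng.normal(2000, 600, n).clip(400)
grade = rng.integers(3, 13, n)
# Multiplicative lognormal noise gives the price a long right tail.
price = sqft_living * 150 * np.exp(0.1 * (grade - 7)) * rng.lognormal(0, 0.4, n)
df = pd.DataFrame({"price": price, "sqft_living": sqft_living, "grade": grade})

# A positive skewness value indicates a right-skewed distribution.
price_skew = df["price"].skew()
log_price_skew = np.log1p(df["price"]).skew()

# Pearson correlation of each feature with the target, strongest first.
corr_with_price = df.corr()["price"].drop("price").sort_values(ascending=False)

print(f"skew(price) = {price_skew:.2f}, skew(log1p(price)) = {log_price_skew:.2f}")
print(corr_with_price)
```

On the real data, the same two calls (`Series.skew` and `DataFrame.corr`) would be applied to the loaded DataFrame instead of the synthetic one.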

Baseline Modeling: To establish a performance benchmark, we first trained several models (Linear Regression, XGBoost, etc.) on the raw, unprocessed data. This gave us an initial R² score of ~0.80.
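A baseline of this kind might look like the sketch below. It uses synthetic data and a plain Linear Regression only, so the printed score is not the ~0.80 reported above; the data-generating coefficients, the split ratio, and the random seeds are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in features (assumption: not the real King County data).
rng = np.random.default_rng(42)
n = 2000
X = np.column_stack([
    rng.normal(2000, 600, n),   # stand-in for sqft_living
    rng.integers(3, 13, n),     # stand-in for grade
])
y = 150 * X[:, 0] + 40_000 * X[:, 1] + rng.normal(0, 80_000, n)

# Hold out 20% of the rows so the score reflects unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the raw, untransformed features and target to get a benchmark.
baseline = LinearRegression().fit(X_train, y_train)
baseline_r2 = r2_score(y_test, baseline.predict(X_test))
print(f"Baseline R^2 on held-out data: {baseline_r2:.3f}")
```

Any later improvement from feature engineering or tuning can then be measured against `baseline_r2` on the same held-out split.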

Feature Engineering: This was the most critical phase. We transformed the raw data to create more meaningful features for our models:

Temporal Features: Derived age and years_since_renovation from the date-related columns (such as yr_built).

Categorical Features: Applied one-hot encoding to the zipcode column to treat each location as a distinct category.

Target Transformation: Applied a log transform (np.log1p) to the price variable to reduce its right skew, which significantly improved the performance of many models. Predictions made on the log scale can be converted back to prices with np.expm1.
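The three transformations above could be combined roughly as in this sketch. Several details are assumptions rather than the notebook's actual logic: the date and yr_renovated column names, the convention that yr_renovated equal to 0 means "never renovated", the hypothetical helper name engineer_features, and the three hand-made example rows.

```python
import numpy as np
import pandas as pd

# Tiny hand-made frame. Column names other than price, zipcode and yr_built
# are assumptions about the raw file, and the rows are made up.
raw = pd.DataFrame({
    "date": ["2014-05-02", "2014-12-09", "2015-02-25"],
    "price": [221900.0, 538000.0, 180000.0],
    "yr_built": [1955, 1951, 1933],
    "yr_renovated": [0, 1991, 0],   # assumed: 0 means "never renovated"
    "zipcode": [98178, 98125, 98028],
})

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: temporal features, one-hot zipcode, log target."""
    out = df.copy()
    sale_year = pd.to_datetime(out["date"]).dt.year

    # Temporal features: age at sale, and years since the last major work.
    out["age"] = sale_year - out["yr_built"]
    last_work = out["yr_renovated"].where(out["yr_renovated"] > 0, out["yr_built"])
    out["years_since_renovation"] = sale_year - last_work

    # Categorical feature: one indicator column per zipcode.
    out = pd.get_dummies(out, columns=["zipcode"], prefix="zip")

    # Target transformation: log1p compresses the long right tail of price.
    out["log_price"] = np.log1p(out["price"])
    return out.drop(columns=["date", "price"])

features = engineer_features(raw)
print(features.columns.tolist())
```

Because the model is then trained on log_price, its predictions need `np.expm1` applied before they are read as dollar amounts.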

Advanced Model Comparison: We re-ran a comprehensive suite of regression models (including Random Forest, Gradient Boosting, and XGBoost) on the newly engineered data to compare their performance.
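One way to run such a comparison is to score every candidate with the same cross-validation folds, as in the sketch below. The data is synthetic, the fold count and model settings are assumptions, and XGBoost is left out only to keep the example dependent on scikit-learn alone; an `xgboost.XGBRegressor` instance could be added to the dictionary in the same way.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic, mildly nonlinear regression problem (assumption: not the real data).
rng = np.random.default_rng(0)
n = 600
X = rng.uniform(0, 1, size=(n, 4))
y = 3 * X[:, 0] + np.sin(6 * X[:, 1]) + X[:, 2] * X[:, 3] + rng.normal(0, 0.1, n)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}

# Mean R^2 over 3 folds for each model, using identical data and scoring.
scores = {
    name: cross_val_score(model, X, y, cv=3, scoring="r2").mean()
    for name, model in models.items()
}

# Report the candidates from best to worst.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} R^2 = {score:.3f}")
```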

Hyperparameter Tuning: We took the top-performing model, XGBoost, and used GridSearchCV to fine-tune its internal parameters, further optimizing its predictive accuracy.
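A grid search of this kind can be sketched as below. To stay runnable without the xgboost package, the example tunes scikit-learn's GradientBoostingRegressor as a stand-in, not the XGBoost model used in the project; the parameter grid, fold count, and synthetic data are likewise assumptions. `XGBRegressor` follows the scikit-learn estimator interface, so it could be passed as the estimator instead.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (assumption: not the real King County data).
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(400, 3))
y = 2 * X[:, 0] + np.sin(5 * X[:, 1]) + rng.normal(0, 0.1, 400)

# Illustrative grid: 2 x 2 x 2 = 8 parameter combinations.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

# Every combination is scored with 3-fold cross-validated R^2.
search = GridSearchCV(
    estimator=GradientBoostingRegressor(random_state=0),
    param_grid=param_grid,
    scoring="r2",
    cv=3,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated R^2: {search.best_score_:.3f}")
```

After fitting, `search.best_estimator_` holds a model refit on all of the supplied data with the winning parameters.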

Feature Importance Analysis: Finally, we inspected our best model to identify which features it found most influential in determining house prices.
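For tree-based models, this inspection usually reads the fitted model's `feature_importances_` attribute. The sketch below shows the idea with a Random Forest on synthetic data; the model choice, the data, and the deliberately irrelevant noise column are assumptions for illustration, and the values are impurity-based importances, which is only one of several ways to rank features.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with two informative columns and one irrelevant column.
rng = np.random.default_rng(2)
n = 800
X = pd.DataFrame({
    "sqft_living": rng.normal(2000, 600, n),
    "grade": rng.integers(3, 13, n),
    "noise": rng.normal(0, 1, n),   # unrelated to the target by construction
})
y = 200 * X["sqft_living"] + 20_000 * X["grade"] + rng.normal(0, 20_000, n)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; pair them with the column names.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```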

Key Findings & Results

Feature Engineering is Crucial: Feature engineering was the single most impactful step, raising the R² score from a baseline of ~0.80 to ~0.87.

Best Model: XGBoost Regressor was the clear winner, consistently outperforming all other models both before and after hyperparameter tuning.

Most Important Features: The final model revealed that the most significant drivers of house prices are:

sqft_living (Overall size)

grade (Construction quality)

lat & long (Geographic location)

age (The age of the house)

How to Run This Project

To replicate this analysis, please follow these steps:

Clone the repository:

git clone [your-repo-link]

Install dependencies: Ensure you have the necessary Python libraries installed.

pip install pandas numpy scikit-learn xgboost seaborn matplotlib jupyter

Launch Jupyter Notebook:

jupyter notebook

Run the main.ipynb notebook: Open the notebook and run the cells sequentially to see the full analysis. The dataset file king_ country_ houses_aa.csv must be in the same directory as the notebook.
