This project provides an end-to-end machine learning workflow to predict house sale prices using a dataset from King County, WA, spanning May 2014 to May 2015. The analysis moves from initial data exploration to feature engineering, model comparison, and hyperparameter tuning to identify the most accurate prediction model and the key drivers of house value.
- Source: https://www.kaggle.com/datasets/minasameh55/king-country-houses-aa on Kaggle.
- Content: The dataset contains 20 house features plus the sale price, across 21,613 observations.
This analysis seeks to answer several key questions about house pricing in King County:
- What are the primary drivers of house prices in King County?
- How much can we improve model accuracy through feature engineering?
- Which machine learning model provides the most accurate predictions for this dataset?
The dataset contains 21,613 records of house sales in King County, WA. Each record includes the sale price plus 20 features, such as square footage, number of bedrooms/bathrooms, location coordinates, and construction grade.
Target Variable: price
Key Features: sqft_living, grade, lat, long, zipcode, yr_built.
The project is structured into a clear, multi-step process:
Exploratory Data Analysis (EDA): We began by visualizing the data to understand its distribution and identify key relationships. A major finding was the significant right-skew of the price variable and strong correlations between price and features like sqft_living and grade.
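A minimal EDA sketch along these lines (the CSV filename matches the setup section below; the plot details are illustrative, not the notebook's exact figures):

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("king_country_houses_aa.csv")

# Price is heavily right-skewed: a small number of expensive homes
# stretch the upper tail.
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.histplot(df["price"], bins=50, ax=axes[0])
axes[0].set_title("Distribution of price")

# Correlation of each numeric feature with price, sorted descending;
# sqft_living and grade sit near the top.
corr = df.corr(numeric_only=True)["price"].drop("price").sort_values(ascending=False)
sns.barplot(x=corr.values, y=corr.index, ax=axes[1])
axes[1].set_title("Correlation with price")
plt.tight_layout()
plt.show()
```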
Baseline Modeling: To establish a performance benchmark, we first trained several models (Linear Regression, XGBoost, etc.) on the raw, unprocessed data. This gave us an initial R² score of ~0.80.
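As a sketch of this step (column names follow the Kaggle dataset; the train/test split settings here are illustrative assumptions, not necessarily the notebook's):

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

# Raw numeric features only, no engineering yet; drop the identifier
# and the unparsed date string.
X = df.drop(columns=["price", "id", "date"])
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

for name, model in [("Linear Regression", LinearRegression()),
                    ("XGBoost", XGBRegressor(random_state=42))]:
    model.fit(X_train, y_train)
    print(f"{name}: R² = {r2_score(y_test, model.predict(X_test)):.3f}")
```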
Feature Engineering: This was the most critical phase. We transformed the raw data to create more meaningful features for our models (see the sketch after this list):
Temporal Features: Created age and years_since_renovation from timestamp columns.
Categorical Features: Applied one-hot encoding to the zipcode column to treat each location as a distinct category.
Target Transformation: Applied a log transform (np.log1p) to the price variable to normalize its distribution, which significantly improved the performance of many models.
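A condensed sketch of all three transformations (assuming df from the EDA step; treating never-renovated houses as "renovated at build time" is one reasonable convention, not necessarily the notebook's exact choice):

```python
import numpy as np
import pandas as pd

fe = df.copy()

# Temporal features: derive house age and years since renovation
# from the sale date and the build/renovation years.
fe["date"] = pd.to_datetime(fe["date"])
sale_year = fe["date"].dt.year
fe["age"] = sale_year - fe["yr_built"]
fe["years_since_renovation"] = np.where(
    fe["yr_renovated"] > 0,
    sale_year - fe["yr_renovated"],
    fe["age"],  # never renovated: fall back to the house's age
)

# Categorical features: one-hot encode zipcode so each location
# becomes its own indicator column.
fe = pd.get_dummies(fe, columns=["zipcode"], prefix="zip")

# Target transformation: log1p tames the right-skewed price.
# Invert predictions with np.expm1 to get back to dollar values.
fe["log_price"] = np.log1p(fe["price"])
```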
Advanced Model Comparison: We re-ran a comprehensive suite of regression models (including Random Forest, Gradient Boosting, and XGBoost) on the newly engineered data to compare their performance.
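In sketch form, the comparison loop might look like this (using fe and log_price from the feature-engineering step; all hyperparameters left at their defaults):

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

X = fe.drop(columns=["price", "log_price", "id", "date"])
y = fe["log_price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Random Forest": RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    "XGBoost": XGBRegressor(random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: R² = {r2_score(y_test, model.predict(X_test)):.3f}")
```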
Hyperparameter Tuning: We took the top-performing model, XGBoost, and used GridSearchCV to fine-tune its internal parameters, further optimizing its predictive accuracy.
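A sketch of the tuning step (the grid below is illustrative; the notebook's actual search space may differ):

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

param_grid = {
    "n_estimators": [200, 500],
    "max_depth": [4, 6, 8],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(
    XGBRegressor(random_state=42),
    param_grid,
    scoring="r2",
    cv=5,
    n_jobs=-1,  # use all cores; 12 parameter combinations x 5 folds
)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
print("Best CV R²:", round(search.best_score_, 3))
```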
Feature Importance Analysis: Finally, we inspected our best model to identify which features it found most influential in determining house prices.
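Reading the importances off a tuned tree-based regressor is straightforward (search from the tuning sketch above):

```python
import pandas as pd

best_model = search.best_estimator_
importances = pd.Series(
    best_model.feature_importances_, index=X_train.columns
)
# The largest values correspond to the drivers listed in the findings.
print(importances.sort_values(ascending=False).head(10))
```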
Feature Engineering is Crucial: The single most impactful step was feature engineering. It improved the R² score from a baseline of ~0.80 to ~0.87, demonstrating its critical importance.
Best Model: XGBoost Regressor was the clear winner, consistently outperforming all other models both before and after hyperparameter tuning.
Most Important Features: The final model revealed that the most significant drivers of house prices are:
- sqft_living (overall size)
- grade (construction quality)
- lat & long (geographic location)
- age (age of the house)
To replicate this analysis, please follow these steps:
Clone the repository:
```bash
git clone [your-repo-link]
```
Install dependencies: Ensure you have the necessary Python libraries installed.
```bash
pip install pandas numpy scikit-learn xgboost seaborn matplotlib jupyter
```
Launch Jupyter Notebook:
```bash
jupyter notebook
```
Run the main.ipynb notebook: Open the notebook and run the cells sequentially to see the full analysis. The dataset king_country_houses_aa.csv must be in the same directory.