This project provides an end-to-end machine learning workflow to predict house sale prices using a dataset from King County, WA, spanning May 2014 to May 2015. The analysis moves from initial data exploration to feature engineering, model comparison, and hyperparameter tuning to identify the most accurate prediction model and the key drivers of house value.
- Source: https://www.kaggle.com/datasets/minasameh55/king-country-houses-aa on Kaggle.
- Content: The dataset contains 20 house features plus the sale price, across 21,613 observations.
This analysis seeks to answer several key questions about house pricing in King County:
- What are the primary drivers of house prices in King County?
- How much can we improve model accuracy through feature engineering?
- Which machine learning model provides the most accurate predictions for this dataset?
The dataset contains 21,613 records of house sales in King County, WA. Each record includes the sale price plus 20 features, such as square footage, number of bedrooms/bathrooms, location coordinates, and construction grade.
Target Variable: price
Key Features: sqft_living, grade, lat, long, zipcode, yr_built.
The project is structured into a clear, multi-step process:
Exploratory Data Analysis (EDA): We began by visualizing the data to understand its distribution and identify key relationships. A major finding was the significant right-skew of the price variable and strong correlations between price and features like sqft_living and grade.
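A minimal EDA sketch along these lines (the CSV filename matches the setup section below; the plot details are illustrative, not the notebook's exact figures):

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("king_country_houses_aa.csv")

# Price is heavily right-skewed: a small number of expensive homes
# stretch the upper tail.
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.histplot(df["price"], bins=50, ax=axes[0])
axes[0].set_title("Distribution of price")

# Correlation of each numeric feature with price, sorted descending;
# sqft_living and grade sit near the top.
corr = df.corr(numeric_only=True)["price"].drop("price").sort_values(ascending=False)
sns.barplot(x=corr.values, y=corr.index, ax=axes[1])
axes[1].set_title("Correlation with price")
plt.tight_layout()
plt.show()
```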
Baseline Modeling: To establish a performance benchmark, we first trained several models (Linear Regression, XGBoost, etc.) on the raw, unprocessed data. This gave us an initial R² score of ~0.80.
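As a sketch of this step (column names follow the Kaggle dataset; the train/test split settings here are illustrative assumptions, not necessarily the notebook's):

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

# Raw numeric features only, no engineering yet; drop the identifier
# and the unparsed date string.
X = df.drop(columns=["price", "id", "date"])
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

for name, model in [("Linear Regression", LinearRegression()),
                    ("XGBoost", XGBRegressor(random_state=42))]:
    model.fit(X_train, y_train)
    print(f"{name}: R² = {r2_score(y_test, model.predict(X_test)):.3f}")
```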
Feature Engineering: This was the most critical phase. We transformed the raw data to create more meaningful features for our models (see the sketch after this list):
Temporal Features: Created age and years_since_renovation from timestamp columns.
Categorical Features: Applied one-hot encoding to the zipcode column to treat each location as a distinct category.
Target Transformation: Applied a log transform (np.log1p) to the price variable to normalize its distribution, which significantly improved the performance of many models.
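A condensed sketch of all three transformations (assuming df from the EDA step; treating never-renovated houses as "renovated at build time" is one reasonable convention, not necessarily the notebook's exact choice):

```python
import numpy as np
import pandas as pd

fe = df.copy()

# Temporal features: derive house age and years since renovation
# from the sale date and the build/renovation years.
fe["date"] = pd.to_datetime(fe["date"])
sale_year = fe["date"].dt.year
fe["age"] = sale_year - fe["yr_built"]
fe["years_since_renovation"] = np.where(
    fe["yr_renovated"] > 0,
    sale_year - fe["yr_renovated"],
    fe["age"],  # never renovated: fall back to the house's age
)

# Categorical features: one-hot encode zipcode so each location
# becomes its own indicator column.
fe = pd.get_dummies(fe, columns=["zipcode"], prefix="zip")

# Target transformation: log1p tames the right-skewed price.
# Invert predictions with np.expm1 to get back to dollar values.
fe["log_price"] = np.log1p(fe["price"])
```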
Advanced Model Comparison: We re-ran a comprehensive suite of regression models (including Random Forest, Gradient Boosting, and XGBoost) on the newly engineered data to compare their performance.
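In sketch form, the comparison loop might look like this (using fe and log_price from the feature-engineering step; all hyperparameters left at their defaults):

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

X = fe.drop(columns=["price", "log_price", "id", "date"])
y = fe["log_price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Random Forest": RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    "XGBoost": XGBRegressor(random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: R² = {r2_score(y_test, model.predict(X_test)):.3f}")
```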
Hyperparameter Tuning: We took the top-performing model, XGBoost, and used GridSearchCV to fine-tune its internal parameters, further optimizing its predictive accuracy.
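A sketch of the tuning step (the grid below is illustrative; the notebook's actual search space may differ):

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

param_grid = {
    "n_estimators": [200, 500],
    "max_depth": [4, 6, 8],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(
    XGBRegressor(random_state=42),
    param_grid,
    scoring="r2",
    cv=5,
    n_jobs=-1,  # use all cores; 12 parameter combinations x 5 folds
)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
print("Best CV R²:", round(search.best_score_, 3))
```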
Feature Importance Analysis: Finally, we inspected our best model to identify which features it found most influential in determining house prices.
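Reading the importances off a tuned tree-based regressor is straightforward (search from the tuning sketch above):

```python
import pandas as pd

best_model = search.best_estimator_
importances = pd.Series(
    best_model.feature_importances_, index=X_train.columns
)
# The largest values correspond to the drivers listed in the findings.
print(importances.sort_values(ascending=False).head(10))
```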
Feature Engineering is Crucial: The single most impactful step was feature engineering. It improved the R² score from a baseline of ~0.80 to ~0.87, demonstrating its critical importance.
Best Model: XGBoost Regressor was the clear winner, consistently outperforming all other models both before and after hyperparameter tuning.
Most Important Features: The final model revealed that the most significant drivers of house prices are:
- sqft_living (overall size)
- grade (construction quality)
- lat & long (geographic location)
- age (age of the house)
To replicate this analysis, please follow these steps:
Clone the repository:
```bash
git clone [your-repo-link]
```
Install dependencies: Ensure you have the necessary Python libraries installed.
```bash
pip install pandas numpy scikit-learn xgboost seaborn matplotlib jupyter
```
Launch Jupyter Notebook:
```bash
jupyter notebook
```
Run the main.ipynb notebook: Open the notebook and run the cells sequentially to see the full analysis. The dataset king_country_houses_aa.csv must be in the same directory.