
Property-Value-Maximizer

Welcome to the Property-Value-Maximizer project! This initiative aims to apply Machine Learning and regression algorithms to accurately predict house prices in Ames, Iowa. Our client has inherited four properties and seeks to maximize their market value before selling. By analyzing key housing features and building a powerful predictive model, we strive to provide data-driven insights that lead to optimal pricing strategies.

[Screenshot: responsive view of the app across multiple devices]

The project is accessible at the following URL: https://property-value-maximizer.onrender.com

Table of Contents

  • Dataset Content
  • Terminology
  • Business Requirements
  • Agile Methodology
  • Hypotheses and How to Validate Them
  • Rationale to map the business requirements to the Data Visualisations and ML tasks
  • ML Business Case
  • Cross-industry standard process for data mining
  • Data Preprocessing
  • Feature Engineering
  • Dashboard Design
  • Plots
  • Bugs and Fixes
  • Project Testing
  • PEP 8
  • Deployment
  • Technologies
  • Python Packages
  • Credits
  • Acknowledgements

Dataset Content

  • The dataset is sourced from Kaggle. We then created a fictitious user story where predictive analytics can be applied in a real project in the workplace.
  • The dataset contains nearly 1,500 rows of housing records from Ames, Iowa. Each record describes a house profile (Floor Area, Basement, Garage, Kitchen, Lot, Porch, Wood Deck, Year Built) and its respective sale price, for houses built between 1872 and 2010.
| Variable | Meaning | Units / Values |
|----------|---------|----------------|
| 1stFlrSF | First floor square feet | 334 - 4692 |
| 2ndFlrSF | Second floor square feet | 0 - 2065 |
| BedroomAbvGr | Bedrooms above grade (does NOT include basement bedrooms) | 0 - 8 |
| BsmtExposure | Refers to walkout or garden level walls | Gd: Good Exposure; Av: Average Exposure; Mn: Minimum Exposure; No: No Exposure; None: No Basement |
| BsmtFinType1 | Rating of basement finished area | GLQ: Good Living Quarters; ALQ: Average Living Quarters; BLQ: Below Average Living Quarters; Rec: Average Rec Room; LwQ: Low Quality; Unf: Unfinished; None: No Basement |
| BsmtFinSF1 | Type 1 finished square feet | 0 - 5644 |
| BsmtUnfSF | Unfinished square feet of basement area | 0 - 2336 |
| TotalBsmtSF | Total square feet of basement area | 0 - 6110 |
| GarageArea | Size of garage in square feet | 0 - 1418 |
| GarageFinish | Interior finish of the garage | Fin: Finished; RFn: Rough Finished; Unf: Unfinished; None: No Garage |
| GarageYrBlt | Year garage was built | 1900 - 2010 |
| GrLivArea | Above grade (ground) living area square feet | 334 - 5642 |
| KitchenQual | Kitchen quality | Ex: Excellent; Gd: Good; TA: Typical/Average; Fa: Fair; Po: Poor |
| LotArea | Lot size in square feet | 1300 - 215245 |
| LotFrontage | Linear feet of street connected to property | 21 - 313 |
| MasVnrArea | Masonry veneer area in square feet | 0 - 1600 |
| EnclosedPorch | Enclosed porch area in square feet | 0 - 286 |
| OpenPorchSF | Open porch area in square feet | 0 - 547 |
| OverallCond | Rates the overall condition of the house | 10: Very Excellent; 9: Excellent; 8: Very Good; 7: Good; 6: Above Average; 5: Average; 4: Below Average; 3: Fair; 2: Poor; 1: Very Poor |
| OverallQual | Rates the overall material and finish of the house | 10: Very Excellent; 9: Excellent; 8: Very Good; 7: Good; 6: Above Average; 5: Average; 4: Below Average; 3: Fair; 2: Poor; 1: Very Poor |
| WoodDeckSF | Wood deck area in square feet | 0 - 736 |
| YearBuilt | Original construction date | 1872 - 2010 |
| YearRemodAdd | Remodel date (same as construction date if no remodelling or additions) | 1950 - 2010 |
| SalePrice | Sale price | 34900 - 755000 |
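
As a quick illustration, the dataset can be fetched and inspected as follows. This is a minimal sketch: it assumes the Kaggle CLI is authenticated via a kaggle.json credentials file, and the dataset slug and CSV filename are placeholders to substitute with the actual ones.

```python
import subprocess

import pandas as pd

# Download and unzip the dataset via the Kaggle CLI (requires kaggle.json
# credentials). The slug below is a placeholder for the actual dataset path.
subprocess.run(
    ["kaggle", "datasets", "download",
     "-d", "codeinstitute/housing-prices-data",
     "-p", "inputs/datasets", "--unzip"],
    check=True,
)

# Load the records; the CSV filename is an assumption.
df = pd.read_csv("inputs/datasets/house_prices_records.csv")
print(df.shape)                             # expect nearly 1,500 rows
print(df["SalePrice"].agg(["min", "max"]))  # 34,900 - 755,000 per the table above
```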

Terminology

Sale Price is the current market value of a house, based on its characteristics and features.

Inherited House is a property that the client has inherited and requires an assessment of its market value.

Summed Price is the total of the predicted market prices for all four houses inherited by the client.

Business Requirements

Our client has inherited four properties from her late great-grandfather, located in Ames, Iowa, USA. While she has a strong understanding of property prices in her home country, she is concerned that relying on her existing knowledge of the Iowan market may result in inaccurate appraisals. Factors that make a house desirable and valuable in her country may differ from those in Ames, Iowa.

The client has provided a public dataset containing house prices for the Ames area and has requested our assistance in maximizing the sale price for her inherited properties. Our goal is to predict the sale price of these four homes based on their respective attributes.

The business requirements are as follows:

  • BR1 - The client wants to understand how various house attributes correlate with the sale price in Ames, Iowa. She expects data visualizations that illustrate the relationships between these variables and the sale price.

  • BR2 - The client is looking to predict the sale price for her four inherited houses, as well as for any other property in Ames, Iowa.

To meet these business requirements, Epics and User Stories have been defined. These user stories have been further broken down into manageable tasks, allowing for an agile approach to implementation.

Agile Methodology

Epics

  • Data Collection and Information Gathering Epic
  • Data Visualization, Cleaning, and Preparation Epic
  • Model Training, Optimization, and Validation Epic
  • Dashboard Planning, Design, and Development Epic
  • Dashboard Deployment and Release Epic

User Stories

  • Data Collection and Information Gathering Epic

    • User Story 1.1: Install Required Dependencies and Packages - Business Requirement 1 & 2

      • As a developer, I need to install all required dependencies and packages so that I can effectively utilize the necessary tools for project implementation.
        • Acceptance Criteria:
          • All dependencies are installed successfully without errors.
        • Tasks:
          • Install all required dependencies using the PIP package manager.
        • How they were completed:
          • The command pip install -r requirements.txt was run in the IDE terminal.
    • User Story 1.2: Import Relevant Data into Jupyter Notebook - Business Requirement 1 & 2

      • As a developer, I need to import relevant data into a Jupyter Notebook so that I can conduct a thorough analysis of the dataset.
        • Acceptance Criteria:
          • The housing dataset is successfully downloaded from Kaggle.
          • The dataset is in CSV format and correctly read using Pandas.
        • Tasks:
          • Download the housing dataset from Kaggle using the Kaggle API.
          • Read the CSV files into DataFrames using Pandas.
        • How they were completed:
          • The kaggle.json credentials file was downloaded from kaggle.com and used to authenticate with the API to download the housing dataset.
          • The pd.read_csv() function was used to read the CSV files into a DataFrame.
  • Data Visualization, Cleaning, and Preparation Epic

    • User Story 2.1: Data Cleaning and Quality Assurance - Business Requirement 1
      • As a developer, I want to implement a robust data cleaning process so that I can ensure the dataset is accurate, reliable, and of high quality.
        • Acceptance Criteria:
          • All missing or null values in the dataset must be identified.
          • A data profile report must be generated.
          • Visualizations should demonstrate the effect of cleaning.
          • Missing values are imputed.
        • Tasks:
          • Inspect the dataset to identify missing or null values.
          • A complete data profile report is generated.
          • Create visualizations (bar charts, box plots, histograms).
          • Apply imputation to missing values.
        • How they were completed:
          • The expressions df.isnull() and df.isna() return Boolean masks identifying missing values.
          • The expression ProfileReport(df=df, minimal=True) generates an automated exploratory data analysis (EDA) report.
          • A custom function called DataCleaningEffect() visualizes the effects of cleaning.
          • Instances of MeanMedianImputer(imputation_method='mean') and MeanMedianImputer(imputation_method='median') were created, and their fit_transform() method was applied.
  • Model Training, Optimization, and Validation Epic

    • User Story 3.1: Model Performance Evaluation - Business Requirement 2

      • As a developer, I want to evaluate the performance of the predictive model so that I can ensure the reliability and accuracy of its predictions.
        • Acceptance Criteria:
          • The predictive model must be evaluated to ensure reliability and accuracy of its predictions.
        • Tasks:
          • Evaluate the predictive model to ensure reliability and accuracy of its predictions.
        • How they were completed:
          • An R² score of at least 0.75 was measured on both the train set and the test set.
    • User Story 3.2: Individual Prediction Testing - Business Requirement 2

      • As a developer, I want to test individual data points against the model’s predictions so that I can determine the target variable based on my provided features.
        • Acceptance Criteria:
          • Individual data points must be tested against the model's predictions to determine the target variable.
        • Tasks:
          • Test individual data points against the model's predictions to determine the target variable.
        • How they were completed:
          • Plots were generated that measure Actual vs Prediction for both the train and test sets (a sketch appears after this list).
  • Dashboard Planning, Design, and Development Epic

    • User Story 4.1: Streamlit Landing Page Access - Business Requirement 1 & 2

      • As a client, I want to access the Streamlit landing page so that I can quickly gain an overview of the project.
        • Acceptance Criteria:
          • The client should be able to quickly gain an overview of the project through the Streamlit landing page.
        • Tasks:
          • Create a Streamlit landing page that allows the client to quickly gain an overview of the project.
        • How they were completed:
          • A Streamlit multi-page application with a sidebar was created to allow the client to quickly gain an overview of the project.
    • User Story 4.2: Data Visualization for Insights - Business Requirement 1

      • As a client, I want to view data visualizations that illustrate the relationship between the target variable and its key features so that I can gain deeper insights from the data.
        • Acceptance Criteria:
          • The client should be able to view data visualizations that illustrate the relationship between the target variable and its key features.
        • Tasks:
          • Create a Streamlit page that shows data visualizations illustrating the relationship between the target variable and its key features.
        • How they were completed:
          • A correlation analysis Streamlit page was created that shows data visualizations illustrating the relationship between the target variable and its key features.
    • User Story 4.3: Correlation Analysis View - Business Requirement 1

      • As a client, I want to view a correlation analysis page on Streamlit so that I can understand the relationships between various features and the target variable.
        • Acceptance Criteria:
          • The correlation analysis page has to be accessible through the Streamlit sidebar.
          • The page should display visual representation between features and the target variable.
          • The page should allow the client to interact with the heatmap.
        • Tasks:
          • Create a correlation analysis page that is accessible through the Streamlit sidebar.
          • Create a heatmap or visual representation of the correlations between features and the target variable.
          • Create a page that allows the client to interact with the heatmaps.
        • How they were completed:
          • The correlation analysis page was created and made accessible through the Streamlit sidebar by adding its body function correlation_analysis_body() to app.py.
          • The visual representations were created with px.histogram for histograms, px.imshow for heatmaps, and px.scatter for scatter plots.
          • The heatmaps were plotted on the page with Plotly which has built-in interactivity.
    • User Story 4.4: Key Features for Sale Price Prediction - Business Requirement 1

      • As a client, I want to identify the key attributes of a house that have the strongest correlation with its potential sale price so that I can make data-driven pricing decisions.
        • Acceptance Criteria:
          • The client should be able to identify the key attributes of a house that have the strongest correlation with its potential sale price.
        • Tasks:
          • Perform Pearson and Spearman correlation analysis to find the relationship between different features and the sale price.
        • How they were completed:
          • The code df.corr(method="pearson") was used to calculate pearson correlation, and df.corr(method="spearman") to calculate spearman correlation on the DataFrame.
    • User Story 4.5: Interactive Prediction Input - Business Requirement 2

      • As a client, I want interactive input fields that allow me to enter custom data so that I can generate personalized predictions for the target variable.
        • Acceptance Criteria:
          • The input fields should allow the user to enter values for each feature or variable that influences the prediction.
          • Each input field must have validation to ensure the entered data is in the correct format.
        • Tasks:
          • Create input fields that allow the user to enter values for each feature or variable that influences the prediction.
          • Create input fields with validation to ensure the entered data is in the correct format.
        • How they were completed:
          • Streamlit widgets were created with st.number_input to allow the user to enter values for each feature or variable that influences the prediction.
          • Input widgets were given a defined min_value and max_value to ensure the input is within a realistic range.
    • User Story 4.6: Accurate Sale Price Prediction - Business Requirement 2

      • As a client, I want the most accurate possible prediction of the sale prices for the inherited properties so that I can maximize the financial returns from selling the four houses.
        • Acceptance Criteria:
          • The sale prices of the inherited properties must be accurately predicted.
        • Tasks:
          • Accurately predict the prices of the inherited properties.
        • How they were completed:
          • A machine learning regression model was used to accurately predict the price of the inherited properties.
    • User Story 4.7: Predictive Model Dashboard - Business Requirement 2

      • As a developer, I need to create a dashboard to effectively visualize and communicate the results of the model's predictions.
        • Acceptance Criteria:
          • A Streamlit dashboard must be created.
          • The dashboard must visualize and communicate the results of the model's predictions.
        • Tasks:
          • Create a Streamlit dashboard.
          • Create a dashboard that visualizes and communicates the results of the model's predictions.
        • How they were completed:
          • The Python Streamlit library was used to create the dashboard.
          • A dashboard was created that displays the model's predictions through DataFrames and a Sales Price calculator.
  • Dashboard Deployment and Release Epic

    • User Story 5.1: Early Deployment on Render - Business Requirement 1 & 2
      • As a developer, I want to initiate the deployment process of my application on Render at an early stage so that I can conduct end-to-end manual deployment testing from the outset.
        • Acceptance Criteria:
          • The application must be successfully deployed to Render.
          • Build and start commands must be correctly configured.
          • The environment variables must be configured correctly for deployment.
          • Deployment is automated with auto-deploy.
        • Tasks:
          • Deploy the application to Render.
          • Define the necessary build and start commands in Render settings.
          • Configure environment variables required for deployment.
          • Enable auto-deploy from the connected repository.
        • How they were completed:
          • A new Web Service was created on Render.
          • The build command was set to pip install -r requirements.txt && ./setup.sh and the start command to streamlit run app.py.
          • Environment variables were set to PORT Value: 8501 and PYTHON_VERSION Value: 3.12.1.
          • Auto-deploy settings were set to Yes.
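
To illustrate User Story 3.2 above, the Actual vs Prediction plots could be generated along the following lines. This is a minimal sketch assuming a fitted pipeline and existing X_train/X_test, y_train/y_test splits; it is not the project's exact plotting code.

```python
import matplotlib.pyplot as plt


def plot_actual_vs_prediction(y_actual, y_pred, title):
    """Scatter actual sale prices against model predictions."""
    fig, ax = plt.subplots(figsize=(6, 6))
    ax.scatter(y_actual, y_pred, alpha=0.4)
    # y = x reference line: points on it are perfectly predicted
    lims = [min(y_actual.min(), y_pred.min()), max(y_actual.max(), y_pred.max())]
    ax.plot(lims, lims, "r--", label="perfect prediction")
    ax.set_xlabel("Actual SalePrice")
    ax.set_ylabel("Predicted SalePrice")
    ax.set_title(title)
    ax.legend()
    plt.show()


plot_actual_vs_prediction(y_train, pipeline.predict(X_train), "Train set")
plot_actual_vs_prediction(y_test, pipeline.predict(X_test), "Test set")
```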

Hypotheses and How to Validate Them

  • First Hypothesis: The Relationship Between Property Size and Sale Price

    • Our first hypothesis posits that the size of a property has a direct and positive influence on its sale price. This assumption is grounded in the widely accepted notion that larger properties tend to offer more space and functionality, which in turn, makes them more attractive to potential buyers. The increased square footage of a property typically allows for additional rooms, larger living areas, and greater customization options, all of which are desirable attributes in a real estate market. Consequently, it is expected that properties with greater size will command higher sale prices due to their enhanced utility and appeal.
      • How to validate hypothesis: We will examine the relationship between house size attributes and the sale price to test this hypothesis (a code sketch follows this section).
      • Hypothesis Confirmation: Following a rigorous correlation analysis of the dataset, we observed a positive and moderate correlation between the size-related features of the properties and their sale prices. This finding validates our hypothesis, as it indicates that larger properties indeed tend to sell for higher prices. The data clearly supports the notion that, all other factors being equal, the size of a property plays a significant role in determining its market value, confirming our initial assumption.
  • Second Hypothesis: The Impact of Overall Quality on Sale Price

    • Our second hypothesis focuses on the role of a property's overall quality in influencing its sale price. We hypothesize that properties with higher quality ratings, which reflect superior materials, craftsmanship, and design, will be priced higher in the market. Buyers are likely to place a premium on well-constructed homes that offer longevity, comfort, and aesthetic appeal, which in turn boosts their market value. As such, homes with higher quality ratings should be more desirable and consequently demand higher prices.
      • How to validate hypothesis: We will examine the correlations between various attributes related to house quality assessment, such as 'OverallQual' and 'KitchenQual,' in order to validate the hypothesis.
      • Hypothesis Confirmation: After analyzing the data, we confirmed that there is a strong correlation between a property's overall quality rating and its sale price. Homes that received higher quality ratings were consistently priced higher in the market, reinforcing the idea that construction quality plays a pivotal role in determining a property’s value. This analysis supports our hypothesis that factors such as the quality of materials, craftsmanship, and overall design are crucial in shaping buyer perceptions and influencing the final sale price.
  • Third Hypothesis: The Influence of Property Condition on Market Value

    • For our third hypothesis, we investigate how a property's condition affects its sale price. We hypothesize that homes in excellent condition, particularly those that have undergone recent renovations or are newly built, will be more desirable to buyers and therefore will command higher sale prices. The condition of a property often reflects its upkeep and can signal to buyers the level of maintenance and care invested in the home. Properties in better condition are generally perceived as more move-in ready, which makes them more attractive to prospective buyers looking for immediate comfort without the need for costly repairs or improvements.
      • How to validate hypothesis: We will explore the data related to 'YearBuilt' and 'YearRemodAdd' to validate this hypothesis.
      • Hypothesis Confirmation: Our analysis supports this hypothesis by revealing a positive and moderate correlation between sale price and key factors such as the property's construction year and the year of its last remodel. The data suggests that newer homes and those with recent upgrades tend to sell at higher prices, highlighting the importance of property condition in the pricing process. The findings confirm that well-maintained homes or those with modern features are more likely to achieve higher sale prices, underscoring the influence of condition on market value.
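
The validation of all three hypotheses rests on correlation analysis against SalePrice. The sketch below shows one way these checks could be run, assuming df holds the housing data with numeric columns; the feature groupings are illustrative.

```python
# Feature groups per hypothesis (illustrative selections)
size_features = ["GrLivArea", "1stFlrSF", "TotalBsmtSF", "GarageArea"]  # Hypothesis 1
quality_features = ["OverallQual"]                                     # Hypothesis 2
condition_features = ["YearBuilt", "YearRemodAdd"]                     # Hypothesis 3

for name, features in [("size", size_features),
                       ("quality", quality_features),
                       ("condition", condition_features)]:
    cols = features + ["SalePrice"]
    # Pearson captures linear association, Spearman monotonic association
    pearson = df[cols].corr(method="pearson")["SalePrice"].drop("SalePrice")
    spearman = df[cols].corr(method="spearman")["SalePrice"].drop("SalePrice")
    print(f"{name} - Pearson:\n{pearson}\n")
    print(f"{name} - Spearman:\n{spearman}\n")
```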

Rationale to map the business requirements to the Data Visualisations and ML tasks

ML Business Case

In this section, we will outline the business case for the machine learning (ML) model, focusing on the project goals, requirements, and methodologies that align with the client’s needs. We will expand on key aspects such as business requirements, the feasibility of using traditional analysis, and the project’s inputs and outputs.

  • Business Requirements

    • BR1: The client wants to understand how various house attributes correlate with the sale price in Ames, Iowa. She expects data visualizations that illustrate the relationships between these variables and the sale price.

    • BR2: The client is looking to predict the sale price for her four inherited houses, as well as for any other property in Ames, Iowa.

  • Can Traditional Data Analysis Be Used?

    • Traditional data analysis methods could offer some insights, but they would have limitations:

      • Approximating Sale Prices: One approach would be for the client to manually draw inferences about the sale prices of the inherited houses by comparing them with houses of similar features in the dataset. While this might offer a rough estimate, the approach is inherently subjective and lacks precision. It’s also prone to human error and biases.

      • Subjectivity and Inaccuracy: Traditional methods, such as comparing houses in a simple spreadsheet or using basic statistical measures, may lead to inaccuracies due to the complexity of real estate pricing. Factors such as overall quality, above-ground living area, garage area, year built, and total basement area might not be fully accounted for, leading to imprecise conclusions.

      • Thus, using an ML model is a far more reliable and accurate method for predicting house prices based on multiple variables.

  • Does the Customer Need a Dashboard or API?

    • The client’s requirements lean toward having a dashboard for visualization and predictions:

      • Dashboard Needs: The client does not require an API at this point, as their focus is on visualizing the data and receiving predictions for house prices. The dashboard will provide an interactive way to explore the data, view the correlation of house attributes with sale prices, and input attributes for new houses to receive predicted prices in real-time.
      • User Interaction: A user-friendly dashboard allows the client to easily interact with the model and make predictions for various houses, ensuring that the solution is accessible to users without a technical background.
  • A Successful Project Outcome

    • Success for this project is defined by the following objectives:

      • Accurate Correlation Insights: The client will benefit from an analysis that highlights the most important variables affecting house sale prices. This insight is crucial for pricing strategy, allowing the client to better assess the value of their inherited properties.
      • Predictive Model Success: The client will consider the project a success if the machine learning model accurately predicts house sale prices based on the attributes provided, especially for the four inherited houses. The key is to help the client maximize the sale price for these properties by providing reliable predictions.
  • Ethical and Privacy Concerns

    • The dataset used for this analysis is public, meaning it has been made available by authorities for public use, and no personal or private information is involved.

      • No Privacy Issues: Since the data does not contain sensitive or personally identifiable information (PII), there are no ethical or privacy concerns associated with the use of this dataset.
      • Public Data Sources: As the dataset is openly available for anyone to access, the project operates transparently with no legal or ethical barriers.
  • EPICS and User Stories for Agile Implementation

    • The project is structured using the Agile methodology, with clear EPICS and user stories that break the work into manageable chunks. EPICS refer to the large bodies of work, and user stories outline specific tasks. See the Agile Methodology section for more details.
  • Does the Data Suggest a Particular Model?

    • Based on the nature of the task, where we are predicting a continuous numeric value (sale price), a regression model is most appropriate:

      • Regression for Continuous Output: Regression models are designed to predict continuous outcomes based on input features. For this case, the model will predict the sale price of a house based on its attributes.
  • Project Inputs and Intended Outputs

    • Model Inputs:

      • The model will take house attributes from the dataset. The features of these houses will be used to train the model and make predictions.
    • Model Outputs:

      • The output of the model will be the predicted sale price of the house, represented in USD as a continuous numeric value.
      • Additionally, the client will receive the sum of the predicted sale prices for all four inherited houses combined.
    • User Interaction:

      • The dashboard will allow users to input house attributes (OverallQual, GrLivArea, GarageArea, YearBuilt, TotalBsmtSF) through interactive widgets. In return, the dashboard will provide the user with an estimated sale price for any given house.
  • What Does Success Look Like?

    • Success is measured based on the following criteria:

      • R-squared Score (R²): A key performance indicator for this project is an R² score of at least 0.75 on both the training and test sets. The R² score measures how well the model's predictions align with actual values, with 1 being perfect and 0 indicating that the model explains none of the variance. A score of 0.75 or higher indicates that the model can reliably predict sale prices (a minimal evaluation sketch follows this list).
  • How Will the Client Benefit?

    • The primary benefit to the client is the ability to:

      • Maximize Sale Price: By using the model’s predictions, the client can optimize the pricing of their inherited properties. The insights from the model will help them determine the most competitive pricing strategy based on market conditions and property features.
      • Efficient Decision-Making: With accurate predictions, the client will make more informed decisions on how to price their houses and potentially increase their profitability. The interactive dashboard also empowers them with the tools to make these decisions in real-time.
      • This project’s outcome will significantly enhance the client’s ability to assess and act on the sale prices of inherited houses.
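
To make the model choice and success criterion concrete, the following minimal sketch trains the kind of regressor used in this project (an ExtraTreesRegressor, as described in the Machine Learning Model page section) and checks the R² target. It assumes a cleaned DataFrame df and the five key features identified later; hyperparameters are left at defaults for illustration.

```python
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

features = ["OverallQual", "GrLivArea", "GarageArea", "YearBuilt", "TotalBsmtSF"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["SalePrice"], test_size=0.2, random_state=0
)

model = ExtraTreesRegressor(random_state=0).fit(X_train, y_train)

# Success criterion: R² of at least 0.75 on both the train and test sets
for name, X, y in [("train", X_train, y_train), ("test", X_test, y_test)]:
    score = r2_score(y, model.predict(X))
    print(f"{name} R²: {score:.3f} -> {'meets' if score >= 0.75 else 'below'} target")
```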

Cross-industry standard process for data mining

This project applies the CRISP-DM (CRoss Industry Standard Process for Data Mining) methodology.

| Phase | Explanation |
|-------|-------------|
| Business Understanding | This phase focuses on understanding the project objectives and requirements from a business perspective. The goal is to define the problem, set objectives, and determine the data mining goals to achieve business success. |
| Data Understanding | In this phase, the focus is on collecting initial data and understanding its quality, content, and structure. It involves exploratory data analysis to uncover insights, patterns, and potential issues. |
| Data Preparation | This phase involves cleaning and transforming raw data into a suitable format for modeling. It includes tasks like dealing with missing data, outlier detection, and feature engineering. |
| Modeling | In this phase, various data mining techniques (such as classification, regression, clustering, etc.) are applied to the prepared data to create models. It is often an iterative process where models are trained, tested, and refined. |
| Evaluation | After the model has been built, this phase evaluates its performance based on predefined criteria. The model is assessed to ensure it meets business goals and objectives before it is deployed. |
| Deployment | The final phase focuses on implementing the data mining solution into the business environment. This includes integrating the model into production systems, delivering results, and monitoring its impact on business processes. |

Data Preprocessing

Data Cleaning Pipeline

A data cleaning pipeline was developed to handle missing values. Various imputation methods were applied based on the statistical properties of the variables.

  • Mean Imputation for Normally Distributed Continuous Variables

    • For continuous features such as LotFrontage and BedroomAbvGr, missing values were imputed using the mean. This approach is suitable for variables that follow an approximately normal distribution without significant outliers, as it maintains the overall data distribution without skewing the central tendency.
  • Median Imputation for Skewed Continuous Variables

    • Variables exhibiting right-skewed distributions, such as 2ndFlrSF and MasVnrArea, were imputed using the median. Since the median is less sensitive to extreme values, it provides a more robust imputation strategy for skewed data, preventing artificial distortion of the dataset.
  • Categorical Variable Imputation with 'None'

    • Categorical features like GarageFinish, BsmtFinType1, and BsmtExposure were missing primarily because these attributes did not apply to certain properties (e.g., a house without a basement). To preserve this structural information, missing values were imputed with "None" rather than the mode, ensuring that the absence of a feature is explicitly represented rather than inferred as a common category.
  • Feature Removal Due to High Missingness

    • Features such as EnclosedPorch, GarageYrBlt, and WoodDeckSF contained a substantial proportion of missing values. Rather than imputing them with limited available observations which could introduce bias, these features were removed from the dataset. Their exclusion was justified based on their potential lack of predictive power and the risk of introducing noise into the model.

For imputation rationale, refer to the detailed analysis in the following notebook: https://github.com/linobollansee/property-value-maximizer/blob/main/jupyter_notebooks/02%20-%20DataCleaning.ipynb
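
Expressed as code, the cleaning steps above could be combined into a single pipeline using the feature-engine transformers named in the user stories. This is a minimal sketch following the groupings in this section; the notebook's exact pipeline may differ.

```python
from feature_engine.imputation import CategoricalImputer, MeanMedianImputer
from feature_engine.selection import DropFeatures
from sklearn.pipeline import Pipeline

cleaning_pipeline = Pipeline([
    # Mean imputation for approximately normal continuous variables
    ("mean_impute", MeanMedianImputer(
        imputation_method="mean", variables=["LotFrontage", "BedroomAbvGr"])),
    # Median imputation for right-skewed continuous variables
    ("median_impute", MeanMedianImputer(
        imputation_method="median", variables=["2ndFlrSF", "MasVnrArea"])),
    # 'None' marks a structurally absent feature (e.g. no garage or basement)
    ("cat_impute", CategoricalImputer(
        imputation_method="missing", fill_value="None",
        variables=["GarageFinish", "BsmtFinType1", "BsmtExposure"])),
    # Drop features with too many missing values to impute reliably
    ("drop_sparse", DropFeatures(
        features_to_drop=["EnclosedPorch", "GarageYrBlt", "WoodDeckSF"])),
])

df_clean = cleaning_pipeline.fit_transform(df)
```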

Feature Engineering

Categorical encoding

Categorical encoding was applied to convert ordinal categories into numerical values, preserving both the order and hierarchy of the categories. This allowed the regression analysis to account for their relative rankings. However, during the data cleaning process, most ordinal categories were removed.

Numerical Transformations

| Feature | Assessment | Applied Transformation |
|---------|------------|------------------------|
| TotalBsmtSF | Mean imputation proved to be the most effective method for handling missing values. | MeanMedianImputer |
| GrLivArea | A logarithmic transformation was the best approach to achieve normalization. | LogTransformer |
| TotalBsmtSF | Power transformation yielded the most effective normalization. | PowerTransformer |
| TotalBsmtSF, GarageArea | Outliers were best handled using Winsorization with the IQR method. | Winsorizer |
| TotalBsmtSF, GrLivArea, GarageArea | Standard scaling provided the most effective way to normalize feature ranges. | StandardScaler |
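
A minimal sketch of how these transformations might be chained with feature-engine and scikit-learn follows; the step ordering and pipeline structure are assumptions rather than the project's exact pipeline.

```python
from feature_engine.imputation import MeanMedianImputer
from feature_engine.outliers import Winsorizer
from feature_engine.transformation import LogTransformer, PowerTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

feature_pipeline = Pipeline([
    ("impute", MeanMedianImputer(imputation_method="mean",
                                 variables=["TotalBsmtSF"])),
    ("log", LogTransformer(variables=["GrLivArea"])),
    ("power", PowerTransformer(variables=["TotalBsmtSF"])),
    # Cap outliers at 1.5 * IQR beyond the quartiles, on both tails
    ("winsorize", Winsorizer(capping_method="iqr", tail="both", fold=1.5,
                             variables=["TotalBsmtSF", "GarageArea"])),
    ("scale", StandardScaler()),  # scales all remaining numeric features
])
```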

Dashboard Design

Streamlit sidebar

  • Streamlit sidebar: A Streamlit sidebar is a UI component in Streamlit that allows you to place widgets, controls, and other elements in a collapsible side panel. It helps organize interactive elements separately from the main content, improving usability and layout. This sidebar supports:
    • Navigation – Allow users to switch between different pages or sections of an app. Pages available:
      • 👁️ Project Overview
      • 📈 Correlation Analysis
      • 🔮 Sale Price Prediction
      • 🔬 Hypothesis Validation
      • 🤖 Machine Learning Model

Business requirements covered:

  • BR1 - The client wants to understand how various house attributes correlate with the sale price in Ames, Iowa. She expects data visualizations that illustrate the relationships between these variables and the sale price.

  • BR2 - The client is looking to predict the sale price for her four inherited houses, as well as for any other property in Ames, Iowa.

[Screenshot: Streamlit sidebar]
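
The sidebar navigation follows the object-oriented multi-page pattern credited later in this README. Below is a minimal sketch of that pattern; apart from correlation_analysis_body(), which app.py is said to register, all names are assumptions.

```python
import streamlit as st


class MultiPage:
    """Collects page functions and renders the one chosen in the sidebar."""

    def __init__(self, app_name):
        self.pages = []
        self.app_name = app_name
        st.set_page_config(page_title=app_name)

    def add_page(self, title, func):
        self.pages.append({"title": title, "function": func})

    def run(self):
        # Sidebar radio acts as the page menu
        page = st.sidebar.radio(
            "Menu", self.pages, format_func=lambda p: p["title"])
        page["function"]()


# app = MultiPage("Property-Value-Maximizer")
# app.add_page("📈 Correlation Analysis", correlation_analysis_body)
# app.run()
```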

Project Overview page

  • Project Overview page
    • Project Overview: Describes the objective of maximizing property sale prices using a machine learning model.
    • Project Dataset: Provides details about the data source, size, and attributes used for analysis.
    • Business Requirements: Outlines the goals of analyzing house characteristics and developing a predictive pricing model.
    • Additional Information: Refers to further details available in a README file.

Business requirements covered:

  • BR1 - The client wants to understand how various house attributes correlate with the sale price in Ames, Iowa. She expects data visualizations that illustrate the relationships between these variables and the sale price.

  • BR2 - The client is looking to predict the sale price for her four inherited houses, as well as for any other property in Ames, Iowa.

[Screenshot: Project Overview page]

Correlation Analysis page

  • Correlation Analysis page
    • Correlation Analysis: Focuses on examining the relationship between different house features and sale prices. The goal is to use data visualizations to highlight how these features impact pricing.
    • Inspect house data from the dataset checkbox: If this box is checked, it loads the dataset as a DataFrame table. It can be downloaded as a CSV file, searched, or set to full screen.
    • The analysis investigated factors affecting house sale prices, aiming to identify key variables influencing pricing trends. The correlation analysis highlighted the following variables most closely linked to sale prices: '1stFlrSF', 'GarageArea', 'GarageYrBlt', 'GrLivArea', 'KitchenQual_Ex', 'KitchenQual_TA', 'OverallQual', 'TotalBsmtSF', 'YearBuilt', and 'YearRemodAdd'.
    • The analysis identified key factors affecting home values: larger homes with more features are more valuable, better condition and higher-quality materials increase value, and newly built or recently renovated homes tend to have higher market prices.
    • Data Visualizations: Interactive Plotly plots that can be downloaded as PNG, zoomed, panned, autoscaled, reset to their original axes, and viewed in fullscreen.
      • Distribution of target variable checkbox: If this box is checked, it loads an interactive Plotly histogram of the SalePrice.
      • Show Correlations and PPS Heatmaps: If this box is checked, it loads three interactive Plotly heatmaps: Spearman and Pearson correlation heatmaps, and a predictive power score (PPS) heatmap.
      • Variables Plots - Visual Analysis: If this box is checked, it loads eight interactive Plotly plots (1stFlrSF, GarageArea, GarageYrBlt, GrLivArea, OverallQual, TotalBsmtSF, YearBuilt, YearRemodAdd), each plotted individually against SalePrice.

Business requirements covered:

  • BR1 - The client wants to understand how various house attributes correlate with the sale price in Ames, Iowa. She expects data visualizations that illustrate the relationships between these variables and the sale price.

[Screenshot: Correlation Analysis page]

Sale Price Prediction page

  • Sale Price Prediction page
    • Inherited Houses (Filtered Data for Prediction) are displayed in a DataFrame table with the most important house features for price prediction.
    • The Predicted Sale Prices for Inherited Houses are displayed in a DataFrame table with the most important house features for price prediction.
    • A Total Predicted Sale Price for All Inherited Houses is displayed: 💲632,680.27, formatted with commas as thousands separators and rounded to two decimal places to represent cents.
    • Predict Sales Price for Your Own House can be set with 5 number input widgets and calculated with a Calculate House price widget button.
    • A "SUCCESSFULLY CALCULATED" message appears along with the Price for Your Own House, which depends on the input values.

Business requirements covered:

  • BR2 - The client is looking to predict the sale price for her four inherited houses, as well as for any other property in Ames, Iowa.

[Screenshot: Sale Price Prediction page]

Hypothesis Validation page

  • Hypothesis Validation page

    • This page outlines and validates three hypotheses related to property sale prices using st.success.
      • Larger properties sell for higher prices – confirmed.
      • Higher property quality leads to higher sale prices – confirmed.
      • Better property condition (newer or renovated) results in higher sale prices – confirmed.
  • [Screenshot: Hypothesis Validation page]

Business requirements covered:

  • BR1 - The client wants to understand how various house attributes correlate with the sale price in Ames, Iowa. She expects data visualizations that illustrate the relationships between these variables and the sale price.

Machine Learning Model page

  • Machine Learning Model page
    • A model was developed and optimized to predict property sales prices with a focus on achieving a specific level of accuracy.
    • The best model's pipeline performance was evaluated on both the train and test sets, showing strong results.
    • The pipeline steps, key features, feature importance, performance, and regression results are presented below.
    • The pipeline consists of several steps, including imputation, transformation, scaling, and modeling, with the final model being an ExtraTreesRegressor.
    • The model was trained using the following features, with their importance ranking as indicated: OverallQual, GrLivArea, GarageArea, YearBuilt, TotalBsmtSF.
    • Feature importance is plotted.
    • The pipeline successfully met the performance goal of an R² score of at least 0.75 for both the train and test sets. The model evaluation on the train set shows strong performance across several metrics.
    • Mean Absolute Error (MAE): Measures the average absolute difference between predicted and actual values. Lower is better.
    • R² Score: Indicates how well the model explains the variance in the data. Higher (closer to 1) is better.
    • Mean Squared Error (MSE): Measures the average squared difference between predictions and actual values. Lower is better, but it penalizes larger errors more heavily.
    • Root Mean Squared Error (RMSE): The square root of MSE, giving error in the same unit as the target variable. Lower is better.
    • Mean Absolute Percentage Error (MAPE): Measures prediction accuracy as a percentage of the actual values. Lower is better, but problematic with small values.
    • Explained Variance Score (EVS): Measures how much of the variance in the data is explained by the model. Closer to 1 is better.
    • Median Absolute Error (MedAE): The median of absolute errors. Robust to outliers, and smaller values are better.
    • The regression performance plots show that the model effectively predicts sale prices, although its reliability decreases for higher-priced houses.
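
All of the metrics above are available in scikit-learn (pinned at 1.3.1 in this project). A minimal sketch, assuming actual values y and model predictions y_pred:

```python
from sklearn.metrics import (
    explained_variance_score,
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    median_absolute_error,
    r2_score,
)

print("MAE:  ", mean_absolute_error(y, y_pred))
print("R²:   ", r2_score(y, y_pred))
print("MSE:  ", mean_squared_error(y, y_pred))
print("RMSE: ", mean_squared_error(y, y_pred, squared=False))  # error in USD
print("MAPE: ", mean_absolute_percentage_error(y, y_pred))
print("EVS:  ", explained_variance_score(y, y_pred))
print("MedAE:", median_absolute_error(y, y_pred))
```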

Business requirements covered:

  • BR2 - The client is looking to predict the sale price for her four inherited houses, as well as for any other property in Ames, Iowa.

  • [Screenshot: Machine Learning Model page]

Plots

Distribution and Correlation Plots

These plots give insight into the distribution and relationships within the data.

Histogram of Sale Price to visualize its distribution.

Heatmap showing Pearson correlation between variables.

Heatmap showing Spearman correlation between variables.

Heatmap illustrating Predictive Power Scores (PPS).

Box Plots

These box plots illustrate how various features impact house prices.

Box Plot of Price by Kitchen Quality shows price variation across kitchen quality levels.

Box Plot of Price by Overall Quality visualizes price differences by overall house quality.

Line Plots

These line plots depict trends in house prices over time and with building age.

Line Plot of Price by Year Built reveals trends in house prices over the years.

Line Plot of Price by Year Remodeled shows how price changes with remodeling.

Linear Model Plots

These plots show the relationships between house prices and specific features, based on linear regression.

Linear Model of Price by 1st Floor Area demonstrates the impact of floor space on price.

Linear Model of Price by Garage Area illustrates the influence of garage size on house prices.

Linear Model of Price by GrLiv Area visualizes the effect of above-grade (ground) living area on prices.

Linear Model of Price by MasVnr Area shows the relationship between masonry veneer area and price.

Linear Model of Price by Open Porch Area illustrates how open porch space affects price.

Linear Model of Price by Total Basement Area highlights the impact of basement area on price.

Performance and Feature Importance

These plots help evaluate model performance and highlight the most important features.

Feature Importance Plot identifies the most influential features for house price prediction.

Regression Performance Plot shows the effectiveness of the regression model.

Bugs and Fixes

  • The ppscore library caused the following error:

ModuleNotFoundError: No module named 'pkg_resources'

[Screenshot: ModuleNotFoundError for pkg_resources]

This bug was fixed by adding setuptools==75.8.0 to requirements.txt, as it is now necessary to have setuptools installed to use ppscore with current Python versions.

Many are unaware of this very simple solution that took me weeks to find, and instead revert to older Python versions, often facing new package dependency conflicts, package availability problems, incompatible pickle files, and other deployment issues.

  • During the hyperparameter optimization search, an extreme number of FutureWarnings cluttered the output cell and made the Jupyter Notebook unmanageable: [Screenshot: FutureWarning clutter]

I added this code to prevent the clutter:

import os
import warnings
# Silence warnings raised in this process (e.g. the FutureWarning flood)
warnings.filterwarnings("ignore")
# Silence warnings in subprocesses spawned during the search as well
os.environ["PYTHONWARNINGS"] = "ignore"

When my Render deployment stopped working, I discovered I had exceeded my 500 Free Pipeline Minutes. I bought 1000 extra minutes to make sure deployment would continue functioning as normal.

Project Testing

User Story Testing

| User Story | Action | Expected Result | Result |
|------------|--------|-----------------|--------|
| 1.1 | Run pip install -r requirements.txt and verify with pip list | Dependencies are installed and listed correctly | Successful |
| 1.2 | Run pd.read_csv('data.csv') in Jupyter | Dataset loads without errors | Successful |
| 2.1 | Run df.isnull().sum() and ProfileReport(df) | Missing values identified, profile report generated | Successful |
| 3.1 | Train model and evaluate using r2_score(y_test, y_pred) | R² score ≥ 0.75 for train/test sets | Successful |
| 3.2 | Generate plots with sns.scatterplot(x=y_train, y=pred_train) | Actual vs Prediction plots displayed | Successful |
| 4.1 | Open Streamlit app and navigate to home page | Project overview is displayed | Successful |
| 4.2 | View visualizations in Streamlit | Data insights are correctly visualized | Successful |
| 4.3 | Open correlation analysis page in Streamlit | Interactive heatmap is displayed | Successful |
| 4.4 | Run df.corr(method='pearson') and df.corr(method='spearman') | Key features for sale price identified | Successful |
| 4.5 | Enter data in Streamlit input fields | Inputs are validated and accepted | Successful |
| 4.6 | Predict house prices using model | Accurate predictions are displayed | Successful |
| 4.7 | Open predictive model dashboard in Streamlit | Model results are visualized correctly | Successful |
| 5.1 | Deploy app on Render and check live URL | Application deploys and runs correctly | Successful |

Widget Testing

| Test | Action | Expected Result | Result |
|------|--------|-----------------|--------|
| Widget Range Validation | Input values 1-10 in OverallQual widget and attempt invalid input | Valid inputs accepted, invalid inputs trigger warning | Successful |
| Widget Input Methods | Modify widget values using +/- buttons or manual entry | Values update correctly for both methods | Successful |
| Prediction Accuracy Validation | Input inherited house values and compare predictions | Widget predictions match regression model output | Successful |

PEP 8

All Python project files underwent thorough testing using the CI Python Linter, available at https://pep8ci.herokuapp.com/. This tool was used to ensure that all code adheres to PEP 8 standards, maintaining consistency, readability, and best practices across the project. The automated linting process helped identify and rectify formatting issues, ensuring that the codebase meets high-quality standards.

Files checked:

app.py
app_pages/correlation_analysis.py
app_pages/ml_price_prediction.py
app_pages/multipage.py
app_pages/page_summary.py
app_pages/project_hypothesis.py
app_pages/sales_price_prediction.py
src/data_management.py
src/machine_learning/evaluate_reg.py
src/machine_learning/predictive_analysis_functions.py

Deployment

  1. Log in to Render.com using Github.
  2. Click on the New button, select Web Service.
  3. At Source Code, select Git Provider. Select your repository name. Click Connect.
  4. Enter a unique name for your web service.
  5. Select Python 3 as the language.
  6. Select the main branch.
  7. Select the Frankfurt (EU Central) Region.
  8. Set the Build Command: pip install -r requirements.txt && ./setup.sh
  9. Set the Start Command: streamlit run app.py
  10. Set Instance Type: Free
  11. Set the Environment Variables: Key: PORT Value: 8501 and Key: PYTHON_VERSION Value: 3.12.1
  12. Click Deploy Web Service

Technologies

  • GitHub: The project's source code is hosted on GitHub at https://github.com/.
  • GitHub Codespaces: The cloud-based integrated development environment (IDE) GitHub Codespaces at https://github.com/ was used for code editing, running Visual Studio Code (Version 1.96.3).
  • Render: The web application is deployed on Render at https://render.com/.
  • CI Python Linter: Code formatting and adherence to PEP8 standards were ensured using the CI Python Linter at https://pep8ci.herokuapp.com/
  • GoFullPage: A browser extension available at https://gofullpage.com/ for capturing full-page screenshots in Google Chrome and Microsoft Edge. It allows users to take scrolling screenshots of entire webpages without needing to stitch multiple images together manually.

Python Packages

  • Data Processing & Feature Engineering

    • feature-engine==1.6.1: A library for feature engineering in machine learning pipelines, offering transformations like encoding, imputation, and scaling.
    • pandas==2.1.1: A fundamental library for data manipulation and analysis using DataFrames and Series.
    • numpy==1.26.1: Provides support for numerical operations, arrays, and mathematical functions.
  • Data Visualization

    • matplotlib==3.8.0: A widely used library for static, animated, and interactive visualizations.
    • seaborn==0.13.2: Built on top of Matplotlib, it simplifies statistical data visualization.
    • plotly==5.17.0: Enables interactive plots, dashboards, and web-based visualizations.
  • Machine Learning & Model Evaluation

    • joblib==1.4.2: Enables efficient model serialization, parallel computing, and caching for machine learning workflows.
    • scikit-learn==1.3.1: A popular ML library offering tools for classification, regression, clustering, and preprocessing.
    • xgboost==1.7.6: An optimized gradient boosting framework widely used for structured data ML tasks.
  • Data Profiling & Exploratory Analysis

    • ppscore==1.1.0: Calculates predictive power scores to determine relationships between variables.
    • ydata-profiling==4.12.0: Generates detailed EDA reports, summarizing data characteristics, correlations, and missing values.
  • Web Applications & Image Processing

    • streamlit==1.40.2: A framework for building interactive ML and data science web apps with minimal code.
  • Others

    • kaggle==1.5.12: A library for accessing and managing Kaggle datasets via the Kaggle API.
    • setuptools==75.8.0: A package development and distribution tool, ensuring dependencies are managed properly.
    • imbalanced-learn==0.11.0: A library for handling imbalanced datasets by providing various resampling techniques.

Credits

Code

A significant portion of the code used in this project was sourced directly from the Code Institute. This includes:

  • Setup and Data Collection
    • Code to change working directory.
    • Code to create directories.
    • Code to download data from Kaggle.
    • Code to extract zip files.
    • Code to import CSV files.
  • Exploratory Data Analysis (EDA) and Data Cleaning
    • Code to display DataFrame (df) summaries.
    • Code to count null values.
    • Code to count duplicates.
    • Code to drop variables from a DataFrame (df).
    • Code to subset columns or rows.
    • Code to generate an EDA report.
    • Code to visualize data cleaning effect.
    • Code to plot numerical and categorical variables.
    • Code to generate a heatmap.
    • Code to generate a histogram.
  • Data Preprocessing
    • Code to apply mean imputation.
    • Code to apply median imputation.
    • Code to apply categorical imputation.
    • Code to OneHotEncode.
    • Code to apply ordinal encoding on categorical variables.
    • Code to apply a winsoriser transformation.
    • Code to apply a power transformation.
    • Code to apply a log transformation.
    • Code to apply feature scaling using standardization.
    • Code to check for feature engineering for numerical and categorical variables.
    • Code to identify highly correlated features.
    • Code to calculate correlation coefficients.
  • Data Splitting and Feature Selection
    • Code to split train and test set.
    • Code to identify the most important features by the best regression model.
    • Code to extract the best regressor from search.
    • Code to extract the best hyperparameter.
    • Code to check the best model.
  • Modeling and Hyperparameter Tuning
    • Code to perform hyperparameter optimization.
    • Code to summarize the results of the grid searches.
    • Code to fit a machine learning pipeline.
  • Model Evaluation and Saving
    • Code to evaluate regression performance on train set and test set.
    • Code to save a machine learning model to a pickle file.
  • Dependencies
    • Code to load requirements.txt dependencies.
  • Jupyter Notebooks
    • Code of an ipynb template file.
  • Streamlit
    • Code to generate streamlit pages using an object-oriented approach.
  • README
    • Template code in markdown.

The Code Institute code is available here:

The structure and flow of the code in this project's Jupyter Notebooks, Streamlit application, and README file were initially inspired by Werner Stäblein's repository at https://github.com/Werner-Staeblein/Project-5. However, numerous enhancements and new features have been incorporated to differentiate my work.

Media

  • The Unicode icons used in this project were generated with the assistance of ChatGPT, an AI language model developed by OpenAI. These icons were selected and formatted based on UX to enhance clarity and visual communication. ChatGPT is available at: https://chatgpt.com/

  • The responsive-view image at the top of the README.md was created using: https://ui.dev/amiresponsive

Content

  • ChatGPT was frequently used to enhance text content and minimize errors in the Jupyter Notebooks, Streamlit Dashboard, and README.md file, but it was used responsibly due to its potential for mistakes caused by its own biases in training data, misinterpretation of context, and reasoning limitations.

Acknowledgements

  • I would like to acknowledge my mentor, Mo Shami, for his support throughout the project. His suggestion to explore the repositories of students doing the same project and run these repositories locally with the streamlit run app.py terminal command when the Render or Heroku deployments were unavailable was especially helpful.

  • I also would like to acknowledge Code Institute tutors Niel McEwen and Roman Rakic, for showing me how to deploy to Render.com, through the guide available at: https://code-institute-students.github.io/deployment-docs/42-pp5-pa/

  • Roman Rakic assisted me on another occasion where one of my plots became unresponsive. I had inadvertently assigned a continuous variable with too many unique values as the hue, causing the plot to hang and preventing the retrieval of any debugging information. Roman Rakic helped me identify the issue and resolve it, which better prepared me for working on this project.

  • The entire Code Institute Slack Community for its wealth of information, in particular the project-portfolio-5-predictive-analytics channel.
