
Property-Value-Maximizer

Welcome to the Property-Value-Maximizer project! This initiative aims to apply Machine Learning and regression algorithms to accurately predict house prices in Ames, Iowa. Our client has inherited four properties and seeks to maximize their market value before selling. By analyzing key housing features and building a powerful predictive model, we strive to provide data-driven insights that lead to optimal pricing strategies.

[Screenshot: responsive view of the app across multiple devices]

The project is accessible at the following URL: https://property-value-maximizer.onrender.com

Table of Contents

  • Dataset Content
  • Terminology
  • Business Requirements
  • Agile Methodology
  • Hypotheses and How to Validate Them
  • Rationale to map the business requirements to the Data Visualisations and ML tasks
  • ML Business Case
  • Cross-industry standard process for data mining
  • Data Preprocessing
  • Feature Engineering
  • Dashboard Design
  • Plots
  • Bugs and Fixes
  • Project Testing
  • PEP 8
  • Deployment
  • Technologies
  • Python Packages
  • Credits
  • Acknowledgements

Dataset Content

  • The dataset is sourced from Kaggle. We then created a fictitious user story where predictive analytics can be applied in a real project in the workplace.
  • The dataset contains nearly 1,500 rows of housing records from Ames, Iowa. Each record describes a house profile (Floor Area, Basement, Garage, Kitchen, Lot, Porch, Wood Deck, Year Built) and its respective sale price, for houses built between 1872 and 2010.
| Variable | Meaning | Units / Values |
|----------|---------|----------------|
| 1stFlrSF | First floor square feet | 334 - 4692 |
| 2ndFlrSF | Second floor square feet | 0 - 2065 |
| BedroomAbvGr | Bedrooms above grade (does NOT include basement bedrooms) | 0 - 8 |
| BsmtExposure | Refers to walkout or garden level walls | Gd: Good Exposure; Av: Average Exposure; Mn: Minimum Exposure; No: No Exposure; None: No Basement |
| BsmtFinType1 | Rating of basement finished area | GLQ: Good Living Quarters; ALQ: Average Living Quarters; BLQ: Below Average Living Quarters; Rec: Average Rec Room; LwQ: Low Quality; Unf: Unfinished; None: No Basement |
| BsmtFinSF1 | Type 1 finished square feet | 0 - 5644 |
| BsmtUnfSF | Unfinished square feet of basement area | 0 - 2336 |
| TotalBsmtSF | Total square feet of basement area | 0 - 6110 |
| GarageArea | Size of garage in square feet | 0 - 1418 |
| GarageFinish | Interior finish of the garage | Fin: Finished; RFn: Rough Finished; Unf: Unfinished; None: No Garage |
| GarageYrBlt | Year garage was built | 1900 - 2010 |
| GrLivArea | Above grade (ground) living area square feet | 334 - 5642 |
| KitchenQual | Kitchen quality | Ex: Excellent; Gd: Good; TA: Typical/Average; Fa: Fair; Po: Poor |
| LotArea | Lot size in square feet | 1300 - 215245 |
| LotFrontage | Linear feet of street connected to property | 21 - 313 |
| MasVnrArea | Masonry veneer area in square feet | 0 - 1600 |
| EnclosedPorch | Enclosed porch area in square feet | 0 - 286 |
| OpenPorchSF | Open porch area in square feet | 0 - 547 |
| OverallCond | Rates the overall condition of the house | 10: Very Excellent; 9: Excellent; 8: Very Good; 7: Good; 6: Above Average; 5: Average; 4: Below Average; 3: Fair; 2: Poor; 1: Very Poor |
| OverallQual | Rates the overall material and finish of the house | 10: Very Excellent; 9: Excellent; 8: Very Good; 7: Good; 6: Above Average; 5: Average; 4: Below Average; 3: Fair; 2: Poor; 1: Very Poor |
| WoodDeckSF | Wood deck area in square feet | 0 - 736 |
| YearBuilt | Original construction date | 1872 - 2010 |
| YearRemodAdd | Remodel date (same as construction date if no remodelling or additions) | 1950 - 2010 |
| SalePrice | Sale price | 34900 - 755000 |
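
As a quick illustration, the dataset can be fetched and inspected as follows. This is a minimal sketch: it assumes the Kaggle CLI is authenticated via a kaggle.json credentials file, and the dataset slug and CSV filename are placeholders to substitute with the actual ones.

```python
import subprocess

import pandas as pd

# Download and unzip the dataset via the Kaggle CLI (requires kaggle.json
# credentials). The slug below is a placeholder for the actual dataset path.
subprocess.run(
    ["kaggle", "datasets", "download",
     "-d", "codeinstitute/housing-prices-data",
     "-p", "inputs/datasets", "--unzip"],
    check=True,
)

# Load the records; the CSV filename is an assumption.
df = pd.read_csv("inputs/datasets/house_prices_records.csv")
print(df.shape)                             # expect nearly 1,500 rows
print(df["SalePrice"].agg(["min", "max"]))  # 34,900 - 755,000 per the table above
```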

Terminology

Sale Price is the current market value of a house, based on its characteristics and features.

Inherited House is a property that the client has inherited and requires an assessment of its market value.

Summed Price is the total of the predicted market prices for all four houses inherited by the client.

Business Requirements

Our client has inherited four properties from her late great-grandfather, located in Ames, Iowa, USA. While she has a strong understanding of property prices in her home country, she is concerned that relying on her existing knowledge of the Iowan market may result in inaccurate appraisals. Factors that make a house desirable and valuable in her country may differ from those in Ames, Iowa.

The client has provided a public dataset containing house prices for the Ames area and has requested our assistance in maximizing the sale price for her inherited properties. Our goal is to predict the sale price of these four homes based on their respective attributes.

The business requirements are as follows:

  • BR1 - The client wants to understand how various house attributes correlate with the sale price in Ames, Iowa. She expects data visualizations that illustrate the relationships between these variables and the sale price.

  • BR2 - The client is looking to predict the sale price for her four inherited houses, as well as for any other property in Ames, Iowa.

To meet these business requirements, Epics and User Stories have been defined. These user stories have been further broken down into manageable tasks, allowing for an agile approach to implementation.

Agile Methodology

Epics

  • Data Collection and Information Gathering Epic
  • Data Visualization, Cleaning, and Preparation Epic
  • Model Training, Optimization, and Validation Epic
  • Dashboard Planning, Design, and Development Epic
  • Dashboard Deployment and Release Epic

User Stories

  • Data Collection and Information Gathering Epic

    • User Story 1.1: Install Required Dependencies and Packages - Business Requirement 1 & 2

      • As a developer, I need to install all required dependencies and packages so that I can effectively utilize the necessary tools for project implementation.
        • Acceptance Criteria:
          • All dependencies are installed successfully without errors.
        • Tasks:
          • Install all required dependencies using the PIP package manager.
        • How they were completed:
          • The command pip install -r requirements.txt was run in the IDE terminal.
    • User Story 1.2: Import Relevant Data into Jupyter Notebook - Business Requirement 1 & 2

      • As a developer, I need to import relevant data into a Jupyter Notebook so that I can conduct a thorough analysis of the dataset.
        • Acceptance Criteria:
          • The housing dataset is successfully downloaded from Kaggle.
          • The dataset is in CSV format and correctly read using Pandas.
        • Tasks:
          • Download the housing dataset from Kaggle using the Kaggle API.
          • Read the CSV files into DataFrames using Pandas.
        • How they were completed:
          • The kaggle.json credentials file was downloaded from kaggle.com and used to authenticate with the API to download the housing dataset.
          • The pd.read_csv() function was used to read the CSV files into a DataFrame.
  • Data Visualization, Cleaning, and Preparation Epic

    • User Story 2.1: Data Cleaning and Quality Assurance - Business Requirement 1
      • As a developer, I want to implement a robust data cleaning process so that I can ensure the dataset is accurate, reliable, and of high quality.
        • Acceptance Criteria:
          • All missing or null values in the dataset must be identified.
          • A data profile report must be generated.
          • Visualizations should demonstrate the effect of cleaning.
          • Missing values are imputed.
        • Tasks:
          • Inspect the dataset to identify missing or null values.
          • A complete data profile report is generated.
          • Create visualizations (bar charts, box plots, histograms).
          • Apply imputation to missing values.
        • How they were completed:
          • The expressions df.isnull() and df.isna() return Boolean masks identifying missing values.
          • The expression ProfileReport(df=df, minimal=True) generates an automated exploratory data analysis (EDA) report.
          • A custom function called DataCleaningEffect() visualizes the effects of cleaning.
          • Instances of MeanMedianImputer(imputation_method='mean') and MeanMedianImputer(imputation_method='median') were created, and their fit_transform() method was applied.
  • Model Training, Optimization, and Validation Epic

    • User Story 3.1: Model Performance Evaluation - Business Requirement 2

      • As a developer, I want to evaluate the performance of the predictive model so that I can ensure the reliability and accuracy of its predictions.
        • Acceptance Criteria:
          • The predictive model must be evaluated to ensure reliability and accuracy of its predictions.
        • Tasks:
          • Evaluate the predictive model to ensure reliability and accuracy of its predictions.
        • How they were completed:
          • An R² score of at least 0.75 was measured on both the train set and the test set.
    • User Story 3.2: Individual Prediction Testing - Business Requirement 2

      • As a developer, I want to test individual data points against the model’s predictions so that I can determine the target variable based on my provided features.
        • Acceptance Criteria:
          • Individual data points must be tested against the model's predictions to determine the target variable.
        • Tasks:
          • Test individual data points against the model's predictions to determine the target variable.
        • How they were completed:
          • Plots were generated that measure Actual vs Prediction for both the train and test sets (a sketch appears after this list).
  • Dashboard Planning, Design, and Development Epic

    • User Story 4.1: Streamlit Landing Page Access - Business Requirement 1 & 2

      • As a client, I want to access the Streamlit landing page so that I can quickly gain an overview of the project.
        • Acceptance Criteria:
          • The client should be able to quickly gain an overview of the project through the Streamlit landing page.
        • Tasks:
          • Create a Streamlit landing page that allows the client to quickly gain an overview of the project.
        • How they were completed:
          • A Streamlit multi-page application with a sidebar was created to allow the client to quickly gain an overview of the project.
    • User Story 4.2: Data Visualization for Insights - Business Requirement 1

      • As a client, I want to view data visualizations that illustrate the relationship between the target variable and its key features so that I can gain deeper insights from the data.
        • Acceptance Criteria:
          • The client should be able to view data visualizations that illustrate the relationship between the target variable and its key features.
        • Tasks:
          • Create a Streamlit page that shows data visualizations illustrating the relationship between the target variable and its key features.
        • How they were completed:
          • A correlation analysis Streamlit page was created that shows data visualizations illustrating the relationship between the target variable and its key features.
    • User Story 4.3: Correlation Analysis View - Business Requirement 1

      • As a client, I want to view a correlation analysis page on Streamlit so that I can understand the relationships between various features and the target variable.
        • Acceptance Criteria:
          • The correlation analysis page has to be accessible through the Streamlit sidebar.
          • The page should display visual representation between features and the target variable.
          • The page should allow the client to interact with the heatmap.
        • Tasks:
          • Create a correlation analysis page that is accessible through the Streamlit sidebar.
          • Create a heatmap or visual representation of the correlations between features and the target variable.
          • Create a page that allows the client to interact with the heatmaps.
        • How they were completed:
          • The correlation analysis page was created and made accessible through the Streamlit sidebar by adding its body function correlation_analysis_body() to app.py.
          • The visual representations were created with px.histogram for histograms, px.imshow for heatmaps, and px.scatter for scatter plots.
          • The heatmaps were plotted on the page with Plotly which has built-in interactivity.
    • User Story 4.4: Key Features for Sale Price Prediction - Business Requirement 1

      • As a client, I want to identify the key attributes of a house that have the strongest correlation with its potential sale price so that I can make data-driven pricing decisions.
        • Acceptance Criteria:
          • The client should be able to identify the key attributes of a house that have the strongest correlation with its potential sale price.
        • Tasks:
          • Perform Pearson and Spearman correlation analysis to find the relationship between different features and the sale price.
        • How they were completed:
          • The code df.corr(method="pearson") was used to calculate pearson correlation, and df.corr(method="spearman") to calculate spearman correlation on the DataFrame.
    • User Story 4.5: Interactive Prediction Input - Business Requirement 2

      • As a client, I want interactive input fields that allow me to enter custom data so that I can generate personalized predictions for the target variable.
        • Acceptance Criteria:
          • The input fields should allow the user to enter values for each feature or variable that influences the prediction.
          • Each input field must have validation to ensure the entered data is in the correct format.
        • Tasks:
          • Create input fields that allow the user to enter values for each feature or variable that influences the prediction.
          • Create input fields with validation to ensure the entered data is in the correct format.
        • How they were completed:
          • Streamlit widgets were created with st.number_input to allow the user to enter values for each feature or variable that influences the prediction.
          • Input widgets were given a defined min_value and max_value to ensure the input is within a realistic range.
    • User Story 4.6: Accurate Sale Price Prediction - Business Requirement 2

      • As a client, I want the most accurate possible prediction of the sale prices for the inherited properties so that I can maximize the financial returns from selling the four houses.
        • Acceptance Criteria:
          • The sale prices of the inherited properties must be accurately predicted.
        • Tasks:
          • Accurately predict the prices of the inherited properties.
        • How they were completed:
          • A machine learning regression model was used to accurately predict the price of the inherited properties.
    • User Story 4.7: Predictive Model Dashboard - Business Requirement 2

      • As a developer, I need to create a dashboard to effectively visualize and communicate the results of the model's predictions.
        • Acceptance Criteria:
          • A Streamlit dashboard must be created.
          • The dashboard must visualize and communicate the results of the model's predictions.
        • Tasks:
          • Create a Streamlit dashboard.
          • Create a dashboard that visualizes and communicates the results of the model's predictions.
        • How they were completed:
          • The Python Streamlit library was used to create the dashboard.
          • A dashboard was created that displays the model's predictions through DataFrames and a Sales Price calculator.
  • Dashboard Deployment and Release Epic

    • User Story 5.1: Early Deployment on Render - Business Requirement 1 & 2
      • As a developer, I want to initiate the deployment process of my application on Render at an early stage so that I can conduct end-to-end manual deployment testing from the outset.
        • Acceptance Criteria:
          • The application must be successfully deployed to Render.
          • Build and start commands must be correctly configured.
          • The environment variables must be configured correctly for deployment.
          • Deployment is automated with auto-deploy.
        • Tasks:
          • Deploy the application to Render.
          • Define the necessary build and start commands in Render settings.
          • Configure environment variables required for deployment.
          • Enable auto-deploy from the connected repository.
        • How they were completed:
          • A new Web Service was created on Render.
          • The build command was set to pip install -r requirements.txt && ./setup.sh and the start command to streamlit run app.py.
          • Environment variables were set to PORT Value: 8501 and PYTHON_VERSION Value: 3.12.1.
          • Auto-deploy settings were set to Yes.
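
To illustrate User Story 3.2 above, the Actual vs Prediction plots could be generated along the following lines. This is a minimal sketch assuming a fitted pipeline and existing X_train/X_test, y_train/y_test splits; it is not the project's exact plotting code.

```python
import matplotlib.pyplot as plt


def plot_actual_vs_prediction(y_actual, y_pred, title):
    """Scatter actual sale prices against model predictions."""
    fig, ax = plt.subplots(figsize=(6, 6))
    ax.scatter(y_actual, y_pred, alpha=0.4)
    # y = x reference line: points on it are perfectly predicted
    lims = [min(y_actual.min(), y_pred.min()), max(y_actual.max(), y_pred.max())]
    ax.plot(lims, lims, "r--", label="perfect prediction")
    ax.set_xlabel("Actual SalePrice")
    ax.set_ylabel("Predicted SalePrice")
    ax.set_title(title)
    ax.legend()
    plt.show()


plot_actual_vs_prediction(y_train, pipeline.predict(X_train), "Train set")
plot_actual_vs_prediction(y_test, pipeline.predict(X_test), "Test set")
```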

Hypotheses and How to Validate Them

  • First Hypothesis: The Relationship Between Property Size and Sale Price

    • Our first hypothesis posits that the size of a property has a direct and positive influence on its sale price. This assumption is grounded in the widely accepted notion that larger properties tend to offer more space and functionality, which in turn, makes them more attractive to potential buyers. The increased square footage of a property typically allows for additional rooms, larger living areas, and greater customization options, all of which are desirable attributes in a real estate market. Consequently, it is expected that properties with greater size will command higher sale prices due to their enhanced utility and appeal.
      • How to validate hypothesis: We will examine the relationship between house size attributes and the sale price to test this hypothesis (a code sketch follows this section).
      • Hypothesis Confirmation: Following a rigorous correlation analysis of the dataset, we observed a positive and moderate correlation between the size-related features of the properties and their sale prices. This finding validates our hypothesis, as it indicates that larger properties indeed tend to sell for higher prices. The data clearly supports the notion that, all other factors being equal, the size of a property plays a significant role in determining its market value, confirming our initial assumption.
  • Second Hypothesis: The Impact of Overall Quality on Sale Price

    • Our second hypothesis focuses on the role of a property's overall quality in influencing its sale price. We hypothesize that properties with higher quality ratings, which reflect superior materials, craftsmanship, and design, will be priced higher in the market. Buyers are likely to place a premium on well-constructed homes that offer longevity, comfort, and aesthetic appeal, which in turn boosts their market value. As such, homes with higher quality ratings should be more desirable and consequently demand higher prices.
      • How to validate hypothesis: We will examine the correlations between various attributes related to house quality assessment, such as 'OverallQual' and 'KitchenQual,' in order to validate the hypothesis.
      • Hypothesis Confirmation: After analyzing the data, we confirmed that there is a strong correlation between a property's overall quality rating and its sale price. Homes that received higher quality ratings were consistently priced higher in the market, reinforcing the idea that construction quality plays a pivotal role in determining a property’s value. This analysis supports our hypothesis that factors such as the quality of materials, craftsmanship, and overall design are crucial in shaping buyer perceptions and influencing the final sale price.
  • Third Hypothesis: The Influence of Property Condition on Market Value

    • For our third hypothesis, we investigate how a property's condition affects its sale price. We hypothesize that homes in excellent condition, particularly those that have undergone recent renovations or are newly built, will be more desirable to buyers and therefore will command higher sale prices. The condition of a property often reflects its upkeep and can signal to buyers the level of maintenance and care invested in the home. Properties in better condition are generally perceived as more move-in ready, which makes them more attractive to prospective buyers looking for immediate comfort without the need for costly repairs or improvements.
      • How to validate hypothesis: We will explore the data related to 'YearBuilt' and 'YearRemodAdd' to validate this hypothesis.
      • Hypothesis Confirmation: Our analysis supports this hypothesis by revealing a positive and moderate correlation between sale price and key factors such as the property's construction year and the year of its last remodel. The data suggests that newer homes and those with recent upgrades tend to sell at higher prices, highlighting the importance of property condition in the pricing process. The findings confirm that well-maintained homes or those with modern features are more likely to achieve higher sale prices, underscoring the influence of condition on market value.
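
The validation of all three hypotheses rests on correlation analysis against SalePrice. The sketch below shows one way these checks could be run, assuming df holds the housing data with numeric columns; the feature groupings are illustrative.

```python
# Feature groups per hypothesis (illustrative selections)
size_features = ["GrLivArea", "1stFlrSF", "TotalBsmtSF", "GarageArea"]  # Hypothesis 1
quality_features = ["OverallQual"]                                     # Hypothesis 2
condition_features = ["YearBuilt", "YearRemodAdd"]                     # Hypothesis 3

for name, features in [("size", size_features),
                       ("quality", quality_features),
                       ("condition", condition_features)]:
    cols = features + ["SalePrice"]
    # Pearson captures linear association, Spearman monotonic association
    pearson = df[cols].corr(method="pearson")["SalePrice"].drop("SalePrice")
    spearman = df[cols].corr(method="spearman")["SalePrice"].drop("SalePrice")
    print(f"{name} - Pearson:\n{pearson}\n")
    print(f"{name} - Spearman:\n{spearman}\n")
```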

Rationale to map the business requirements to the Data Visualisations and ML tasks

ML Business Case

In this section, we will outline the business case for the machine learning (ML) model, focusing on the project goals, requirements, and methodologies that align with the client’s needs. We will expand on key aspects such as business requirements, the feasibility of using traditional analysis, and the project’s inputs and outputs.

  • Business Requirements

    • BR1: The client wants to understand how various house attributes correlate with the sale price in Ames, Iowa. She expects data visualizations that illustrate the relationships between these variables and the sale price.

    • BR2: The client is looking to predict the sale price for her four inherited houses, as well as for any other property in Ames, Iowa.

  • Can Traditional Data Analysis Be Used?

    • Traditional data analysis methods could offer some insights, but they would have limitations:

      • Approximating Sale Prices: One approach would be for the client to manually draw inferences about the sale prices of the inherited houses by comparing them with houses of similar features in the dataset. While this might offer a rough estimate, the approach is inherently subjective and lacks precision. It’s also prone to human error and biases.

      • Subjectivity and Inaccuracy: Traditional methods, such as comparing houses in a simple spreadsheet or using basic statistical measures, may lead to inaccuracies due to the complexity of real estate pricing. Factors such as overall quality, above-ground living area, garage area, year built, and total basement area might not be fully accounted for, leading to imprecise conclusions.

      • Thus, using an ML model is a far more reliable and accurate method for predicting house prices based on multiple variables.

  • Does the Customer Need a Dashboard or API?

    • The client’s requirements lean toward having a dashboard for visualization and predictions:

      • Dashboard Needs: The client does not require an API at this point, as their focus is on visualizing the data and receiving predictions for house prices. The dashboard will provide an interactive way to explore the data, view the correlation of house attributes with sale prices, and input attributes for new houses to receive predicted prices in real-time.
      • User Interaction: A user-friendly dashboard allows the client to easily interact with the model and make predictions for various houses, ensuring that the solution is accessible to users without a technical background.
  • A Successful Project Outcome

    • Success for this project is defined by the following objectives:

      • Accurate Correlation Insights: The client will benefit from an analysis that highlights the most important variables affecting house sale prices. This insight is crucial for pricing strategy, allowing the client to better assess the value of their inherited properties.
      • Predictive Model Success: The client will consider the project a success if the machine learning model accurately predicts house sale prices based on the attributes provided, especially for the four inherited houses. The key is to help the client maximize the sale price for these properties by providing reliable predictions.
  • Ethical and Privacy Concerns

    • The dataset used for this analysis is public, meaning it has been made available by authorities for public use, and no personal or private information is involved.

      • No Privacy Issues: Since the data does not contain sensitive or personally identifiable information (PII), there are no ethical or privacy concerns associated with the use of this dataset.
      • Public Data Sources: As the dataset is openly available for anyone to access, the project operates transparently with no legal or ethical barriers.
  • EPICS and User Stories for Agile Implementation

    • The project is structured using the Agile methodology, with clear EPICS and user stories that break the work into manageable chunks. EPICS refer to the large bodies of work, and user stories outline specific tasks. See the Agile Methodology section for more details.
  • Does the Data Suggest a Particular Model?

    • Based on the nature of the task, where we are predicting a continuous numeric value (sale price), a regression model is most appropriate:

      • Regression for Continuous Output: Regression models are designed to predict continuous outcomes based on input features. For this case, the model will predict the sale price of a house based on its attributes.
  • Project Inputs and Intended Outputs

    • Model Inputs:

      • The model will take house attributes from the dataset. The features of these houses will be used to train the model and make predictions.
    • Model Outputs:

      • The output of the model will be the predicted sale price of the house, represented in USD as a continuous numeric value.
      • Additionally, the client will receive the sum of the predicted sale prices for all four inherited houses combined.
    • User Interaction:

      • The dashboard will allow users to input house attributes (OverallQual, GrLivArea, GarageArea, YearBuilt, TotalBsmtSF) through interactive widgets. In return, the dashboard will provide the user with an estimated sale price for any given house.
  • What Does Success Look Like?

    • Success is measured based on the following criteria:

      • R-squared Score (R²): A key performance indicator for this project is an R² score of at least 0.75 on both the training and test sets. The R² score measures how well the model's predictions align with actual values, with 1 being perfect and 0 indicating that the model explains none of the variance. A score of 0.75 or higher indicates that the model can reliably predict sale prices (a minimal evaluation sketch follows this list).
  • How Will the Client Benefit?

    • The primary benefit to the client is the ability to:

      • Maximize Sale Price: By using the model’s predictions, the client can optimize the pricing of their inherited properties. The insights from the model will help them determine the most competitive pricing strategy based on market conditions and property features.
      • Efficient Decision-Making: With accurate predictions, the client will make more informed decisions on how to price their houses and potentially increase their profitability. The interactive dashboard also empowers them with the tools to make these decisions in real-time.
      • This project’s outcome will significantly enhance the client’s ability to assess and act on the sale prices of inherited houses.
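
To make the model choice and success criterion concrete, the following minimal sketch trains the kind of regressor used in this project (an ExtraTreesRegressor, as described in the Machine Learning Model page section) and checks the R² target. It assumes a cleaned DataFrame df and the five key features identified later; hyperparameters are left at defaults for illustration.

```python
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

features = ["OverallQual", "GrLivArea", "GarageArea", "YearBuilt", "TotalBsmtSF"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["SalePrice"], test_size=0.2, random_state=0
)

model = ExtraTreesRegressor(random_state=0).fit(X_train, y_train)

# Success criterion: R² of at least 0.75 on both the train and test sets
for name, X, y in [("train", X_train, y_train), ("test", X_test, y_test)]:
    score = r2_score(y, model.predict(X))
    print(f"{name} R²: {score:.3f} -> {'meets' if score >= 0.75 else 'below'} target")
```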

Cross-industry standard process for data mining

This project applies the CRISP-DM (CRoss Industry Standard Process for Data Mining) methodology.

| Phase | Explanation |
|-------|-------------|
| Business Understanding | This phase focuses on understanding the project objectives and requirements from a business perspective. The goal is to define the problem, set objectives, and determine the data mining goals to achieve business success. |
| Data Understanding | In this phase, the focus is on collecting initial data and understanding its quality, content, and structure. It involves exploratory data analysis to uncover insights, patterns, and potential issues. |
| Data Preparation | This phase involves cleaning and transforming raw data into a suitable format for modeling. It includes tasks like dealing with missing data, outlier detection, and feature engineering. |
| Modeling | In this phase, various data mining techniques (such as classification, regression, clustering, etc.) are applied to the prepared data to create models. It is often an iterative process where models are trained, tested, and refined. |
| Evaluation | After the model has been built, this phase evaluates its performance based on predefined criteria. The model is assessed to ensure it meets business goals and objectives before it is deployed. |
| Deployment | The final phase focuses on implementing the data mining solution into the business environment. This includes integrating the model into production systems, delivering results, and monitoring its impact on business processes. |

Data Preprocessing

Data Cleaning Pipeline

A data cleaning pipeline was developed to handle missing values. Various imputation methods were applied based on the statistical properties of the variables.

  • Mean Imputation for Normally Distributed Continuous Variables

    • For continuous features such as LotFrontage and BedroomAbvGr, missing values were imputed using the mean. This approach is suitable for variables that follow an approximately normal distribution without significant outliers, as it maintains the overall data distribution without skewing the central tendency.
  • Median Imputation for Skewed Continuous Variables

    • Variables exhibiting right-skewed distributions, such as 2ndFlrSF and MasVnrArea, were imputed using the median. Since the median is less sensitive to extreme values, it provides a more robust imputation strategy for skewed data, preventing artificial distortion of the dataset.
  • Categorical Variable Imputation with 'None'

    • Categorical features like GarageFinish, BsmtFinType1, and BsmtExposure were missing primarily because these attributes did not apply to certain properties (e.g., a house without a basement). To preserve this structural information, missing values were imputed with "None" rather than the mode, ensuring that the absence of a feature is explicitly represented rather than inferred as a common category.
  • Feature Removal Due to High Missingness

    • Features such as EnclosedPorch, GarageYrBlt, and WoodDeckSF contained a substantial proportion of missing values. Rather than imputing them with limited available observations which could introduce bias, these features were removed from the dataset. Their exclusion was justified based on their potential lack of predictive power and the risk of introducing noise into the model.

For imputation rationale, refer to the detailed analysis in the following notebook: https://github.com/linobollansee/property-value-maximizer/blob/main/jupyter_notebooks/02%20-%20DataCleaning.ipynb
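
Expressed as code, the cleaning steps above could be combined into a single pipeline using the feature-engine transformers named in the user stories. This is a minimal sketch following the groupings in this section; the notebook's exact pipeline may differ.

```python
from feature_engine.imputation import CategoricalImputer, MeanMedianImputer
from feature_engine.selection import DropFeatures
from sklearn.pipeline import Pipeline

cleaning_pipeline = Pipeline([
    # Mean imputation for approximately normal continuous variables
    ("mean_impute", MeanMedianImputer(
        imputation_method="mean", variables=["LotFrontage", "BedroomAbvGr"])),
    # Median imputation for right-skewed continuous variables
    ("median_impute", MeanMedianImputer(
        imputation_method="median", variables=["2ndFlrSF", "MasVnrArea"])),
    # 'None' marks a structurally absent feature (e.g. no garage or basement)
    ("cat_impute", CategoricalImputer(
        imputation_method="missing", fill_value="None",
        variables=["GarageFinish", "BsmtFinType1", "BsmtExposure"])),
    # Drop features with too many missing values to impute reliably
    ("drop_sparse", DropFeatures(
        features_to_drop=["EnclosedPorch", "GarageYrBlt", "WoodDeckSF"])),
])

df_clean = cleaning_pipeline.fit_transform(df)
```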

Feature Engineering

Categorical encoding

Categorical encoding was applied to convert ordinal categories into numerical values, preserving both the order and hierarchy of the categories. This allowed the regression analysis to account for their relative rankings. However, during the data cleaning process, most ordinal categories were removed.

Numerical Transformations

| Feature | Assessment | Applied Transformation |
|---------|------------|------------------------|
| TotalBsmtSF | Mean imputation proved to be the most effective method for handling missing values. | MeanMedianImputer |
| GrLivArea | A logarithmic transformation was the best approach to achieve normalization. | LogTransformer |
| TotalBsmtSF | Power transformation yielded the most effective normalization. | PowerTransformer |
| TotalBsmtSF, GarageArea | Outliers were best handled using Winsorization with the IQR method. | Winsorizer |
| TotalBsmtSF, GrLivArea, GarageArea | Standard scaling provided the most effective way to normalize feature ranges. | StandardScaler |
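
A minimal sketch of how these transformations might be chained with feature-engine and scikit-learn follows; the step ordering and pipeline structure are assumptions rather than the project's exact pipeline.

```python
from feature_engine.imputation import MeanMedianImputer
from feature_engine.outliers import Winsorizer
from feature_engine.transformation import LogTransformer, PowerTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

feature_pipeline = Pipeline([
    ("impute", MeanMedianImputer(imputation_method="mean",
                                 variables=["TotalBsmtSF"])),
    ("log", LogTransformer(variables=["GrLivArea"])),
    ("power", PowerTransformer(variables=["TotalBsmtSF"])),
    # Cap outliers at 1.5 * IQR beyond the quartiles, on both tails
    ("winsorize", Winsorizer(capping_method="iqr", tail="both", fold=1.5,
                             variables=["TotalBsmtSF", "GarageArea"])),
    ("scale", StandardScaler()),  # scales all remaining numeric features
])
```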

Dashboard Design

Streamlit sidebar

  • Streamlit sidebar: A Streamlit sidebar is a UI component in Streamlit that allows you to place widgets, controls, and other elements in a collapsible side panel. It helps organize interactive elements separately from the main content, improving usability and layout. This sidebar supports:
    • Navigation – Allow users to switch between different pages or sections of an app. Pages available:
      • 👁️ Project Overview
      • 📈 Correlation Analysis
      • 🔮 Sale Price Prediction
      • 🔬 Hypothesis Validation
      • 🤖 Machine Learning Model

Business requirements covered:

  • BR1 - The client wants to understand how various house attributes correlate with the sale price in Ames, Iowa. She expects data visualizations that illustrate the relationships between these variables and the sale price.

  • BR2 - The client is looking to predict the sale price for her four inherited houses, as well as for any other property in Ames, Iowa.

[Screenshot: Streamlit sidebar]
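
The sidebar navigation follows the object-oriented multi-page pattern credited later in this README. Below is a minimal sketch of that pattern; apart from correlation_analysis_body(), which app.py is said to register, all names are assumptions.

```python
import streamlit as st


class MultiPage:
    """Collects page functions and renders the one chosen in the sidebar."""

    def __init__(self, app_name):
        self.pages = []
        self.app_name = app_name
        st.set_page_config(page_title=app_name)

    def add_page(self, title, func):
        self.pages.append({"title": title, "function": func})

    def run(self):
        # Sidebar radio acts as the page menu
        page = st.sidebar.radio(
            "Menu", self.pages, format_func=lambda p: p["title"])
        page["function"]()


# app = MultiPage("Property-Value-Maximizer")
# app.add_page("📈 Correlation Analysis", correlation_analysis_body)
# app.run()
```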

Project Overview page

  • Project Overview page
    • Project Overview: Describes the objective of maximizing property sale prices using a machine learning model.
    • Project Dataset: Provides details about the data source, size, and attributes used for analysis.
    • Business Requirements: Outlines the goals of analyzing house characteristics and developing a predictive pricing model.
    • Additional Information: Refers to further details available in a README file.

Business requirements covered:

  • BR1 - The client wants to understand how various house attributes correlate with the sale price in Ames, Iowa. She expects data visualizations that illustrate the relationships between these variables and the sale price.

  • BR2 - The client is looking to predict the sale price for her four inherited houses, as well as for any other property in Ames, Iowa.

[Screenshot: Project Overview page]

Correlation Analysis page

  • Correlation Analysis page
    • Correlation Analysis: Focuses on examining the relationship between different house features and sale prices. The goal is to use data visualizations to highlight how these features impact pricing.
    • Inspect house data from the dataset checkbox: If this box is checked, it loads the dataset as a DataFrame table. It can be downloaded as a CSV file, searched, or set to full screen.
    • The analysis investigated factors affecting house sale prices, aiming to identify key variables influencing pricing trends. The correlation analysis highlighted the following variables most closely linked to sale prices: '1stFlrSF', 'GarageArea', 'GarageYrBlt', 'GrLivArea', 'KitchenQual_Ex', 'KitchenQual_TA', 'OverallQual', 'TotalBsmtSF', 'YearBuilt', and 'YearRemodAdd'.
    • The analysis identified key factors affecting home values: larger homes with more features are more valuable, better condition and higher-quality materials increase value, and newly built or recently renovated homes tend to have higher market prices.
    • Data Visualizations: Interactive Plotly plots that can be downloaded as PNG, zoomed, panned, autoscaled, reset to their original axes, and viewed in fullscreen.
      • Distribution of target variable checkbox: If this box is checked, it loads an interactive Plotly histogram of the SalePrice.
      • Show Correlations and PPS Heatmaps: If this box is checked, it loads three interactive Plotly heatmaps: Spearman and Pearson correlation heatmaps, and a predictive power score (PPS) heatmap.
      • Variables Plots - Visual Analysis: If this box is checked, it loads eight interactive Plotly plots (1stFlrSF, GarageArea, GarageYrBlt, GrLivArea, OverallQual, TotalBsmtSF, YearBuilt, YearRemodAdd), each plotted individually against SalePrice.

Business requirements covered:

  • BR1 - The client wants to understand how various house attributes correlate with the sale price in Ames, Iowa. She expects data visualizations that illustrate the relationships between these variables and the sale price.

[Screenshot: Correlation Analysis page]

Sale Price Prediction page

  • Sale Price Prediction page
    • Inherited Houses (Filtered Data for Prediction) are displayed in a DataFrame table with the most important house features for price prediction.
    • The Predicted Sale Prices for Inherited Houses are displayed in a DataFrame table with the most important house features for price prediction.
    • A Total Predicted Sale Price for All Inherited Houses is displayed: 💲632,680.27, formatted with commas as thousands separators and rounded to two decimal places to represent cents.
    • Predict Sales Price for Your Own House can be set with 5 number input widgets and calculated with a Calculate House price widget button.
    • A "SUCCESSFULLY CALCULATED" message appears along with the Price for Your Own House, which depends on the input values.

Business requirements covered:

  • BR2 - The client is looking to predict the sale price for her four inherited houses, as well as for any other property in Ames, Iowa.

[Screenshot: Sale Price Prediction page]

Hypothesis Validation page

  • Hypothesis Validation page

    • This page outlines and validates three hypotheses related to property sale prices using st.success.
      • Larger properties sell for higher prices – confirmed.
      • Higher property quality leads to higher sale prices – confirmed.
      • Better property condition (newer or renovated) results in higher sale prices – confirmed.
  • [Screenshot: Hypothesis Validation page]

Business requirements covered:

  • BR1 - The client wants to understand how various house attributes correlate with the sale price in Ames, Iowa. She expects data visualizations that illustrate the relationships between these variables and the sale price.

Machine Learning Model page

  • Machine Learning Model page
    • A model was developed and optimized to predict property sales prices with a focus on achieving a specific level of accuracy.
    • The best model's pipeline performance was evaluated on both the train and test sets, showing strong results.
    • The pipeline steps, key features, feature importance, performance, and regression results are presented below.
    • The pipeline consists of several steps, including imputation, transformation, scaling, and modeling, with the final model being an ExtraTreesRegressor.
    • The model was trained using the following features, with their importance ranking as indicated: OverallQual, GrLivArea, GarageArea, YearBuilt, TotalBsmtSF.
    • Feature importance is plotted.
    • The pipeline successfully met the performance goal of an R² score of at least 0.75 for both the train and test sets. The model evaluation on the train set shows strong performance across several metrics.
    • Mean Absolute Error (MAE): Measures the average absolute difference between predicted and actual values. Lower is better.
    • R² Score: Indicates how well the model explains the variance in the data. Higher (closer to 1) is better.
    • Mean Squared Error (MSE): Measures the average squared difference between predictions and actual values. Lower is better, but it penalizes larger errors more heavily.
    • Root Mean Squared Error (RMSE): The square root of MSE, giving error in the same unit as the target variable. Lower is better.
    • Mean Absolute Percentage Error (MAPE): Measures prediction accuracy as a percentage of the actual values. Lower is better, but problematic with small values.
    • Explained Variance Score (EVS): Measures how much of the variance in the data is explained by the model. Closer to 1 is better.
    • Median Absolute Error (MedAE): The median of absolute errors. Robust to outliers, and smaller values are better.
    • The regression performance plots show that the model effectively predicts sale prices, although its reliability decreases for higher-priced houses.
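
All of the metrics above are available in scikit-learn (pinned at 1.3.1 in this project). A minimal sketch, assuming actual values y and model predictions y_pred:

```python
from sklearn.metrics import (
    explained_variance_score,
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    median_absolute_error,
    r2_score,
)

print("MAE:  ", mean_absolute_error(y, y_pred))
print("R²:   ", r2_score(y, y_pred))
print("MSE:  ", mean_squared_error(y, y_pred))
print("RMSE: ", mean_squared_error(y, y_pred, squared=False))  # error in USD
print("MAPE: ", mean_absolute_percentage_error(y, y_pred))
print("EVS:  ", explained_variance_score(y, y_pred))
print("MedAE:", median_absolute_error(y, y_pred))
```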

Business requirements covered:

  • BR2 - The client is looking to predict the sale price for her four inherited houses, as well as for any other property in Ames, Iowa.

  • [Screenshot: Machine Learning Model page]

Plots

Distribution and Correlation Plots

These plots give insight into the distribution and relationships within the data.

Histogram of Sale Price to visualize its distribution.

Heatmap showing Pearson correlation between variables.

Heatmap showing Spearman correlation between variables.

Heatmap illustrating Predictive Power Scores (PPS).

Box Plots

These box plots illustrate how various features impact house prices.

Box Plot of Price by Kitchen Quality shows price variation across kitchen quality levels.

Box Plot of Price by Overall Quality visualizes price differences by overall house quality.

Line Plots

These line plots depict trends in house prices over time and with building age.

Line Plot of Price by Year Built reveals trends in house prices over the years.

Line Plot of Price by Year Remodeled shows how price changes with remodeling.

Linear Model Plots

These plots show the relationships between house prices and specific features, based on linear regression.

Linear Model of Price by 1st Floor Area demonstrates the impact of floor space on price.

Linear Model of Price by Garage Area illustrates the influence of garage size on house prices.

Linear Model of Price by GrLiv Area visualizes the effect of above-grade (ground) living area on prices.

Linear Model of Price by MasVnr Area shows the relationship between masonry veneer area and price.

Linear Model of Price by Open Porch Area illustrates how open porch space affects price.

Linear Model of Price by Total Basement Area highlights the impact of basement area on price.

Performance and Feature Importance

These plots help evaluate model performance and highlight the most important features.

Feature Importance Plot identifies the most influential features for house price prediction.

Regression Performance Plot shows the effectiveness of the regression model.

Bugs and Fixes

  • The ppscore library caused the following error:

ModuleNotFoundError: No module named 'pkg_resources'

[Screenshot: ModuleNotFoundError for pkg_resources]

This bug was fixed by adding setuptools==75.8.0 to requirements.txt, as it is now necessary to have setuptools installed to use ppscore with current Python versions.

Many are unaware of this very simple solution that took me weeks to find, and instead revert to older Python versions, often facing new package dependency conflicts, package availability problems, incompatible pickle files, and other deployment issues.

  • During the hyperparameter optimization search, an extreme number of FutureWarnings cluttered the output cell and made the Jupyter Notebook unmanageable: [Screenshot: FutureWarning clutter]

I added this code to prevent the clutter:

import os
import warnings
# Silence warnings raised in this process (e.g. the FutureWarning flood)
warnings.filterwarnings("ignore")
# Silence warnings in subprocesses spawned during the search as well
os.environ["PYTHONWARNINGS"] = "ignore"

When my Render deployment stopped working, I discovered I had exceeded my 500 Free Pipeline Minutes. I bought 1000 extra minutes to make sure deployment would continue functioning as normal.

Project Testing

User Story Testing

| User Story | Action | Expected Result | Result |
|------------|--------|-----------------|--------|
| 1.1 | Run pip install -r requirements.txt and verify with pip list | Dependencies are installed and listed correctly | Successful |
| 1.2 | Run pd.read_csv('data.csv') in Jupyter | Dataset loads without errors | Successful |
| 2.1 | Run df.isnull().sum() and ProfileReport(df) | Missing values identified, profile report generated | Successful |
| 3.1 | Train model and evaluate using r2_score(y_test, y_pred) | R² score ≥ 0.75 for train/test sets | Successful |
| 3.2 | Generate plots with sns.scatterplot(x=y_train, y=pred_train) | Actual vs Prediction plots displayed | Successful |
| 4.1 | Open Streamlit app and navigate to home page | Project overview is displayed | Successful |
| 4.2 | View visualizations in Streamlit | Data insights are correctly visualized | Successful |
| 4.3 | Open correlation analysis page in Streamlit | Interactive heatmap is displayed | Successful |
| 4.4 | Run df.corr(method='pearson') and df.corr(method='spearman') | Key features for sale price identified | Successful |
| 4.5 | Enter data in Streamlit input fields | Inputs are validated and accepted | Successful |
| 4.6 | Predict house prices using model | Accurate predictions are displayed | Successful |
| 4.7 | Open predictive model dashboard in Streamlit | Model results are visualized correctly | Successful |
| 5.1 | Deploy app on Render and check live URL | Application deploys and runs correctly | Successful |

Widget Testing

| Test | Action | Expected Result | Result |
|------|--------|-----------------|--------|
| Widget Range Validation | Input values 1-10 in OverallQual widget and attempt invalid input | Valid inputs accepted, invalid inputs trigger warning | Successful |
| Widget Input Methods | Modify widget values using +/- buttons or manual entry | Values update correctly for both methods | Successful |
| Prediction Accuracy Validation | Input inherited house values and compare predictions | Widget predictions match regression model output | Successful |

PEP 8

All Python project files underwent thorough testing using the CI Python Linter, available at https://pep8ci.herokuapp.com/. This tool was used to ensure that all code adheres to PEP 8 standards, maintaining consistency, readability, and best practices across the project. The automated linting process helped identify and rectify formatting issues, ensuring that the codebase meets high-quality standards.

Files checked:

app.py
app_pages/correlation_analysis.py
app_pages/ml_price_prediction.py
app_pages/multipage.py
app_pages/page_summary.py
app_pages/project_hypothesis.py
app_pages/sales_price_prediction.py
src/data_management.py
src/machine_learning/evaluate_reg.py
src/machine_learning/predictive_analysis_functions.py

Deployment

  1. Log in to Render.com using Github.
  2. Click on the New button, select Web Service.
  3. At Source Code, select Git Provider. Select your repository name. Click Connect.
  4. Enter a unique name for your web service.
  5. Select Python 3 as the language.
  6. Select the main branch.
  7. Select the Frankfurt (EU Central) Region.
  8. Set the Build Command: pip install -r requirements.txt && ./setup.sh
  9. Set the Start Command: streamlit run app.py
  10. Set Instance Type: Free
  11. Set the Environment Variables: Key: PORT Value: 8501 and Key: PYTHON_VERSION Value: 3.12.1
  12. Click Deploy Web Service

Technologies

  • GitHub: The project's source code is hosted on GitHub at https://github.com/.
  • GitHub Codespaces: The cloud-based integrated development environment (IDE) GitHub Codespaces at https://github.com/ was used for code editing, running Visual Studio Code (Version 1.96.3).
  • Render: The web application is deployed on Render at https://render.com/.
  • CI Python Linter: Code formatting and adherence to PEP8 standards were ensured using the CI Python Linter at https://pep8ci.herokuapp.com/
  • GoFullPage: A browser extension available at https://gofullpage.com/ for capturing full-page screenshots in Google Chrome and Microsoft Edge. It allows users to take scrolling screenshots of entire webpages without needing to stitch multiple images together manually.

Python Packages

  • Data Processing & Feature Engineering

    • feature-engine==1.6.1: A library for feature engineering in machine learning pipelines, offering transformations like encoding, imputation, and scaling.
    • pandas==2.1.1: A fundamental library for data manipulation and analysis using DataFrames and Series.
    • numpy==1.26.1: Provides support for numerical operations, arrays, and mathematical functions.
  • Data Visualization

    • matplotlib==3.8.0: A widely used library for static, animated, and interactive visualizations.
    • seaborn==0.13.2: Built on top of Matplotlib, it simplifies statistical data visualization.
    • plotly==5.17.0: Enables interactive plots, dashboards, and web-based visualizations.
  • Machine Learning & Model Evaluation

    • joblib==1.4.2: Enables efficient model serialization, parallel computing, and caching for machine learning workflows.
    • scikit-learn==1.3.1: A popular ML library offering tools for classification, regression, clustering, and preprocessing.
    • xgboost==1.7.6: An optimized gradient boosting framework widely used for structured data ML tasks.
  • Data Profiling & Exploratory Analysis

    • ppscore==1.1.0: Calculates predictive power scores to determine relationships between variables.
    • ydata-profiling==4.12.0: Generates detailed EDA reports, summarizing data characteristics, correlations, and missing values.
  • Web Applications & Image Processing

    • streamlit==1.40.2: A framework for building interactive ML and data science web apps with minimal code.
  • Others

    • kaggle==1.5.12: A library for accessing and managing Kaggle datasets via the Kaggle API.
    • setuptools==75.8.0: A package development and distribution tool, ensuring dependencies are managed properly.
    • imbalanced-learn==0.11.0: A library for handling imbalanced datasets by providing various resampling techniques.

Credits

Code

A significant portion of the code used in this project was sourced directly from the Code Institute. This includes:

  • Setup and Data Collection
    • Code to change working directory.
    • Code to create directories.
    • Code to download data from Kaggle.
    • Code to extract zip files.
    • Code to import CSV files.
  • Exploratory Data Analysis (EDA) and Data Cleaning
    • Code to display DataFrame (df) summaries.
    • Code to count null values.
    • Code to count duplicates.
    • Code to drop variables from a DataFrame (df).
    • Code to subset columns or rows.
    • Code to generate an EDA report.
    • Code to visualize data cleaning effect.
    • Code to plot numerical and categorical variables.
    • Code to generate a heatmap.
    • Code to generate a histogram.
  • Data Preprocessing
    • Code to apply mean imputation.
    • Code to apply median imputation.
    • Code to apply categorical imputation.
    • Code to OneHotEncode.
    • Code to apply ordinal encoding on categorical variables.
    • Code to apply a winsoriser transformation.
    • Code to apply a power transformation.
    • Code to apply a log transformation.
    • Code to apply feature scaling using standardization.
    • Code to check for feature engineering for numerical and categorical variables.
    • Code to identify highly correlated features.
    • Code to calculate correlation coefficients.
  • Data Splitting and Feature Selection
    • Code to split train and test set.
    • Code to identify the most important features by the best regression model.
    • Code to extract the best regressor from search.
    • Code to extract the best hyperparameter.
    • Code to check the best model.
  • Modeling and Hyperparameter Tuning
    • Code to perform hyperparameter optimization.
    • Code to summarize the results of the grid searches.
    • Code to fit a machine learning pipeline.
  • Model Evaluation and Saving
    • Code to evaluate regression performance on train set and test set.
    • Code to save a machine learning model to a pickle file.
  • Dependencies
    • Code to load requirements.txt dependencies.
  • Jupyter Notebooks
    • Code of an ipynb template file.
  • Streamlit
    • Code to generate streamlit pages using an object-oriented approach.
  • README
    • Template code in markdown.

The Code Institute code is available here:

The structure and flow of the code in this project's Jupyter Notebooks, Streamlit application, and README file were initially inspired by Werner Stäblein's repository at https://github.com/Werner-Staeblein/Project-5. However, numerous enhancements and new features have been incorporated to differentiate my work.

Media

  • The Unicode icons used in this project were generated with the assistance of ChatGPT, an AI language model developed by OpenAI. These icons were selected and formatted based on UX to enhance clarity and visual communication. ChatGPT is available at: https://chatgpt.com/

  • The responsive-view image at the top of the README.md was created using: https://ui.dev/amiresponsive

Content

  • ChatGPT was frequently used to enhance text content and minimize errors in the Jupyter Notebooks, Streamlit Dashboard, and README.md file, but it was used responsibly due to its potential for mistakes caused by its own biases in training data, misinterpretation of context, and reasoning limitations.

Acknowledgements

  • I would like to acknowledge my mentor, Mo Shami, for his support throughout the project. His suggestion to explore the repositories of students doing the same project and run these repositories locally with the streamlit run app.py terminal command when the Render or Heroku deployments were unavailable was especially helpful.

  • I also would like to acknowledge Code Institute tutors Niel McEwen and Roman Rakic, for showing me how to deploy to Render.com, through the guide available at: https://code-institute-students.github.io/deployment-docs/42-pp5-pa/

  • Roman Rakic assisted me on another occasion where one of my plots became unresponsive. I had inadvertently assigned a continuous variable with too many unique values as the hue, causing the plot to hang and preventing the retrieval of any debugging information. Roman Rakic helped me identify the issue and resolve it, which better prepared me for working on this project.

  • The entire Code Institute Slack Community for its wealth of information, in particular the project-portfolio-5-predictive-analytics channel.
