This project demonstrates Multiple Linear Regression using the popular California Housing dataset from `sklearn.datasets`. It explores feature relationships, evaluates model performance using multiple metrics, and finally prepares the model for deployment by pickling it.

Multiple Linear Regression helps us model the relationship between one dependent variable (target) and multiple independent variables (features). In this project, we aim to predict housing prices based on various features from the California Housing dataset.
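
For reference, the model can be written out in symbols. The notation below ($\beta_j$ for coefficients, $p$ for the number of features, $n$ for the number of training samples) is the usual textbook form rather than anything taken from the notebook; for this dataset $p = 8$.

$$
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p,
\qquad
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
$$

The second expression is the ordinary least squares criterion, which is what scikit-learn's `LinearRegression` minimizes.
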
- Source: `sklearn.datasets.fetch_california_housing`
- Samples: 20,000+
- Features: 8 numerical features
- Target: `Price` (Median House Value)

1. Load Dataset & Create DataFrame
   - Loaded using `fetch_california_housing()`
   - Converted to a pandas DataFrame
2. Exploratory Data Analysis
   - Used `seaborn.pairplot()` to visualize relationships
   - Created a heatmap to observe feature correlations
3. Data Preparation
   - Split into train/test sets using `train_test_split`
   - Standardized features using `StandardScaler`
4. Model Building
   - Trained a Multiple Linear Regression model using `LinearRegression` from scikit-learn
5. Model Evaluation
   - Evaluated with:
     - Mean Squared Error (MSE)
     - Mean Absolute Error (MAE)
     - Root Mean Squared Error (RMSE)
     - R² Score
     - Adjusted R² Score
6. Assumptions & Residual Analysis
   - Plotted residuals using:
     - `seaborn.distplot()` to check normality (this function is deprecated since seaborn 0.11; `histplot()` or `displot()` are the current replacements)
     - Scatter plot of residuals vs. predictions to check homoscedasticity
   - Found that performance was not optimal, leaving room to improve model accuracy
7. Model Deployment Prep
   - Exported the trained model using pickling (`pickle.dump`)
   - Discussed its usage in cloud-based inference pipelines
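
The loading, preparation, and model-building steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's exact code: the helper names `load_housing_frame` and `fit_scaled_regression` are hypothetical, and the `test_size=0.3` and `random_state=42` values are assumptions, since the split settings are not stated here.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def load_housing_frame():
    """Fetch the dataset (downloaded on first use) and return it as a DataFrame."""
    data = fetch_california_housing()
    df = pd.DataFrame(data.data, columns=data.feature_names)
    df["Price"] = data.target  # median house value, in units of $100,000
    return df


def fit_scaled_regression(X, y, test_size=0.3, random_state=42):
    """Split, standardize, and fit. test_size and random_state are assumed values."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on training data only
    X_test_scaled = scaler.transform(X_test)  # reuse the training statistics
    model = LinearRegression().fit(X_train_scaled, y_train)
    # Residuals on the held-out set feed the normality and homoscedasticity plots.
    residuals = np.asarray(y_test) - model.predict(X_test_scaled)
    return model, scaler, residuals
```

With the frame from `load_housing_frame()`, a call such as `fit_scaled_regression(df.drop(columns="Price"), df["Price"])` returns the fitted model, the fitted scaler, and the test-set residuals. Fitting the scaler on the training split only keeps test-set statistics from leaking into training.
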

| Library | Purpose |
|---|---|
| pandas | Data handling |
| numpy | Numerical computation |
| seaborn | Visualization (pairplot, heatmap) |
| matplotlib | Plotting |
| sklearn | Dataset loading, ML models, metrics |
| pickle | Model serialization |

- MSE: Mean Squared Error
- MAE: Mean Absolute Error
- RMSE: Root Mean Squared Error
- R² Score: Goodness of fit
- Adjusted R²: R² adjusted for the number of features
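
As a rough sketch of how these five metrics relate, the helper below computes them from true and predicted values. The function name `regression_report` is hypothetical. scikit-learn has no built-in adjusted R², so it is derived here from the standard formula, with `n_features` being the number of predictors (8 for this dataset).

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred, n_features):
    """Return MSE, MAE, RMSE, R² and adjusted R² as a dict."""
    n_samples = len(y_true)
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    # Adjusted R² penalizes R² for the number of predictors p:
    # 1 - (1 - R²) * (n - 1) / (n - p - 1)
    adjusted_r2 = 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)
    return {
        "mse": mse,
        "mae": mean_absolute_error(y_true, y_pred),
        "rmse": float(np.sqrt(mse)),  # same units as the target
        "r2": r2,
        "adjusted_r2": adjusted_r2,
    }
```

RMSE is taken as the square root of MSE rather than through a keyword argument, which keeps the sketch independent of the scikit-learn version.
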

| File Name | Description |
|---|---|
| Multiple_Linear_Regression.ipynb | Full model implementation and evaluation |
| README.md | Project documentation (this file) |
| model.pkl | Serialized (pickled) trained model |

1. Clone the repository:

   ```bash
   git clone https://github.com/YourUsername/Multiple-Linear-Regression-California.git
   cd Multiple-Linear-Regression-California
   ```

2. Install the required libraries (add `notebook` to the list if Jupyter is not already installed):

   ```bash
   pip install pandas numpy matplotlib seaborn scikit-learn
   ```

3. Launch Jupyter Notebook:

   ```bash
   jupyter notebook
   ```

4. Open `Multiple_Linear_Regression.ipynb` and run the cells in order.

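To deploy this model on the cloud:

1. Load the `model.pkl` file in your API/backend.
2. Serve it with a library such as Flask or FastAPI, or a cloud service such as AWS Lambda or Azure Functions.
3. Standardize incoming input data exactly as was done before training.
4. Perform prediction using:

   ```python
   import pickle

   # Load the serialized model once at startup
   with open("model.pkl", "rb") as f:
       model = pickle.load(f)

   # new_scaled_data: incoming features, already standardized with the training scaler
   prediction = model.predict(new_scaled_data)
   ```
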
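
The standardization step above only works if the serving code has the same fitted `StandardScaler` that was used in training. One way to guarantee that, sketched below, is to pickle the scaler together with the model. This is an assumption about how deployment could be arranged, not a description of the `model.pkl` in this repository; the helper names and the bundle layout (a dict with `"scaler"` and `"model"` keys) are hypothetical.

```python
import pickle

import numpy as np


def save_bundle(model, scaler, path):
    """Pickle the fitted scaler and model together so they cannot drift apart."""
    with open(path, "wb") as f:
        pickle.dump({"scaler": scaler, "model": model}, f)


def predict_from_bundle(path, raw_features):
    """Load the bundle, standardize raw inputs with the training scaler, and predict."""
    with open(path, "rb") as f:
        bundle = pickle.load(f)
    # raw_features: 2-D array-like, one row per sample, columns in training order
    scaled = bundle["scaler"].transform(np.asarray(raw_features, dtype=float))
    return bundle["model"].predict(scaled)
```

In a real service the bundle would normally be loaded once at startup rather than on every request. Pickle files can execute arbitrary code when loaded, so only unpickle files from a trusted source.
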
Maitri Prabhu
GitHub: Mai3Prabhu