Medical Cost Prediction

Overview

Problem Statement:
Medical insurance costs can vary widely based on factors like age, lifestyle, and pre-existing conditions. Accurately predicting these costs helps both insurance companies and individuals make informed financial decisions. This project builds a regression model that estimates yearly medical insurance costs from personal attributes: age, sex, BMI, number of children, smoking status, and geographic location.

Use Cases:
For insurance providers, an accurate prediction model helps in setting premiums that reflect an individual’s health risk, leading to fairer pricing and better risk management. For policyholders, understanding factors that drive their insurance costs can guide them in making healthier lifestyle choices to reduce future premiums. This project’s model could serve as a foundational tool for personalized pricing strategies, helping in financial planning and fostering more transparent relationships between insurers and clients.

Data Source

The dataset used in this project is the Medical Cost Personal Dataset, publicly available on Kaggle and in the Data folder of this project. It contains records of 1,338 individuals and captures key personal and demographic information: age, sex, body mass index (BMI), number of dependents, smoking status, and region of residence in the U.S. The target variable, charges, is the medical insurance cost incurred by each individual in a year. Below is a snippet of the dataset:

Dataset Sample

Age	Sex	BMI	Children	Smoker	Region	Charges
19	Female	27.9	0	Yes	Southwest	16,882.67
18	Male	33.8	1	No	Southeast	1,721.35
28	Male	33.0	3	No	Southeast	4,444.43
33	Male	22.7	0	No	Northwest	3,219.86
32	Female	28.9	0	Yes	Southeast	39,323.77
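As a rough illustration of the schema, the sample rows above can be parsed into typed records with the standard library alone. This is only a sketch: the lowercase field names and the lowercasing of categorical values are assumptions chosen to match the request format in the Example section, not something the dataset or notebook.ipynb is confirmed to do.

```python
import csv
import io

# The five sample rows shown above, as tab-separated text.
SAMPLE = """Age\tSex\tBMI\tChildren\tSmoker\tRegion\tCharges
19\tFemale\t27.9\t0\tYes\tSouthwest\t16,882.67
18\tMale\t33.8\t1\tNo\tSoutheast\t1,721.35
28\tMale\t33.0\t3\tNo\tSoutheast\t4,444.43
33\tMale\t22.7\t0\tNo\tNorthwest\t3,219.86
32\tFemale\t28.9\t0\tYes\tSoutheast\t39,323.77
"""


def parse_row(row: dict) -> dict:
    """Convert one raw row of strings into typed values."""
    return {
        "age": int(row["Age"]),
        "sex": row["Sex"].lower(),
        "bmi": float(row["BMI"]),
        "children": int(row["Children"]),
        "smoker": row["Smoker"].lower(),
        "region": row["Region"].lower(),
        # Strip the thousands separator before converting to a number.
        "charges": float(row["Charges"].replace(",", "")),
    }


reader = csv.DictReader(io.StringIO(SAMPLE), delimiter="\t")
records = [parse_row(row) for row in reader]

# Count empty fields as a very small missing-value check.
missing = sum(value in ("", None) for rec in records for value in rec.values())
print(len(records), "records,", missing, "missing values")
```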

Exploratory Data Analysis

The next step in the machine learning pipeline is data understanding through Exploratory Data Analysis (EDA). This includes checking for and handling missing values, understanding data distributions and correlations to make informed transformations, and selecting a subset of the available features if necessary. The EDA for this project is available in the Jupyter notebook notebook.ipynb. Summary from EDA:

The dataset is clean and has no missing values.
The target variable (charges) is right-skewed with a long tail, so it is log-transformed before modeling.
Feature selection is not beneficial for this task.
The age variable is the top predictive variable.
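To see why a long-tailed target benefits from a transformation, the sketch below applies a log transform to the five charges from the dataset sample and then inverts it. It assumes log1p/expm1 as the transform pair; the notebook may use a plain logarithm instead, so treat the exact function as an assumption.

```python
import math

# Charges taken from the five sample rows in the Data Source section.
charges = [16882.67, 1721.35, 4444.43, 3219.86, 39323.77]

# Forward transform: compress the long right tail.
# Assumption: log1p is used; a plain log would behave almost identically here.
log_charges = [math.log1p(c) for c in charges]

# Inverse transform: map model outputs back to the original currency scale.
restored = [math.expm1(v) for v in log_charges]

# Ratio between the largest and smallest value, before and after.
spread_raw = max(charges) / min(charges)
spread_log = max(log_charges) / min(log_charges)

print(f"raw spread: {spread_raw:.1f}x, log spread: {spread_log:.2f}x")
```

Whatever transform is applied during training has to be inverted at prediction time, otherwise the service would return values on the log scale.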

Model Selection

For this project, eight regression algorithms were compared before selecting a final one. They are:

Linear Regression
Ridge Regression
Lasso Regression
Elastic Net
Decision Tree Regression
Random Forest Regression
XGBoost
K-Nearest Neighbors Regression

Based on the Mean Squared Error and R2 scores, Random Forest Regression was the best-performing algorithm and was selected to build the final model. The code for the model selection is available in notebook.ipynb.
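A comparison of this kind can be sketched with scikit-learn as follows. This is not the code from notebook.ipynb: the data is synthetic, only three of the eight algorithms are included, and the use of 5-fold cross-validation is an assumption about how the scores might be obtained.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (NOT the real dataset): 200 rows, 3 numeric features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=200)

# A subset of the candidate algorithms; hyperparameters are illustrative.
candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(
        n_estimators=50, max_depth=5, random_state=0
    ),
}


def compare(models, X, y, cv=5):
    """Return mean cross-validated MSE and R2 for each model."""
    results = {}
    for name, model in models.items():
        scores = cross_validate(
            model, X, y, cv=cv, scoring=("neg_mean_squared_error", "r2")
        )
        results[name] = {
            # scikit-learn reports negated MSE so that higher is better.
            "mse": float(-scores["test_neg_mean_squared_error"].mean()),
            "r2": float(scores["test_r2"].mean()),
        }
    return results


results = compare(candidates, X, y)
for name, metrics in sorted(results.items(), key=lambda kv: kv[1]["mse"]):
    print(f"{name:>14}: MSE={metrics['mse']:.4f}  R2={metrics['r2']:.4f}")
```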

Model Information

Algorithm: Random Forest Regressor
Features: age, sex, bmi, children, smoker status, region
Target Variable: charges (log-transformed)
Model Performance: 0.3535 Root Mean Squared Error (on the log-transformed target)
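Because the target is log-transformed, the reported RMSE is most naturally read on the log scale, which is an assumption here since the README does not state the scale. Under that reading, an error of 0.3535 corresponds to predictions that are typically off by a multiplicative factor of roughly 1.42 rather than by a fixed amount. The sketch below shows the RMSE formula and that conversion; the rmse helper is illustrative and not taken from the project code.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Value reported in Model Information; assumed to be on the log scale.
reported_rmse = 0.3535

# An additive error in log space is a multiplicative error in the original space.
typical_factor = math.exp(reported_rmse)

print(f"typical multiplicative error: about {typical_factor:.2f}x")
```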

Technical Architecture

App Framework: Flask
ML Framework: Scikit-learn
Model: Random Forest Regressor (max_depth=5, n_estimators=400)
Deployment: AWS Elastic Beanstalk with Docker
API Protocol: REST
Input/Output: JSON
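The pieces above fit together roughly as in the following sketch of a Flask endpoint that accepts JSON and returns JSON. It is not the project's predict.py: the predict_log_charges function is a hypothetical stand-in that returns a constant instead of loading the trained model, the predicted_charges response key is made up for illustration, and the use of expm1 to undo the log transform is an assumption.

```python
import math

from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_log_charges(applicant: dict) -> float:
    """Hypothetical stand-in for the trained Random Forest.

    A real service would load the fitted model (for example from
    model_5_400.bin) and call it here. This stub returns a constant.
    """
    return math.log1p(10000.0)


@app.route("/predict", methods=["POST"])
def predict():
    applicant = request.get_json()
    log_charges = predict_log_charges(applicant)
    # Assumption: the target was transformed with log1p, so invert with expm1.
    charges = math.expm1(log_charges)
    return jsonify({"predicted_charges": round(charges, 2)})


if __name__ == "__main__":
    # Development server only; the README uses gunicorn for real runs.
    app.run(host="0.0.0.0", port=8080)
```

In production the same app object would be served by gunicorn, as in the gunicorn command in the Local deployment steps.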

Dependencies

Python 3.11
Flask
scikit-learn
numpy
pandas
gunicorn (for production deployment)

Deployment

Local

Clone the repository:

git clone https://github.com/F-U-Njoku/medical-cost-prediction.git
cd medical-cost-prediction

Install dependencies using Pipenv:

pip install pipenv
pipenv install

Run the application:

pipenv run gunicorn --bind 0.0.0.0:8080 predict:app

Use the application:

pipenv run python test.py

Docker

Build the Docker image:

docker build -t med-cost-predictor .

Run the container:

docker run -p 8080:8080 med-cost-predictor

Use the application:

pipenv run python test.py

Cloud

So far, the application has been run locally. It has also been deployed to the cloud with AWS Elastic Beanstalk.

In the test.py file, switch the request URL from the local host to the cloud host by commenting out the first line and using the second:

# url = f'http://{local}/predict'
url = f'http://{cloud}/predict'

Use the application:

pipenv run python test.py

Example

Request:

applicant = {
    "age": 35,
    "sex": "male",
    "bmi": 26.5,
    "children": 2,
    "smoker": "no",
    "region": "northeast"
}

Output:

The predicted medical cost is 12345.67 yearly.

Change details of the applicant in the test.py file and run it to get new predictions.
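For readers who want to call the service without test.py, a client might look roughly like the sketch below. It uses only the standard library and is not the project's test.py: the build_request helper is hypothetical, and the localhost:8080 host is taken from the local deployment steps.

```python
import json
import urllib.request


def build_request(host: str, applicant: dict) -> urllib.request.Request:
    """Build a JSON POST request for the /predict endpoint."""
    url = f"http://{host}/predict"
    body = json.dumps(applicant).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


applicant = {
    "age": 35,
    "sex": "male",
    "bmi": 26.5,
    "children": 2,
    "smoker": "no",
    "region": "northeast",
}

req = build_request("localhost:8080", applicant)
print(req.get_method(), req.full_url)

# To actually send it, the service must be running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```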

Contributing

I welcome contributions to the Medical Cost Prediction project! Here's how you can help:

Fork the repository
Create a new branch (git checkout -b feature/AmazingFeature)
Make your changes
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

Please ensure your code adheres to the project's coding standards and includes appropriate tests.

Acknowledgements

Special thanks to DataTalks.Club for providing a practical and free course on Machine Learning. Gratitude to Alexey and the entire team for their efforts.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

LinkedIn: Uchechukwu Njoku
