
Predicts yearly medical insurance costs based on personal factors like age, BMI, and smoking status using a Random Forest model. Includes data analysis, model selection, and deployment with Flask and AWS Elastic Beanstalk.


F-U-Njoku/medical-cost-prediction


Medical Cost Prediction


Python 3.8+ Docker Elastic Beanstalk License: MIT


Overview

Problem Statement:
Medical insurance costs can vary widely based on factors like age, lifestyle, and pre-existing conditions. Accurately predicting these costs helps both insurance companies and individuals to make informed financial decisions. This project aims to build a predictive model that estimates yearly medical insurance costs based on personal attributes, such as age, sex, BMI, number of children, smoking status, and geographic location, using a regression algorithm.

Use cases:
For insurance providers, an accurate prediction model helps in setting premiums that reflect an individual’s health risk, leading to fairer pricing and better risk management. For policyholders, understanding factors that drive their insurance costs can guide them in making healthier lifestyle choices to reduce future premiums. This project’s model could serve as a foundational tool for personalized pricing strategies, helping in financial planning and fostering more transparent relationships between insurers and clients.

Data Source

The dataset used in this project is the Medical Cost Personal Dataset, publicly available on Kaggle and in the Data folder of this project. It contains records for 1,338 individuals and captures key personal and demographic information: age, sex, body mass index (BMI), number of children, smoking status, and region of residence in the U.S. The target variable, charges, is the medical insurance cost incurred by each individual in a year. Below is a snippet of the dataset:


Dataset Sample

Age  Sex     BMI   Children  Smoker  Region     Charges
19   Female  27.9  0         Yes     Southwest  16,882.67
18   Male    33.8  1         No      Southeast   1,721.35
28   Male    33.0  3         No      Southeast   4,444.43
33   Male    22.7  0         No      Northwest   3,219.86
32   Female  28.9  0         Yes     Southeast  39,323.77

Exploratory Data Analysis

The next step in the machine learning pipeline is understanding the data through Exploratory Data Analysis (EDA). This includes checking for and handling missing values, examining distributions and correlations to decide which transformations to apply, and selecting a subset of the available features if necessary. The EDA for this project is available in the Jupyter notebook notebook.ipynb. Summary from EDA:

  • The dataset is clean and has no missing values.
  • The target variable (charges) has a long right tail, so it is log-transformed before modeling.
  • Feature selection is not beneficial for this task.
  • The age variable is the top predictive variable.
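
The log transformation mentioned in the summary can be sketched as follows. This is a minimal illustration, not the notebook's code: it assumes a log(1 + x) transform (the notebook may use a plain logarithm instead), and the helper names `to_log_scale` and `from_log_scale` are hypothetical. The charges are the five values from the dataset sample above.

```python
import math


def to_log_scale(charges):
    """Compress the long right tail with log(1 + x) (assumed variant)."""
    return [math.log1p(c) for c in charges]


def from_log_scale(log_charges):
    """Invert the transform so values are back in the original units."""
    return [math.expm1(v) for v in log_charges]


# Charges from the dataset sample above.
charges = [16882.67, 1721.35, 4444.43, 3219.86, 39323.77]

log_charges = to_log_scale(charges)
restored = from_log_scale(log_charges)

# On the log scale the largest and smallest values are far closer together.
spread_raw = max(charges) - min(charges)
spread_log = max(log_charges) - min(log_charges)
print(f"raw spread: {spread_raw:.2f}, log spread: {spread_log:.2f}")
```

A model trained on the transformed target predicts on the log scale, so its predictions need the inverse transform before they are reported as yearly costs.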

Model Selection

For this project, eight regression algorithms were compared before selecting a final one. They are:

  • Linear Regression
  • Ridge Regression
  • Lasso Regression
  • Elastic Net
  • Decision Tree Regression
  • Random Forest Regression
  • XGBoost
  • K-Nearest Neighbors Regression

[Figures: Root Mean Squared Error and R2 comparison]

Based on the Root Mean Squared Error and R2 scores, Random Forest Regression was the best-performing algorithm and was selected to build the final model. The code for the model selection is available in notebook.ipynb.
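
For reference, the two metrics used in the comparison can be computed as in the sketch below. It is written with the standard library only so the formulas are visible; the notebook most likely relies on scikit-learn's metric functions instead. The model names and numbers are made up for illustration and are not results from this project.

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the mean squared residual."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r2_score(y_true, y_pred):
    """R2: 1 minus residual sum of squares over total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical validation targets and predictions from two candidate models.
y_true = [9.0, 10.0, 11.0, 8.0]
candidates = {
    "model_a": [9.0, 10.0, 10.0, 8.0],
    "model_b": [9.5, 9.5, 9.5, 9.5],  # always predicts the mean
}

scores = {
    name: (rmse(y_true, preds), r2_score(y_true, preds))
    for name, preds in candidates.items()
}

# Pick the candidate with the lowest RMSE.
best = min(scores, key=lambda name: scores[name][0])
for name, (err, r2) in scores.items():
    print(f"{name}: RMSE={err:.3f}, R2={r2:.3f}")
print("best:", best)
```

Lower RMSE and higher R2 are better; a model that always predicts the mean of the targets scores an R2 of 0.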

Model Information

  • Algorithm: Random Forest Regressor
  • Features: age, sex, bmi, children, smoker status, region
  • Target Variable: charges (log-transformed)
  • Model Performance: 0.3535 (Root Mean Squared Error)

Technical Architecture

  • App Framework: Flask
  • ML Framework: Scikit-learn
  • Model: Random Forest Regressor (max_depth=5, n_estimators=400)
  • Deployment: AWS Elastic Beanstalk with Docker
  • API Protocol: REST
  • Input/Output: JSON

Dependencies

  • Python 3.11
  • Flask
  • scikit-learn
  • numpy
  • pandas
  • gunicorn (for production deployment)

Deployment

Local

  1. Clone the repository:
git clone https://github.com/F-U-Njoku/medical-cost-prediction.git
cd medical-cost-prediction
  2. Install dependencies using Pipenv:
pip install pipenv
pipenv install
  3. Run the application:
pipenv run gunicorn --bind 0.0.0.0:8080 predict:app
  4. Use the application:
pipenv run python test.py

Docker

  1. Build the Docker image:
docker build -t med-cost-predictor .
  2. Run the container:
docker run -p 8080:8080 med-cost-predictor
  3. Use the application:
pipenv run python test.py

Cloud

So far, the application has been run locally. It has also been deployed to the cloud with AWS Elastic Beanstalk.


  1. Change the URL from local to cloud in the test.py file, replacing the first line below with the second:
url = f'http://{local}/predict'
url = f'http://{cloud}/predict'
  2. Use the application:
pipenv run python test.py
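
The contents of test.py are not reproduced in this README. The sketch below shows what a client of this kind could look like, using only the standard library. Several details are assumptions: that /predict accepts a POST request with a JSON body, that the response is JSON, and the host values, of which the cloud one is a placeholder rather than the real Elastic Beanstalk address. The helper names `build_request` and `send` are hypothetical.

```python
import json
import urllib.request


def build_request(host, applicant):
    """Build a POST request carrying the applicant as a JSON body."""
    url = f"http://{host}/predict"
    body = json.dumps(applicant).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=10):
    """Send the request and decode the JSON response (needs a running service)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


local = "localhost:8080"  # matches the port used by gunicorn and Docker above
cloud = "your-environment.elasticbeanstalk.com"  # placeholder, not the real host

applicant = {
    "age": 35,
    "sex": "male",
    "bmi": 26.5,
    "children": 2,
    "smoker": "no",
    "region": "northeast",
}

# Switch between deployments by passing `local` or `cloud`.
request = build_request(local, applicant)
print(request.get_method(), request.full_url)

# Uncomment once the service is reachable:
# print(send(request))
```

Keeping the host in a single variable means switching between the local and cloud deployments is a one-line change.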

Example

Request:

applicant = {
    "age": 35,
    "sex": "male",
    "bmi": 26.5,
    "children": 2,
    "smoker": "no",
    "region": "northeast"
}

Output:

The predicted medical cost is 12345.67 yearly.

Change details of the applicant in the test.py file and run it to get new predictions.
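
When editing the applicant, it can help to check the payload before sending it. The sketch below is one possible check and is not part of this repository: the function name `validate_applicant` is hypothetical, and the allowed category values are inferred from the dataset sample and the example request above rather than taken from the service code.

```python
# Allowed values inferred from the dataset sample and example request (assumption).
ALLOWED_CATEGORIES = {
    "sex": {"male", "female"},
    "smoker": {"yes", "no"},
    "region": {"northeast", "northwest", "southeast", "southwest"},
}
NUMERIC_FIELDS = ("age", "bmi", "children")


def validate_applicant(applicant):
    """Return a list of problems found in the payload; empty means it looks valid."""
    errors = []
    for field in NUMERIC_FIELDS:
        value = applicant.get(field)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            errors.append(f"{field} must be a number")
        elif value < 0:
            errors.append(f"{field} must not be negative")
    for field, allowed in ALLOWED_CATEGORIES.items():
        value = applicant.get(field)
        if not isinstance(value, str) or value.lower() not in allowed:
            errors.append(f"{field} must be one of {sorted(allowed)}")
    return errors


good = {
    "age": 35,
    "sex": "male",
    "bmi": 26.5,
    "children": 2,
    "smoker": "no",
    "region": "northeast",
}
bad = {"age": 35, "sex": "male", "children": 2, "smoker": "maybe", "region": "northeast"}

print(validate_applicant(good))
print(validate_applicant(bad))
```

The first call reports no problems; the second flags the missing bmi field and the unrecognized smoker value.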

Contributing

I welcome contributions to the Medical Cost Prediction project! Here's how you can help:

  1. Fork the repository
  2. Create a new branch (git checkout -b feature/AmazingFeature)
  3. Make your changes
  4. Commit your changes (git commit -m 'Add some AmazingFeature')
  5. Push to the branch (git push origin feature/AmazingFeature)
  6. Open a Pull Request

Please ensure your code adheres to the project's coding standards and includes appropriate tests.

Acknowledgements

Special thanks to DataTalks.Club for providing a practical and free course on Machine Learning. Gratitude to Alexey and the entire team for their efforts.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact
