Healthcare Fraud Detection Project

What is this project about?

This project explores healthcare claims data to identify potentially fraudulent activity among medical providers. I used data analysis and machine learning to surface suspicious patterns and flag potential fraud.

Where did the data come from?

The dataset is sourced from Kaggle. It includes a few different CSV files:

Beneficiary details
Inpatient claims
Outpatient claims
Labels marking claims suspected of fraud

For more details, see the project presentation slides.

What did I try to achieve?

Understand the data better by exploring and visualizing it.
Find important features that help in detecting fraud.
Use machine learning, especially logistic regression, to predict fraud.
Measure how well my models performed using ROC-AUC, precision, recall, and F1-score.

How did I approach this?

Data Cleaning and Preprocessing

Handled missing values and removed duplicate records.
Converted categorical data into numerical format and converted date fields from object type to datetime.
Scaled numerical features so that they sit on comparable ranges for modeling.
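
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names (ClaimID, ClaimStartDt, Gender, AmtReimbursed), the toy rows, and the choice of median imputation are all assumptions made for the example.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny made-up claims table; the column names are hypothetical, not the real schema.
claims = pd.DataFrame({
    "ClaimID": ["C1", "C2", "C2", "C3"],
    "ClaimStartDt": ["2009-01-01", "2009-02-10", "2009-02-10", "2009-03-05"],
    "Gender": ["F", "M", "M", "F"],
    "AmtReimbursed": [1000.0, None, None, 300.0],
})


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Deduplicate, impute, convert dates, encode categoricals, and scale."""
    out = df.drop_duplicates().copy()
    # Fill missing amounts with the column median (one possible strategy, assumed here).
    out["AmtReimbursed"] = out["AmtReimbursed"].fillna(out["AmtReimbursed"].median())
    # Convert the date column from object (string) dtype to datetime.
    out["ClaimStartDt"] = pd.to_datetime(out["ClaimStartDt"])
    # Turn the categorical column into 0/1 indicator columns.
    out = pd.get_dummies(out, columns=["Gender"], dtype=int)
    # Standardize the numeric column to zero mean and unit variance.
    out[["AmtReimbursed"]] = StandardScaler().fit_transform(out[["AmtReimbursed"]])
    return out.reset_index(drop=True)


clean = preprocess(claims)
print(clean)
```

In a real pipeline the scaler would normally be fitted on the training split only and then applied to the test split, so that no test information leaks into preprocessing.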

Exploratory Data Analysis (EDA)

Reviewed summary statistics to understand patterns and outliers.
Created scatter plots, histograms, and bar charts to make trends easier to see.
Checked important factors like how long claims lasted, the amounts reimbursed, and provider behavior.
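
As one example of this kind of check, the sketch below compares reimbursed amounts between flagged and unflagged records. The data and the column names (Provider, AmtReimbursed, PotentialFraud) are invented for illustration and may not match the real files.

```python
import pandas as pd

# Made-up records; column names and values are assumptions for illustration.
claims = pd.DataFrame({
    "Provider": ["P1", "P1", "P2", "P3", "P3", "P3"],
    "AmtReimbursed": [100, 300, 50, 1000, 2000, 3000],
    "PotentialFraud": [0, 0, 0, 1, 1, 1],
})

# Summary statistics of reimbursed amounts, split by the fraud label.
summary = claims.groupby("PotentialFraud")["AmtReimbursed"].agg(["count", "mean", "median"])
print(summary)
```

The same grouped table can be passed to a Seaborn or Matplotlib bar chart or histogram to view the difference visually.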

Feature Engineering

Created new features such as total claim amount and claim duration.
Identified potential features for future exploration, such as behaviors indicative of double billing, unbundling, phantom billing, and upcoding.
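
A minimal sketch of how claim duration and total claim amount could be derived with Pandas is shown below. The column names (Provider, ClaimStartDt, ClaimEndDt, AmtReimbursed) and the per-provider aggregation are assumptions for the example, not necessarily how the notebook defines them.

```python
import pandas as pd

# Made-up claims; column names are hypothetical.
claims = pd.DataFrame({
    "Provider": ["P1", "P1", "P2"],
    "ClaimStartDt": pd.to_datetime(["2009-01-01", "2009-01-10", "2009-02-01"]),
    "ClaimEndDt": pd.to_datetime(["2009-01-04", "2009-01-10", "2009-02-15"]),
    "AmtReimbursed": [500.0, 200.0, 1500.0],
})

# Claim duration in whole days (end date minus start date).
claims["ClaimDurationDays"] = (claims["ClaimEndDt"] - claims["ClaimStartDt"]).dt.days

# Roll claim-level values up to one row of features per provider.
provider_features = claims.groupby("Provider").agg(
    TotalClaimAmount=("AmtReimbursed", "sum"),
    MeanClaimDurationDays=("ClaimDurationDays", "mean"),
    ClaimCount=("AmtReimbursed", "size"),
).reset_index()
print(provider_features)
```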

Model Building

Split the data into training (80%) and testing (20%) sets.
Used logistic regression with some hyperparameter tuning.
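
The split and tuning steps might look roughly like the sketch below. It runs on synthetic data from `make_classification`, and the stratified split, the `class_weight="balanced"` setting, the grid over `C`, and the five-fold cross-validation are all assumptions chosen for illustration. The README does not say which hyperparameters were actually tuned.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the real features and labels.
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=0
)

# 80% training, 20% testing; stratify to keep the class ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaling lives inside the pipeline so it is refitted within each CV fold.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)

# Hypothetical grid over the inverse regularization strength C.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=5)
search.fit(X_train, y_train)

print("Best C:", search.best_params_["logisticregression__C"])
print("Cross-validated ROC-AUC:", round(search.best_score_, 3))
```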

How did I measure success?

Used ROC-AUC to check how effectively my model distinguished between fraudulent and non-fraudulent claims.
Looked at precision, recall, and F1-score to evaluate the balance between finding fraud and avoiding false alarms.
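
To make these metrics concrete, here is a small worked example using Scikit-learn. The labels, scores, and the 0.5 decision threshold are invented for illustration and are not results from this project. Precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is their harmonic mean.

```python
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

# Made-up ground truth (1 = fraud) and predicted fraud probabilities.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.2, 0.6, 0.3, 0.8, 0.4, 0.9, 0.05]

# Convert probabilities to hard labels at an assumed 0.5 threshold.
y_pred = [int(s >= 0.5) for s in y_score]

# ROC-AUC uses the raw scores; the other metrics use the thresholded labels.
auc = roc_auc_score(y_true, y_score)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(f"ROC-AUC={auc:.3f} precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

Here the model catches two of the three fraud cases (recall 2/3) and raises one false alarm among its three positive predictions (precision 2/3). ROC-AUC is 14/15 because 14 of the 15 fraud/non-fraud pairs are ranked correctly.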

Tools I used

Python
Pandas
NumPy
Matplotlib
Seaborn
Scikit-learn
Jupyter Notebook

