HackmaniaGX/neural-nexus-healthcare-fwa-analysis

🕵️‍♀️ Healthcare Fraud Detection Using Graph Analysis and AI 🤖

🎯 Purpose:
This repository contains our project for the ITAG Atlantec Hackathon 2025 held in Galway. Our team (The Neural Nexus) developed an innovative approach to detect healthcare fraud by leveraging graph analysis and machine learning techniques.

🎯 Solution Goal:
The goal is to improve detection accuracy, transparency, and investigation efficiency in healthcare systems.

🚦 Solution Scope:
This analysis aims to detect healthcare fraud by identifying suspicious claims, providers, and patient patterns. It employs machine learning models like Isolation Forest and Random Forest to flag anomalies, high claim volumes, and abnormal claim amounts. Network analysis visualizes potential collusion, while feature engineering highlights key indicators of fraudulent activity. The goal is to enable early detection of high-risk entities, streamline investigations, and improve fraud prevention efforts, ultimately safeguarding healthcare resources and ensuring system integrity.

📊 Summary of results
The results indicate that the following patterns can be identified during normal claims processing and fed into business operations for further action:

  • 💰 High claim amounts and frequent claims flagged as potential fraud.
  • 🔍 Outlier claims identified via anomaly detection.
  • 🌐 Network analysis reveals clusters of providers and patients with suspicious interactions.

🌟 Introduction

This project analyses healthcare claims data to identify potential fraud patterns. Using synthetic datasets generated via Synthea, we perform exploratory data analysis, feature engineering, anomaly detection, and visualization to uncover suspicious behaviors from patient, provider, and claims perspectives.

🏥 About Healthcare Fraud, Waste, and Abuse (FWA)

Fraud, Waste, and Abuse (FWA) in healthcare represent significant challenges worldwide, leading to billions of euros in unnecessary costs annually. Specifically, in Ireland and across the European Union, healthcare systems are under increasing pressure to optimize resources while maintaining high standards of care.

  • Fraud involves intentional deception or misrepresentation for financial gain, such as falsifying claims or identities.
  • Waste refers to overutilization or inefficient practices that increase costs without improving patient outcomes.
  • Abuse includes practices like upcoding, billing for services not rendered, or unbundling procedures to inflate charges.

Note:
This project focuses on detecting fraudulent activity and suspicious relationships between healthcare actors such as providers, patients, and pharmacies.

💡 Why does this matter?

  • According to the European Healthcare Fraud and Corruption Network (EHFCN), EU member states lose an estimated €56 billion annually due to healthcare fraud and abuse. Source
  • In Ireland, the Health Service Executive (HSE) estimates that up to 10% of healthcare expenditure may be attributable to fraud, waste, or abuse, translating into hundreds of millions of euros annually.
  • The World Bank reports that health sector fraud costs can account for up to 3-5% of total health expenditures in developed countries.

Effective fraud detection helps prevent financial losses, addresses inefficiencies and compliance risks, and limits rises in the cost of care. This project applies data analysis and machine learning techniques to identify suspicious claims and activities in synthetic Irish healthcare data.


🔬 Analysis & Techniques for FWA Detection

1. Understanding FWA Concepts

| Concept | Explanation | Indicator | Formula / Metric | Interpretation |
|---|---|---|---|---|
| Fraud ✅ | Intentional deception for financial gain | Claims with false info or staging | False claims, identity theft | Excessive claims, suspicious patterns |
| Waste 🚫 | Overutilization or misuse of resources | Excessive billing, unnecessary procedures | Cost per patient/provider | High expenses with no added value |
| Abuse 🚫 | Improper billing practices | Upcoding, unbundling | Billing codes misuse | Discrepancies between services and codes |

2. Techniques & Code

  • Outlier Detection (e.g., Isolation Forest)
    • Purpose: Automatically flag claims that deviate significantly from normal patterns.
    • Formula / Concept: Isolation Forest isolates anomalies by randomly partitioning the data space. Fewer splits to isolate a point suggest an anomaly.
    • Interpretation: Claims flagged as -1 are potential frauds or anomalies needing manual review.
  • Threshold-based Flagging:
    • Purpose: Identify claims that exceed set thresholds for amount or frequency and therefore warrant review.
  • Visualization & Network Analysis:
    • Purpose: Reveal clusters of suspicious providers and patients that warrant review.
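
For reference, the "fewer splits suggest an anomaly" idea can be stated as a formula. The definition below comes from the original Isolation Forest paper (Liu, Ting, and Zhou), not from this repository's notebooks. Here h(x) is the number of splits needed to isolate point x in one tree, n is the subsample size used to build each tree, and H(i) is the i-th harmonic number.

```latex
s(x, n) = 2^{-\frac{E[h(x)]}{c(n)}},
\qquad
c(n) = 2\,H(n-1) - \frac{2\,(n-1)}{n}
```

A score close to 1 indicates a point that is isolated after very few splits (a likely anomaly), while scores well below 0.5 indicate ordinary points. Note that scikit-learn's score_samples returns the opposite of this quantity, so in scikit-learn lower values mean more abnormal.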

🏁 Getting Started

📚 The Notebooks

The repository includes three main notebooks:

  • healthcare_fwa_analysis.ipynb: General FWA analysis with visualizations.
  • healthcare_fwa_graph_analysis.ipynb: Graph-based ML approach inspired by research papers.
  • healthcare_fwa_opt2_analysis.ipynb: Machine learning to detect suspicious claims.

🗂️ Directory Structure

.
├── notebooks/
│   ├── healthcare_fwa_analysis.ipynb
│   ├── healthcare_fwa_graph_analysis.ipynb
│   └── healthcare_fwa_opt2_analysis.ipynb
├── data/
│   └── sample_data/
│       └── csv/
│           └── [county folders]
├── containers/
│   ├── Dockerfile
│   └── other configs
├── requirements.txt
├── LICENSE.md
└── README.md

📊 Datasets

This project utilizes a synthetic healthcare dataset generated through the Synthea tool. Synthea is an open-source simulator designed to produce realistic, anonymized patient records based on population models. This approach enables scalable and privacy-preserving research and analysis.

INFO:
Due to limited and insufficient data available from HSE and broader European sources, we opted to use synthetic data generated via Synthea. This allows us to create a comprehensive, realistic dataset that includes diverse patient profiles, claims, and healthcare scenarios necessary for effective fraud detection analysis. Using synthetic data ensures data privacy, enables scalable testing, and provides the detailed information required to develop and validate robust fraud detection models in a controlled environment.

Why Use Synthetic Data?

  • Privacy & Confidentiality: Synthetic data circumvents privacy concerns associated with real patient records.
  • Controlled Environment: Facilitates testing, experimentation, and validation without risking sensitive information.
  • Resource Efficiency: Eliminates the need for complex data access procedures and compliance hurdles.
  • Customizability: Data can be tailored to specific scenarios, demographic distributions, or rare conditions, enhancing research flexibility.

The datasets used include:

  • Patients Data (patients.csv): Demographics, medical history, and socioeconomic info.
  • Claims Data (claims.csv): Claims details such as amount, date, diagnoses, and provider.
  • Claims Transactions (claims_transactions.csv): Detailed transaction info linked to claims.
  • Providers Data (providers.csv): Provider specialties, locations, and contact info.

The datasets are synthetically generated and stored in CSV format, providing a realistic basis for analysis.

The datasets are stored in the following directory structure (within the repo):

data/
└── sample_data/
    └── csv/
        └── [county-specific folders]/
            └── *.csv

Each folder contains the CSV files for one county (for example: galway, limerick, dublin, cork).
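
As a minimal sketch of how the per-county layout could be read into a single table, the helper below walks the county folders and stacks one table from each. The function name load_county_table and the added COUNTY column are hypothetical (they are not part of the notebooks), and the sketch assumes every county folder uses the file names listed above, such as claims.csv.

```python
from pathlib import Path

import pandas as pd


def load_county_table(csv_root, table_name):
    """Load one table (e.g. 'claims') from every county folder under csv_root.

    Assumes the layout <csv_root>/<county>/<table_name>.csv.
    The COUNTY column is added here for illustration; it is not in the raw files.
    """
    frames = []
    for county_dir in sorted(Path(csv_root).iterdir()):
        csv_path = county_dir / f"{table_name}.csv"
        if not csv_path.is_file():
            continue  # skip stray files and counties without this table
        frame = pd.read_csv(csv_path)
        frame["COUNTY"] = county_dir.name  # remember where each row came from
        frames.append(frame)
    if not frames:
        raise FileNotFoundError(f"no {table_name}.csv found under {csv_root}")
    return pd.concat(frames, ignore_index=True)
```

A call such as load_county_table("data/sample_data/csv", "claims") would then return all counties' claims in one DataFrame, which makes it possible to compare providers across counties.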

Warning: Using Synthea-generated synthetic datasets enables safe and flexible healthcare data analysis. However, it is crucial to understand and account for the inherent biases to ensure meaningful and responsible insights.

Libraries Used 📚

| Library | Purpose |
|---|---|
| scikit-learn | 🧠 ML algorithms (Random Forest, Isolation Forest) + evaluation tools |
| pandas | 🐼 Data handling & manipulation |
| matplotlib / plt | 🎨 Static visualizations (charts, distributions) |
| seaborn | 📊 Enhanced statistical plots (correlation, importance) |
| plotly.express | 🌐 Interactive dashboards & visual analytics |
| sklearn.inspection | 🔍 Model interpretability (partial dependence) |
| numpy | ⚙️ Numerical operations & array management |

Prerequisites ✅

Ensure you have the following installed:

  • Python 3.8+ environment
  • Git & Git LFS
  • Jupyter Notebook (or compatible IDE)
  • Packages from requirements.txt:
pip install -r requirements.txt

Required Python packages:

If you are not installing from requirements.txt, install the core packages directly from a terminal (inside a notebook cell, use %pip install instead):

pip install pandas numpy matplotlib seaborn plotly scikit-learn networkx

Clone the repo 🛰️

git clone https://github.com/HackmaniaGX/neural-nexus-healthcare-fwa-analysis.git
cd neural-nexus-healthcare-fwa-analysis

Large Files and Git LFS 📦

This repository uses Git Large File Storage (LFS) for managing large files such as datasets and models. When cloning this repository, ensure that you have Git LFS installed and initialized on your system to fetch these files properly.

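To initialize Git LFS (the git-lfs client itself must already be installed through your platform's package manager or installer):

# Run once per user account to set up the Git LFS hooks
git lfs install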

After cloning the repository, run:

git lfs pull

Set Up Environment ⚙️

Using Docker 🐳

To build and run the container:

# The Dockerfile lives in containers/ (see the directory structure above)
docker build -t healthcare-fwa -f containers/Dockerfile .
docker run -p 8888:8888 -v "$(pwd):/app" healthcare-fwa

This starts a Jupyter Notebook server, accessible at http://localhost:8888.

Manual Setup

Alternatively, set up a Python environment:

pip install -r requirements.txt

Run Notebooks 📒

Launch Jupyter Notebook in the project directory:

jupyter notebook

Then open the notebooks in the notebooks/ directory:

  • Initial Data Analysis & Visualizations: healthcare_fwa_analysis.ipynb
  • Graph-Based ML Analysis: healthcare_fwa_graph_analysis.ipynb
  • ML-based Detection Algorithm: healthcare_fwa_opt2_analysis.ipynb

Note: Update the dataset paths if necessary so that they point to your local data folders.

Alternative: Running Notebooks on Google Cloud Vertex AI Workbench

For a seamless experience and enhanced computational resources, we recommend running these notebooks within Google Cloud Vertex AI Workbench. This managed environment simplifies setup, scales easily, and provides powerful GPUs/TPUs for large-scale data processing.

Getting Started in Vertex AI Workbench

  1. Create a Vertex AI Workbench Notebook Instance: Follow the Google Cloud documentation to set up a Managed Notebook environment.
  2. Clone the Repository: Use Git within the notebook to clone this repository or upload files directly.
  3. Configure Environment: Install necessary dependencies, either via requirements.txt or conda environments.
  4. Open and Run Notebooks: Launch the notebooks directly from the Vertex AI interface.

Visual Guidance (Screenshots)

Below are some reference screenshots illustrating the setup process and interface:

Fig 1: Vertex AI Workspace Dashboard

Fig 2: Opening and running notebooks within the environment

Analysis Workflow 🔍

This section documents the high-level steps involved in processing the data.

Data Loading and Preprocessing

import pandas as pd

patients_df = pd.read_csv('path/to/patients.csv')
claims_df = pd.read_csv('path/to/claims.csv')
transactions_df = pd.read_csv('path/to/claims_transactions.csv')
providers_df = pd.read_csv('path/to/providers.csv')

# Convert date columns to datetime
patients_df['BIRTHDATE'] = pd.to_datetime(patients_df['BIRTHDATE'], errors='coerce')
claims_df['SERVICEDATE'] = pd.to_datetime(claims_df['SERVICEDATE'], errors='coerce')
transactions_df['FROMDATE'] = pd.to_datetime(transactions_df['FROMDATE'], errors='coerce')
transactions_df['TODATE'] = pd.to_datetime(transactions_df['TODATE'], errors='coerce')

Data Merging

# Attach each transaction row to its parent claim; a left join keeps claims that have no transactions
merged_df = claims_df.merge(transactions_df, on='CLAIMID', how='left')
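
A slightly more defensive variant of this merge is sketched below on toy data. It assumes, as the merge above does, that both tables expose a CLAIMID column; if your claims export names its key differently, rename it first. It also assumes the two tables may share other column names (PROVIDERID is used as the example), in which case pandas would otherwise rename them to PROVIDERID_x and PROVIDERID_y. The toy values and the _TXN suffix are illustrative only.

```python
import pandas as pd

# Toy frames standing in for claims_df and transactions_df (values are made up).
claims_df = pd.DataFrame({
    "CLAIMID": ["c1", "c2"],
    "PROVIDERID": ["p1", "p2"],
})
transactions_df = pd.DataFrame({
    "CLAIMID": ["c1", "c1", "c9"],
    "PROVIDERID": ["p1", "p1", "p3"],
    "AMOUNT": [120.0, 80.0, 55.0],
})

merged_df = claims_df.merge(
    transactions_df,
    on="CLAIMID",
    how="left",
    suffixes=("", "_TXN"),   # keep claim-level column names unsuffixed
    validate="one_to_many",  # raise if CLAIMID repeats on the claims side
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Claims without any transaction rows keep NaN amounts and are marked 'left_only'.
unmatched_claims = merged_df.loc[merged_df["_merge"] == "left_only", "CLAIMID"].tolist()
print(unmatched_claims)
```

Keeping the claim-level PROVIDERID unsuffixed means later steps that reference merged_df['PROVIDERID'] continue to work, and the list of unmatched claims gives a quick data-quality check before any fraud rules are applied.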

Exploratory Data Analysis (EDA)

1. Patients Demographics

1.1 Distribution of Gender
import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(data=patients_df, x='GENDER')
plt.title('Patient Gender Distribution')
plt.show()
1.2 Age Calculation
import datetime as dt
today = dt.date.today()
patients_df['AGE'] = patients_df['BIRTHDATE'].apply(lambda x: today.year - x.year - ((today.month, today.day) < (x.month, x.day)))
sns.histplot(patients_df['AGE'], bins=20)
plt.title('Patient Age Distribution')
plt.show()
1.3 Claims Overview
# Claim amount distribution
sns.boxplot(x=merged_df['AMOUNT'])
plt.title('Claim Amounts')
plt.show()

# Claims over Time
merged_df['SERVICEDATE'].hist(bins=30)
plt.title('Claims Over Time')
plt.show()

1.4 Provider Analysis

# Top providers by number of claims
provider_claim_counts = merged_df['PROVIDERID'].value_counts().head(10)
provider_claim_counts.plot(kind='bar')
plt.ylabel('Number of Claims')
plt.title('Top 10 Providers by Claim Count')
plt.show()

Feature Engineering & Suspicious Pattern Detection

1. Identify High-Value Claims
high_claim_threshold = merged_df['AMOUNT'].quantile(0.99)
suspicious_claims = merged_df[merged_df['AMOUNT'] > high_claim_threshold]
2. Detect Repetitive & Rapid Claims
# Derive the calendar day on claims_df itself so each claim is counted once per provider-day
claims_df['DATE'] = claims_df['SERVICEDATE'].dt.date
claims_per_provider_day = claims_df.groupby(['PROVIDERID', 'DATE']).size()
suspicious_frequency = claims_per_provider_day[claims_per_provider_day > 10]
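
Daily counts can miss bursts of claims filed minutes apart. The sketch below is one possible complement, not the notebooks' method: it measures the gap between consecutive claims from the same provider. It assumes SERVICEDATE carries a time of day, and the 5-minute cut-off and the toy rows are hypothetical.

```python
import pandas as pd

# Toy claims (values are made up); SERVICEDATE includes a time of day here.
claims = pd.DataFrame({
    "PROVIDERID": ["p1", "p1", "p1", "p2", "p2"],
    "SERVICEDATE": pd.to_datetime([
        "2024-03-01 09:00", "2024-03-01 09:02", "2024-03-01 15:00",
        "2024-03-01 10:00", "2024-03-02 10:00",
    ]),
})

# Time since the same provider's previous claim (NaT for a provider's first claim).
claims = claims.sort_values(["PROVIDERID", "SERVICEDATE"])
claims["GAP"] = claims.groupby("PROVIDERID")["SERVICEDATE"].diff()

# Hypothetical cut-off: claims filed within 5 minutes of the previous one.
min_gap = pd.Timedelta(minutes=5)
rapid_claims = claims[claims["GAP"] < min_gap]
print(rapid_claims)
```

Comparisons against NaT evaluate to False, so a provider's first claim is never flagged. If SERVICEDATE only holds dates, every same-day gap would be zero and the daily count above is the more meaningful signal.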
3. Anomaly Detection with Isolation Forest
from sklearn.ensemble import IsolationForest
import numpy as np

# Prepare features (missing amounts are treated as 0)
features = merged_df[['AMOUNT']].fillna(0)

clf = IsolationForest(contamination=0.01, random_state=42)
# fit_predict returns labels, not scores: -1 marks an anomaly, 1 an inlier
merged_df['anomaly_score'] = clf.fit_predict(features)

# Filter suspected fraudulent claims
suspects = merged_df[merged_df['anomaly_score'] == -1]
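
The labels from fit_predict only say whether a claim crossed the contamination threshold. For manual review it can help to rank claims by how abnormal they are. The sketch below uses scikit-learn's score_samples for that purpose on synthetic amounts; the toy data and the column names raw_score and label are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for merged_df: 300 ordinary claims plus one extreme claim.
rng = np.random.default_rng(0)
amounts = np.append(rng.normal(100.0, 15.0, size=300), 25_000.0)
toy = pd.DataFrame({"AMOUNT": amounts})
features = toy[["AMOUNT"]]

clf = IsolationForest(contamination=0.01, random_state=42).fit(features)

# score_samples: the lower the value, the more abnormal the row.
toy["raw_score"] = clf.score_samples(features)
# predict applies the contamination threshold: -1 for anomalies, 1 for inliers.
toy["label"] = clf.predict(features)

# Review queue: most abnormal claims first.
review_queue = toy.sort_values("raw_score").head(5)
print(review_queue)
```

Sorting by the continuous score lets investigators work from the top of the queue down, rather than treating every flagged claim as equally suspicious.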

Visualization of Suspicious Activities

import networkx as nx

G = nx.Graph()
for _, row in suspects.iterrows():
    G.add_node(row['PROVIDERID'], type='provider')
    G.add_node(row['PATIENTID'], type='patient')
    G.add_edge(row['PROVIDERID'], row['PATIENTID'])

nx.draw(G, with_labels=True)
plt.title('Suspected Provider-Patient Network')
plt.show()
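
A drawing becomes hard to read once the graph has more than a few dozen nodes. As a hedged sketch of how the same graph could be summarised numerically, the example below groups linked providers and patients with connected components and ranks providers by degree. The toy pairs are invented and stand in for the rows of suspects.

```python
import networkx as nx

# Toy provider-patient pairs standing in for the rows of `suspects` (made up).
pairs = [
    ("prov_A", "pat_1"), ("prov_A", "pat_2"), ("prov_B", "pat_2"),
    ("prov_B", "pat_3"), ("prov_C", "pat_9"),
]

G = nx.Graph()
for provider, patient in pairs:
    G.add_node(provider, type="provider")
    G.add_node(patient, type="patient")
    G.add_edge(provider, patient)

# Connected components group providers and patients that are linked,
# directly or through shared counterparts.
clusters = sorted(nx.connected_components(G), key=len, reverse=True)
largest = clusters[0]

# Degree = number of distinct suspicious counterparts for each provider.
providers_by_degree = sorted(
    ((node, deg) for node, deg in G.degree() if G.nodes[node]["type"] == "provider"),
    key=lambda item: item[1],
    reverse=True,
)
print(largest)
print(providers_by_degree)
```

In this toy graph prov_A and prov_B end up in the same cluster because they share pat_2, which is the kind of indirect link that a per-claim rule would not surface.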

💡 Concepts and Techniques Used

The analysis techniques implemented are specifically designed to identify fraudulent activities within healthcare claims:

  • Suspicious Claim Amounts: Flagging claims with abnormally high amounts or units that are unlikely to be legitimate.
  • Unusual Claim Frequency: Detecting providers or patients with an unusually high number of claims, indicating potential overutilization or collusive behavior.
  • Suspicious Service Codes: Identifying claims associated with known or potentially fraudulent procedure codes.
  • Rapid Claim Submissions: Highlighting providers submitting multiple claims within short timeframes, a pattern often associated with fraud rings.
  • Anomaly Scores from Machine Learning: Utilizing algorithms like Isolation Forest and Random Forest to detect claims that significantly deviate from standard patterns, flagging them as potential fraud cases.
  • Network and Graph Analysis: Revealing organized collusion or fraud rings by visualizing suspicious provider-patient relationships.

These methods collectively enhance the ability to detect, prioritize, and investigate fraudulent activities efficiently, helping to reduce financial losses and protect the integrity of healthcare systems.

📝 Findings & Next Steps

  • High claim amounts and frequent claims flagged as potential fraud.
  • Outlier claims identified via anomaly detection.
  • Network analysis reveals clusters of providers and patients with suspicious interactions.

Next steps include integrating supervised learning models with labeled data, refining feature sets, and developing dashboards for ongoing monitoring.
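
To make the supervised next step concrete, here is a minimal sketch of a Random Forest trained on labeled claims. Everything in it is synthetic: the feature names, the IS_FRAUD-style target, and the rule that generates the labels are invented for illustration, since Synthea data carries no fraud labels. It is not the model used in the notebooks.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Entirely synthetic data: features and labels are invented for illustration.
rng = np.random.default_rng(7)
n = 1000
X = pd.DataFrame({
    "AMOUNT": rng.gamma(shape=2.0, scale=150.0, size=n),
    "CLAIMS_PER_DAY": rng.poisson(lam=3.0, size=n),
    "NOISE": rng.normal(size=n),  # deliberately uninformative feature
})
# Label rule used only to generate the toy target.
y = ((X["AMOUNT"] > 600) & (X["CLAIMS_PER_DAY"] >= 3)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# class_weight="balanced" compensates for fraud being the rare class.
model = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=42
)
model.fit(X_train, y_train)

print(classification_report(y_test, model.predict(X_test), zero_division=0))

# Impurity-based importances show which features drive the predictions.
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

With real data, the labels would come from confirmed investigation outcomes, and precision and recall on the fraud class matter more than overall accuracy because fraudulent claims are rare.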

🎯 Final Notes

  • This analysis serves as an initial screening tool.
  • False positives are possible; manual review is essential.
  • Models should be continuously updated with new data and feedback.

📸 Visual Summary

The following images illustrate the key findings from our analysis. These visualizations highlight suspicious patterns, network interactions, outlier claims, and other relevant metrics identified during our investigation. The results demonstrate the effectiveness of our graph-based and machine learning approaches in flagging potential healthcare fraud. Review the images below to gain insights into the detected anomalies and the overall performance of our detection techniques.

[Images: three network suspicion graphs (derived via research methods; custom analysis on the primary dataset; analysis on the second dataset), followed by supporting plots, patient and claim diagrams, and Google Cloud screenshots.]

🔗 References & Resources

⚠️ AI-Enhanced Content Notice

WARNING: Some content in this repository, including explanations, summaries, and documentation, has been generated or enhanced using AI tools to improve clarity and detail. Users should review and validate these sections as needed.

🙌 Authors & Acknowledgments

📜 License

This project is licensed under the PRIVATE & COPYRIGHTED License. See the LICENSE file for details.

© Neural Nexus Team - All rights reserved.

With love for healthcare data analysis and fraud detection.
