Purpose:
This repository contains our project for the ITAG Atlantec Hackathon 2025 held in Galway. Our team (The Neural Nexus) developed an innovative approach to detect healthcare fraud by leveraging graph analysis and machine learning techniques.
Solution Goal:
The goal is to improve detection accuracy, transparency, and investigation efficiency in healthcare systems.
Solution Scope:
This analysis aims to detect healthcare fraud by identifying suspicious claims, providers, and patient patterns. It employs machine learning models like Isolation Forest and Random Forest to flag anomalies, high claim volumes, and abnormal claim amounts. Network analysis visualizes potential collusion, while feature engineering highlights key indicators of fraudulent activity. The goal is to enable early detection of high-risk entities, streamline investigations, and improve fraud prevention efforts, ultimately safeguarding healthcare resources and ensuring system integrity.
Summary of results
The results indicate that the following patterns can be identified during normal claims processing and integrated into business operations for further action:
- High claim amounts and frequent claims flagged as potential fraud.
- Outlier claims identified via anomaly detection.
- Network analysis reveals clusters of providers and patients with suspicious interactions.
- Introduction
- Analysis & Techniques for FWA Detection
- Getting Started
- Analysis Workflow
- References & Resources
- Authors & Acknowledgments
- License
This project analyses healthcare claims data to identify potential fraud patterns. Using synthetic datasets generated via Synthea, we perform exploratory data analysis, feature engineering, anomaly detection, and visualization to uncover suspicious behaviors from patient, provider, and claims perspectives.
Fraud, Waste, and Abuse (FWA) in healthcare represent significant challenges worldwide, leading to billions of euros in unnecessary costs annually. Specifically, in Ireland and across the European Union, healthcare systems are under increasing pressure to optimize resources while maintaining high standards of care.
- Fraud involves intentional deception or misrepresentation for financial gain, such as falsifying claims or identities.
- Waste refers to overutilization or inefficient practices that increase costs without improving patient outcomes.
- Abuse includes practices like upcoding, billing for services not rendered, or unbundling procedures to inflate charges.
Note:
The scope of this project is to detect fraudulent activity or suspicious relationships between healthcare actors such as providers, patients, and pharmacies.
- According to the European Healthcare Fraud and Corruption Network (EHFCN), EU member states lose an estimated €56 billion annually due to healthcare fraud and abuse.
- In Ireland, the Health Service Executive (HSE) estimates that up to 10% of healthcare expenditure may be attributable to fraud, waste, or abuse, translating into hundreds of millions of euros annually.
- The World Bank reports that health sector fraud costs can account for up to 3-5% of total health expenditures in developed countries.
Effective fraud detection helps prevent financial losses, address inefficiencies and compliance risks, and contain rises in the cost of care. This project leverages data analysis and machine learning techniques to identify suspicious claims and activities in synthetic Irish healthcare data.
Concept | Explanation | Indicator | Formula / Metric | Interpretation |
---|---|---|---|---|
Fraud | Intentional deception for financial gain | Claims with false info or staging | False claims, identity theft | Excessive claims, suspicious patterns |
Waste | Overutilization or misuse of resources | Excessive billing, unnecessary procedures | Cost per patient/provider | High expenses with no added value |
Abuse | Improper billing practices | Upcoding, unbundling | Billing codes misuse | Discrepancies between services and codes |
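The "cost per patient/provider" metric in the table can be computed with a simple group-by. The following is a minimal sketch, assuming a claims table with `PROVIDERID`, `PATIENTID` and `AMOUNT` columns (names borrowed from the analysis code later in this README). The toy rows and the `provider_cost_metrics` helper are made up for illustration and are not part of the notebooks.

```python
import pandas as pd

# Toy claims; values are illustrative only, not project data.
claims = pd.DataFrame({
    "PROVIDERID": ["p1", "p1", "p1", "p2", "p2", "p3"],
    "PATIENTID":  ["a",  "a",  "b",  "c",  "d",  "e"],
    "AMOUNT":     [100.0, 150.0, 50.0, 900.0, 1100.0, 80.0],
})

def provider_cost_metrics(df: pd.DataFrame) -> pd.DataFrame:
    """Total billed, claim count, distinct patients and cost per patient per provider."""
    metrics = df.groupby("PROVIDERID").agg(
        total_amount=("AMOUNT", "sum"),
        n_claims=("AMOUNT", "size"),
        n_patients=("PATIENTID", "nunique"),
    )
    metrics["cost_per_patient"] = metrics["total_amount"] / metrics["n_patients"]
    # Highest cost per patient first, so reviewers see the extremes at the top
    return metrics.sort_values("cost_per_patient", ascending=False)

metrics = provider_cost_metrics(claims)
print(metrics)
```

In this toy table provider `p2` ranks first because it bills far more per distinct patient than the others. A high value is only a prompt for review; specialty and case mix can explain it legitimately.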
- Outlier Detection (e.g., Isolation Forest)
  - Purpose: Automatically flag claims that deviate significantly from normal patterns.
  - Formula / Concept: Isolation Forest isolates anomalies by randomly partitioning the data space. A point that needs fewer splits to isolate is more likely to be an anomaly.
  - Interpretation: Claims flagged as -1 are potential fraud cases or anomalies that need manual review.
- Threshold-based Flagging
  - Purpose: Identify claims that exceed set thresholds for amount or frequency and may therefore require review.
- Visualization & Network Analysis
  - Purpose: Reveal clusters of suspicious providers and patients that may require review.
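For readers who want the formula behind the "fewer splits" idea: the original Isolation Forest paper (Liu, Ting and Zhou, 2008) defines an anomaly score from the average path length. Here h(x) is the number of splits needed to isolate point x in one tree, E[h(x)] is its average over all trees, and n is the subsample size used to build each tree.

```latex
s(x, n) = 2^{-\frac{E[h(x)]}{c(n)}}, \qquad
c(n) = 2H(n-1) - \frac{2(n-1)}{n}, \qquad
H(i) \approx \ln(i) + 0.5772
```

Scores close to 1 indicate anomalies, while scores well below 0.5 indicate normal points. Note that scikit-learn's `score_samples` returns the negative of this quantity, so lower values are more abnormal there, and `predict` converts the scores into the -1 / 1 labels using the `contamination` setting.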
The repository includes three main notebooks:
- healthcare_fwa_analysis.ipynb: General FWA analysis with visualizations.
- healthcare_fwa_graph_analysis.ipynb: Graph-based ML approach inspired by research papers.
- healthcare_fwa_opt2_analysis.ipynb: Machine learning to detect suspicious claims.
.
├── notebooks/
│   ├── healthcare_fwa_analysis.ipynb
│   ├── healthcare_fwa_graph_analysis.ipynb
│   └── healthcare_fwa_opt2_analysis.ipynb
├── data/
│   └── sample_data/
│       └── csv/
│           └── [county folders]
├── containers/
│   ├── Dockerfile
│   └── other configs
├── requirements.txt
├── LICENSE.md
└── README.md
This project utilizes a synthetic healthcare dataset generated through the Synthea tool. Synthea is an open-source simulator designed to produce realistic, anonymized patient records based on population models. This approach enables scalable and privacy-preserving research and analysis.
INFO:
Due to limited and insufficient data available from HSE and broader European sources, we opted to use synthetic data generated via Synthea. This allows us to create a comprehensive, realistic dataset that includes diverse patient profiles, claims, and healthcare scenarios necessary for effective fraud detection analysis. Using synthetic data ensures data privacy, enables scalable testing, and provides the detailed information required to develop and validate robust fraud detection models in a controlled environment.
- Privacy & Confidentiality: Synthetic data circumvents privacy concerns associated with real patient records.
- Controlled Environment: Facilitates testing, experimentation, and validation without risking sensitive information.
- Resource Efficiency: Eliminates the need for complex data access procedures and compliance hurdles.
- Customizability: Data can be tailored to specific scenarios, demographic distributions, or rare conditions, enhancing research flexibility.
The datasets used include:
- Patients Data (patients.csv): Demographics, medical history, and socioeconomic info.
- Claims Data (claims.csv): Claims details such as amount, date, diagnoses, and provider.
- Claims Transactions (claims_transactions.csv): Detailed transaction info linked to claims.
- Providers Data (providers.csv): Provider specialties, locations, and contact info.
The datasets are synthetically generated and stored in CSV format, providing a realistic basis for analysis.
The datasets are stored in the following directory structure (within the repo):
data/
└── sample_data/
    └── csv/
        └── [county-specific folders]/
            └── *.csv
Each folder contains the CSV files for one county (e.g., galway, limerick, dublin, cork).
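Because each county has its own folder, a small loader can combine one table across all counties. This is a sketch under assumptions: the `load_county_tables` helper and the added `COUNTY` column are hypothetical and not part of the notebooks, and it assumes every county folder uses the same file names (for example `claims.csv`).

```python
from pathlib import Path

import pandas as pd

def load_county_tables(csv_root, table_name):
    """Concatenate <csv_root>/<county>/<table_name>.csv across all county folders."""
    frames = []
    for county_dir in sorted(Path(csv_root).iterdir()):
        csv_path = county_dir / f"{table_name}.csv"
        if county_dir.is_dir() and csv_path.exists():
            frame = pd.read_csv(csv_path)
            # Hypothetical helper column so rows stay traceable to their county
            frame["COUNTY"] = county_dir.name
            frames.append(frame)
    if not frames:
        raise FileNotFoundError(f"No {table_name}.csv found under {csv_root}")
    return pd.concat(frames, ignore_index=True)
```

With the layout shown above, a call such as `load_county_tables("data/sample_data/csv", "claims")` would return a single claims table covering every county folder present.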
Warning: Using Synthea-generated synthetic datasets enables safe and flexible healthcare data analysis. However, it is crucial to understand and account for the inherent biases to ensure meaningful and responsible insights.
Library | Purpose |
---|---|
scikit-learn | ML algorithms (Random Forest, Isolation Forest) + evaluation tools |
pandas | Data handling & manipulation |
matplotlib / plt | Static visualizations (charts, distributions) |
seaborn | Enhanced statistical plots (correlation, importance) |
plotly.express | Interactive dashboards & visual analytics |
sklearn.inspection | Model interpretability (partial dependence) |
numpy | Numerical operations & array management |
Ensure you have the following installed:
- Python 3.8+ environment
- Git & Git LFS
- Jupyter Notebook (or compatible IDE)
- Packages from requirements.txt:
pip install -r requirements.txt
Alternatively, install the core packages directly (networkx is included because the network analysis step imports it):
pip install pandas numpy matplotlib seaborn plotly scikit-learn networkx
git clone https://github.com/HackmaniaGX/neural-nexus-healthcare-fwa-analysis.git
cd neural-nexus-healthcare-fwa-analysis
This repository uses Git Large File Storage (LFS) for managing large files such as datasets and models. When cloning this repository, ensure that you have Git LFS installed and initialized on your system to fetch these files properly.
Once the Git LFS package is installed on your system, initialize it and fetch the large files:
git lfs install
git lfs pull
To build and run the container:
docker build -t healthcare-fwa .
docker run -p 8888:8888 -v "$(pwd):/app" healthcare-fwa
This starts a Jupyter Notebook server accessible at http://localhost:8888.
Alternatively, set up a Python environment:
pip install -r requirements.txt
Launch Jupyter Notebook in the project directory:
jupyter notebook
Open the notebooks in the notebooks/ directory:
- Initial Data Analysis & Visualizations: healthcare_fwa_analysis.ipynb
- Graph-Based ML Analysis: healthcare_fwa_graph_analysis.ipynb
- ML-based Detection Algorithm: healthcare_fwa_opt2_analysis.ipynb
Note: Update dataset paths if necessary to point to your local data folders.
For a seamless experience and enhanced computational resources, we recommend running these notebooks within Google Cloud Vertex AI Workbench. This managed environment simplifies setup, scales easily, and provides powerful GPUs/TPUs for large-scale data processing.
- Create a Vertex AI Workbench Notebook Instance: Follow the Google Cloud documentation to set up a Managed Notebook environment.
- Clone the Repository: Use Git within the notebook to clone this repository or upload files directly.
- Configure Environment: Install necessary dependencies, either via requirements.txt or conda environments.
- Open and Run Notebooks: Launch the notebooks directly from the Vertex AI interface.
Below are some reference screenshots illustrating the setup process and interface:
Fig 1: Vertex AI Workspace Dashboard
Fig 2: Opening and running notebooks within the environment
The high-level steps involved in processing the data are outlined below.
import pandas as pd
patients_df = pd.read_csv('path/to/patients.csv')
claims_df = pd.read_csv('path/to/claims.csv')
transactions_df = pd.read_csv('path/to/claims_transactions.csv')
providers_df = pd.read_csv('path/to/providers.csv')
# Convert date columns to datetime
patients_df['BIRTHDATE'] = pd.to_datetime(patients_df['BIRTHDATE'], errors='coerce')
claims_df['SERVICEDATE'] = pd.to_datetime(claims_df['SERVICEDATE'], errors='coerce')
transactions_df['FROMDATE'] = pd.to_datetime(transactions_df['FROMDATE'], errors='coerce')
transactions_df['TODATE'] = pd.to_datetime(transactions_df['TODATE'], errors='coerce')
# Merge
merged_df = claims_df.merge(transactions_df, on='CLAIMID', how='left')
import seaborn as sns
import matplotlib.pyplot as plt
sns.countplot(data=patients_df, x='GENDER')
plt.title('Patient Gender Distribution')
plt.show()
import datetime as dt
today = dt.date.today()
patients_df['AGE'] = patients_df['BIRTHDATE'].apply(lambda x: today.year - x.year - ((today.month, today.day) < (x.month, x.day)))
sns.histplot(patients_df['AGE'], bins=20)
plt.title('Patient Age Distribution')
plt.show()
# Claim amount distribution
sns.boxplot(x=merged_df['AMOUNT'])
plt.title('Claim Amounts')
plt.show()
# Claims over Time
merged_df['SERVICEDATE'].hist(bins=30)
plt.title('Claims Over Time')
plt.show()
# Top providers by number of claims
provider_claim_counts = merged_df['PROVIDERID'].value_counts().head(10)
provider_claim_counts.plot(kind='bar')
plt.ylabel('Number of Claims')
plt.title('Top 10 Providers by Claim Count')
plt.show()
# Flag claims above the 99th percentile of claim amounts
high_claim_threshold = merged_df['AMOUNT'].quantile(0.99)
suspicious_claims = merged_df[merged_df['AMOUNT'] > high_claim_threshold]
# Flag providers with more than 10 claim rows on a single day
merged_df['DATE'] = merged_df['SERVICEDATE'].dt.date
claims_per_provider_day = merged_df.groupby(['PROVIDERID', 'DATE']).size()
suspicious_frequency = claims_per_provider_day[claims_per_provider_day > 10]
from sklearn.ensemble import IsolationForest
import numpy as np
# Prepare features (missing amounts are treated as 0)
features = merged_df[['AMOUNT']].fillna(0)
clf = IsolationForest(contamination=0.01, random_state=42)
# fit_predict returns labels, not scores: -1 for anomalies, 1 for normal points
merged_df['anomaly_score'] = clf.fit_predict(features)
# Filter suspected fraudulent claims
suspects = merged_df[merged_df['anomaly_score'] == -1]
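The `anomaly_score` column above holds only the -1 / 1 labels from `fit_predict`. For prioritizing manual review, a continuous score can be more useful. The sketch below shows one way to get it with `decision_function`. It runs on toy data standing in for `merged_df`, and the `toy_df`, `if_score`, `if_label` and `review_queue` names are illustrative assumptions, not part of the notebooks.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Toy stand-in for merged_df: 200 ordinary amounts plus two extreme ones.
rng = np.random.default_rng(42)
amounts = np.append(rng.normal(100, 15, 200), [5000.0, 7500.0])
toy_df = pd.DataFrame({"AMOUNT": amounts})

model = IsolationForest(contamination=0.01, random_state=42)
model.fit(toy_df[["AMOUNT"]])

# decision_function: lower is more abnormal; negative values map to label -1
toy_df["if_score"] = model.decision_function(toy_df[["AMOUNT"]])
toy_df["if_label"] = model.predict(toy_df[["AMOUNT"]])

# Most abnormal claims first, ready to hand to reviewers
review_queue = toy_df.sort_values("if_score").head(5)
print(review_queue)
```

Sorting by the score gives reviewers an ordered queue rather than an unordered set of flags. The `contamination` value still decides where the -1 / 1 cut falls, so it is worth tuning against reviewer feedback.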
import networkx as nx
G = nx.Graph()
for _, row in suspects.iterrows():
G.add_node(row['PROVIDERID'], type='provider')
G.add_node(row['PATIENTID'], type='patient')
G.add_edge(row['PROVIDERID'], row['PATIENTID'])
nx.draw(G, with_labels=True)
plt.title('Suspected Provider-Patient Network')
plt.show()
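Beyond drawing the graph, simple graph measures can turn the picture into a list of candidates. The following sketch uses toy edges in place of the `suspects` rows and relies only on standard networkx calls. The node names and the `G_toy`, `clusters` and `provider_degree` variables are made up for illustration.

```python
import networkx as nx

# Toy provider-patient edges; identifiers are made up for illustration.
edges = [
    ("prov_A", "pat_1"), ("prov_A", "pat_2"), ("prov_B", "pat_1"),
    ("prov_B", "pat_2"), ("prov_C", "pat_3"),
]
G_toy = nx.Graph()
for provider, patient in edges:
    G_toy.add_node(provider, type="provider")
    G_toy.add_node(patient, type="patient")
    G_toy.add_edge(provider, patient)

# Connected components group actors that are linked through shared claims.
clusters = sorted(nx.connected_components(G_toy), key=len, reverse=True)

# Providers linked to many distinct flagged patients stand out by degree.
provider_degree = {
    node: G_toy.degree(node)
    for node, attrs in G_toy.nodes(data=True)
    if attrs["type"] == "provider"
}
print(clusters[0])
print(provider_degree)
```

Here the largest component contains `prov_A` and `prov_B`, which share both of their patients. Overlap of that kind is what a reviewer might want to inspect first, although shared patients can also reflect ordinary referral patterns.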
The analysis techniques implemented are specifically designed to identify fraudulent activities within healthcare claims:
- Suspicious Claim Amounts: Flagging claims with abnormally high amounts or units that are unlikely to be legitimate.
- Unusual Claim Frequency: Detecting providers or patients with an unusually high number of claims, indicating potential overutilization or collusive behavior.
- Suspicious Service Codes: Identifying claims associated with known or potentially fraudulent procedure codes.
- Rapid Claim Submissions: Highlighting providers submitting multiple claims within short timeframes, a pattern often associated with fraud rings.
- Anomaly Scores from Machine Learning: Utilizing algorithms like Isolation Forest and Random Forest to detect claims that significantly deviate from standard patterns, flagging them as potential fraud cases.
- Network and Graph Analysis: Revealing organized collusion or fraud rings by visualizing suspicious provider-patient relationships.
These methods collectively enhance the ability to detect, prioritize, and investigate fraudulent activities efficiently, helping to reduce financial losses and protect the integrity of healthcare systems.
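The rapid-submission indicator from the list above can be approximated by measuring the gap between consecutive claims from the same provider. This is a minimal sketch that treats `SERVICEDATE` as the submission timestamp, which is an assumption. The `flag_rapid_submissions` helper and the 5-minute window are hypothetical choices for illustration.

```python
import pandas as pd

# Toy claims; timestamps are illustrative only.
claims = pd.DataFrame({
    "PROVIDERID": ["p1", "p1", "p1", "p2", "p2"],
    "SERVICEDATE": pd.to_datetime([
        "2025-01-01 09:00", "2025-01-01 09:02", "2025-01-01 09:03",
        "2025-01-01 09:00", "2025-01-03 14:00",
    ]),
})

def flag_rapid_submissions(df, max_gap="5min"):
    """Mark claims filed within max_gap of the same provider's previous claim."""
    ordered = df.sort_values(["PROVIDERID", "SERVICEDATE"]).copy()
    # Time since the provider's previous claim; NaT for each provider's first claim
    gap = ordered.groupby("PROVIDERID")["SERVICEDATE"].diff()
    ordered["RAPID"] = gap <= pd.Timedelta(max_gap)
    return ordered

flagged = flag_rapid_submissions(claims)
rapid_counts = flagged.groupby("PROVIDERID")["RAPID"].sum()
print(rapid_counts)
```

In the toy data `p1` has two claims that follow the previous one within five minutes, while `p2` has none. A suitable window depends on how timestamps are recorded in the real data.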
High claim amounts and frequent claims flagged as potential fraud.
Outlier claims identified via anomaly detection.
Network analysis reveals clusters of providers and patients with suspicious interactions.
Next steps include integrating supervised learning models with labeled data, refining feature sets, and developing dashboards for ongoing monitoring.
This analysis serves as an initial screening tool.
False positives are possible; manual review is essential.
Models should be continuously updated with new data and feedback.
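As a sketch of the supervised next step mentioned above, a Random Forest could be trained once reviewer labels exist. Everything in the example below is an assumption for illustration: the toy features, the `IS_FRAUD` label rule and the `NOISE` column are invented, and no result here reflects the project's data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "AMOUNT": rng.gamma(shape=2.0, scale=100.0, size=n),
    "CLAIMS_PER_DAY": rng.integers(1, 15, size=n),
    "NOISE": rng.normal(size=n),  # deliberately uninformative feature
})
# Hypothetical reviewer labels: large AND frequent claims are marked as fraud.
toy["IS_FRAUD"] = ((toy["AMOUNT"] > 300) & (toy["CLAIMS_PER_DAY"] > 8)).astype(int)

X = toy[["AMOUNT", "CLAIMS_PER_DAY", "NOISE"]]
y = toy["IS_FRAUD"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# class_weight="balanced" compensates for fraud being the rare class
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
clf.fit(X_train, y_train)

# Impurity-based importances show which features drive the predictions
importances = pd.Series(clf.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Feature importances like these support the transparency goal by showing which indicators the model leans on. With real labels, performance on the held-out `X_test` and `y_test` split should be checked before any operational use.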
The following images illustrate the key findings from our analysis. These visualizations highlight suspicious patterns, network interactions, outlier claims, and other relevant metrics identified during our investigation. The results demonstrate the effectiveness of our graph-based and machine learning approaches in flagging potential healthcare fraud. Review the images below to gain insights into the detected anomalies and the overall performance of our detection techniques.
- Synthea: Synthetic Healthcare Data Generation
- Main GitHub repo: https://github.com/synthetichealth/synthea
- International extension: https://github.com/synthetichealth/synthea-international
- Generating Data for Ireland (IE):
- Synthea can be customized for Ireland's healthcare systems, demographics, and coding standards.
- See README.md in the respective repo for setup instructions.
- Documentation & Code: Synthea documentation: https://github.com/synthetichealth/synthea/wiki
WARNING: Please note that some content in this repository, including explanations, summaries, and documentation, has been generated or enhanced using AI tools to improve clarity and detail. Users should review and validate these sections as needed.
- Nithin Mohan T K - @nithinmohantk
- Inspiration from Research Paper - Graph Analysis for Detecting Fraud, Waste, and Abuse in Healthcare Data by Juan Liu, Eric Bier, Aaron Wilson, Tomo Honda, Sricharan Kumar, Leilani Gilpin, John Guerra-Gomez and Daniel Davies - Palo Alto Research Center
- Contributions from Neural Nexus Team - David Mullins, Anna Coyle, Dovile Janusauskaite & Nithin Mohan
This project is licensed under the PRIVATE & COPYRIGHTED License. See LICENSE.md for details.
Β© Neural Nexus Team - All rights reserved.
With love for healthcare data analysis and fraud detection.