This project implements an end-to-end data engineering pipeline on Azure to solve challenges in the Healthcare Revenue Cycle Management (RCM) domain. RCM ensures hospitals manage financial processes effectively, from patient registration to claim settlement.
The solution builds an automated Medallion Architecture (Bronze → Silver → Gold) to ingest data from multiple sources (EMR, Claims, APIs), transform it into a common data model (CDM) with Slowly Changing Dimensions (SCD2), and deliver fact/dimension tables for analytics and KPI reporting such as Accounts Receivable > 90 Days and Days in AR.
- Streamline Accounts Receivable (AR) analysis to reduce outstanding collection periods.
- Track KPIs such as:
- AR > 90 days (aging analysis).
- Days in AR (average collection period).
- Enable hospitals to monitor payments, claims, and follow-ups effectively.
- Provide stakeholders with actionable insights to improve revenue cycle efficiency.
Architecture: Medallion Framework (Landing → Bronze → Silver → Gold)
-
Data Sources
- EMR (Azure SQL DB) → Patients, Providers, Transactions, Encounters, Departments
- Claims Data (Flat Files → ADLS Gen2 Landing)
- NPI & ICD Codes (Public APIs → ADLS Gen2 Bronze)
- CPT Codes (Flat Files → ADLS Gen2 Landing)
-
Data Pipeline
- Azure Data Factory (ADF) → Orchestration & metadata-driven ingestion
- Azure Data Lake Gen2 (ADLS) → Landing, Bronze, Silver, Gold storage
- Azure Databricks → Transformations, CDM, SCD2 implementation
- Delta Lake → Optimized storage for Silver & Gold layers
- Azure Key Vault → Secure credential management
-
Medallion Layers
- Landing → Raw Claims/Flat Files, API dumps
- Bronze → Source-of-truth in Parquet format
- Silver → Data Cleaning, SCD2, Common Data Model (CDM), Quality Checks
- Gold → Aggregated Fact & Dimension tables for BI & analytics
-
Consumption
- Gold layer feeds into BI/Analytics teams for KPI dashboards (e.g., AR trends, Payment delays).
- Azure Data Factory (ADF) – Orchestration, Scheduling, Metadata-driven pipelines
- Azure Data Lake Gen2 (ADLS) – Data storage (Landing/Bronze/Silver/Gold)
- Azure Databricks (PySpark, Delta Lake) – Data transformation, SCD2, CDM
- Azure SQL DB – Source EMR system
- Azure Key Vault – Secrets & credential management
- Unity Catalog (Databricks) – Governance & cataloging
- Power BI (Future Scope) – KPI Reporting & visualization
- Automated Ingestion from SQL DB, Flat Files, and APIs.
- Audit Logging to track loads and incremental pipelines.
- Incremental & Full Loads based on config-driven pipelines.
- SCD2 Implementation for patient, transactions, and claims history.
- Data Quality Checks (quarantine invalid records).
- Parallelized Pipelines for efficiency.
- Security & Governance via Key Vault and Unity Catalog.
azureProject/
│── landing/
│── bronze/
│── silver/
│── gold/
│── configs/
└── emr/load_config.csv
- % of AR > 90 days = (AR > 90 days) / Total AR
- Days in AR = Total AR / Avg. Daily Charges
- Provider-level AR aging trends
- Claim settlement rates (Insurance vs Patient)
- Enabled hospitals to reduce AR aging with accurate insights.
- Provided fact/dimension models for scalable BI reporting.
- Automated pipelines reduced manual efforts in ETL by >60%.
- Ensured secure, governed, and reliable data delivery for RCM analytics.