Claims Transformation Pipeline

This Spark-based ETL pipeline processes insurance claim and contract data. It reads parameters from a configuration file, loads the source data, applies transformations, and writes transaction-level output in Parquet format.


Project Structure

.
├── config/
│   └── config.json             # Configuration file with input/output paths
├── main.py                     # Entry script that ties everything together
├── src/
│   ├── transform.py            # Core transformation logic
│   ├── io_base.py              # Abstract class defining IO interface
│   ├── io_local.py             # Concrete implementation for local CSV + Parquet
│   └── utils/
│       ├── logger.py           # Logger setup for file-based logging
│       ├── spark_utils.py      # SparkSession builder
│       └── utils.py            # Utility to load configuration
├── data/
│   ├── contract.csv            # Sample input contract data
│   └── claim.csv               # Sample input claim data
├── output/
│   └── transactions.parquet    # Final transformed output in Parquet format
├── tests/
│   ├── conftest.py             # Test fixtures
│   ├── test_transform.py       # Unit tests for transform logic
│   └── test_utils.py           # Unit tests for utility functions
├── requirements.txt            # Python package requirements
├── README.md                   # Project overview and instructions
└── exploration.ipynb           # Exploratory notebook (for local testing)


⚙️ How to Run

  1. Install dependencies:
pip install -r requirements.txt
  2. Run the pipeline locally:
python main.py

🔢 Pipeline Logic

  • Reads contract and claim CSVs using LocalCSVParquetIO

  • Joins on CONTRACT_ID and SOURCE_SYSTEM

  • Applies transformations such as:

    • Type casting
    • Regex parsing
    • Standardized timestamp formatting
  • Calls an external API to derive NSE_ID

  • Saves the final DataFrame as a Parquet file (see the sketch below).
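
A minimal PySpark sketch of the flow described above. Everything beyond CONTRACT_ID, SOURCE_SYSTEM, CLAIM_ID, NSE_ID, and the hashify URL from the config example is an illustrative assumption, not the actual transform.py code.

# Illustrative sketch of the join + enrichment flow; column names, timestamp
# format, and the API response field are assumptions, not the repo's real code.
import requests
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("claims_etl_sketch").getOrCreate()

contracts = spark.read.csv("data/contract.csv", header=True, inferSchema=True)
claims = spark.read.csv("data/claim.csv", header=True, inferSchema=True)

# Join claims to contracts on the composite business key.
joined = claims.join(contracts, on=["CONTRACT_ID", "SOURCE_SYSTEM"], how="left")

# Example transformations: type casting and standardized timestamp formatting
# (the CREATION_DATE column and its pattern are assumed for illustration).
transformed = (
    joined
    .withColumn("CONTRACT_ID", F.col("CONTRACT_ID").cast("long"))
    .withColumn("CREATION_DATE", F.to_timestamp("CREATION_DATE", "dd.MM.yyyy HH:mm"))
)

@F.udf(returnType=StringType())
def derive_nse_id(claim_id):
    # Calls the hashing API from config.json; the "Digest" response field is assumed.
    url = f"https://api.hashify.net/hash/md4/hex?value={claim_id}"
    return requests.get(url, timeout=5).json().get("Digest")

transformed = transformed.withColumn("NSE_ID", derive_nse_id(F.col("CLAIM_ID")))

transformed.write.mode("overwrite").parquet("output/transactions.parquet")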


📃 Config Example (config.json)

{
  "input": {
    "contract": "data/contract.csv",
    "claim": "data/claim.csv"
  },
  "output": {
    "transactions": "output/transactions.parquet"
  },
  "api": {
    "nse_url_template": "https://api.hashify.net/hash/md4/hex?value={CLAIM_ID}"
  }
}
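
A minimal sketch of how such a configuration could be loaded. The actual helper lives in src/utils/utils.py; the function name below is an assumption.

# Hypothetical config loader; the real helper in src/utils/utils.py may differ.
import json
from pathlib import Path

def load_config(path: str = "config/config.json") -> dict:
    """Read the JSON configuration and return it as a dictionary."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)

config = load_config()
contract_path = config["input"]["contract"]      # data/contract.csv
output_path = config["output"]["transactions"]   # output/transactions.parquet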

✅ Unit Testing

Test scripts are located in the tests/ folder. To run all tests:

pytest tests/
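
A hedged example of how a transform test might be structured with a shared local SparkSession fixture; the fixture and test names are assumptions, not the repo's actual tests.

# Hypothetical test sketch; conftest.py in the repo may define a different fixture.
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()

def test_join_keeps_all_claims(spark):
    claims = spark.createDataFrame(
        [("C1", "SYS_A", "CL_1")], ["CONTRACT_ID", "SOURCE_SYSTEM", "CLAIM_ID"]
    )
    contracts = spark.createDataFrame(
        [("C1", "SYS_A", "Contract 1")], ["CONTRACT_ID", "SOURCE_SYSTEM", "CONTRACT_NAME"]
    )
    joined = claims.join(contracts, ["CONTRACT_ID", "SOURCE_SYSTEM"], "left")
    assert joined.count() == 1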

🚀 Output

  • Enriched transaction data is written to the path specified in config.json as a Parquet file.
  • Sample schema includes fields like CLAIM_ID, TRANSACTION_TYPE, CONFORMED_VALUE, NSE_ID, and timestamp columns.
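
To inspect the written output, something like the following works; the path follows the sample config and the column names come from the sample schema above.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("inspect_output").getOrCreate()
df = spark.read.parquet("output/transactions.parquet")
df.printSchema()  # e.g. CLAIM_ID, TRANSACTION_TYPE, CONFORMED_VALUE, NSE_ID, timestamps
df.select("CLAIM_ID", "NSE_ID").show(5, truncate=False)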

🔍 Logs

  • Logs are written to the logs/ folder using timestamped filenames.
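
A minimal sketch of a file logger with a timestamped filename; the actual setup lives in src/utils/logger.py and may differ.

# Hypothetical logger setup; src/utils/logger.py may configure this differently.
import logging
from datetime import datetime
from pathlib import Path

def get_logger(name: str = "claims_etl") -> logging.Logger:
    Path("logs").mkdir(exist_ok=True)
    log_file = Path("logs") / f"{name}_{datetime.now():%Y%m%d_%H%M%S}.log"
    logging.basicConfig(
        filename=str(log_file),
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s - %(message)s",
    )
    return logging.getLogger(name)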

Proposal: Handling New Batches Without Overwriting Previous Transactions

One way to keep ingesting new batches without overwriting existing records:

  1. Before merging into the table, apply logic to identify and filter out duplicate transactions based on a unique key (e.g. CLAIM_ID + CONTRACT_ID), and maintain a deduplicated view through a Spark SQL view or Delta Lake merge logic.
  2. Alternatively, store the final data in a partitioned manner with an identifier column or a per-row hash (e.g. SHA-256). The identifier can be a batch ID, an ingestion timestamp, or anything else that marks which batch a row arrived in, so incoming records can be distinguished from existing ones. A left anti join then skips the records that already exist (see the sketch below).
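
A rough PySpark sketch of the second approach; the staging path and BATCH_ID value are illustrative assumptions.

# Illustrative incremental-append sketch: keep existing rows, append only new ones.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("incremental_append").getOrCreate()

existing = spark.read.parquet("output/transactions.parquet")
incoming = spark.read.parquet("staging/new_batch.parquet")  # hypothetical staging path

key_cols = ["CLAIM_ID", "CONTRACT_ID"]

# Left anti join: keep only incoming rows whose key is not already present.
new_rows = incoming.join(existing.select(*key_cols), on=key_cols, how="left_anti")

# Tag the batch so later runs can tell which ingestion wrote each row.
new_rows = new_rows.withColumn("BATCH_ID", F.lit("2024_06_01"))  # or an ingestion timestamp

new_rows.write.mode("append").parquet("output/transactions.parquet")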
