
Insurance Data Lake with AWS CDK and Glue

Project Overview

This project builds a data lake architecture for insurance data using AWS CDK for infrastructure as code and AWS Glue for ETL pipelines. It extracts incremental data from an RDS MySQL database, stages it in an S3 raw zone, transforms it into a processed zone, and prepares curated data for analytics.

Architecture Diagram

Components

  • VPC: A public subnet-only VPC (no NAT gateways) to host resources.
  • Amazon RDS (MySQL): Source database with insurance data.
  • S3 Buckets:
    • RawZoneBucket: Stores raw data extracts as Parquet files partitioned by date.
    • ProcessedZoneBucket: Stores cleaned and processed data.
    • CuratedZoneBucket: Stores analytics-ready curated datasets.
  • AWS Glue:
    • Glue Jobs:
      • Incremental extract from RDS → Raw S3 bucket.
      • Transform Raw → Processed S3 bucket.
    • Glue Job Role with permissions on S3, DynamoDB, and Glue.
  • DynamoDB Table: Tracks last processed timestamp per table for incremental ETL.
  • IAM Roles: For secure Glue job access.
  • Glue Script Assets: Python ETL scripts for incremental extraction and transformation.
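
The components above could be declared in the CDK stack roughly as follows. This is a minimal sketch rather than the repository's actual code: the construct IDs, the state table's key name (table_name), and the removal policies are illustrative assumptions.

    // lib/cdk-insurance-app-stack.ts (illustrative excerpt)
    import { Stack, StackProps, RemovalPolicy } from 'aws-cdk-lib';
    import { Construct } from 'constructs';
    import * as ec2 from 'aws-cdk-lib/aws-ec2';
    import * as s3 from 'aws-cdk-lib/aws-s3';
    import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
    import * as iam from 'aws-cdk-lib/aws-iam';

    export class CdkInsuranceAppStack extends Stack {
      constructor(scope: Construct, id: string, props?: StackProps) {
        super(scope, id, props);

        // VPC with public subnets only, so no NAT gateways are created
        const vpc = new ec2.Vpc(this, 'DataLakeVpc', {
          natGateways: 0,
          subnetConfiguration: [{ name: 'public', subnetType: ec2.SubnetType.PUBLIC }],
        });

        // The three data lake zones
        const rawZoneBucket = new s3.Bucket(this, 'RawZoneBucket', { removalPolicy: RemovalPolicy.DESTROY });
        const processedZoneBucket = new s3.Bucket(this, 'ProcessedZoneBucket', { removalPolicy: RemovalPolicy.DESTROY });
        const curatedZoneBucket = new s3.Bucket(this, 'CuratedZoneBucket', { removalPolicy: RemovalPolicy.DESTROY });

        // State table keyed by source table name (assumed key), holding the last processed timestamp
        const stateTable = new dynamodb.Table(this, 'EtlStateTable', {
          partitionKey: { name: 'table_name', type: dynamodb.AttributeType.STRING },
          billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
        });

        // Role assumed by the Glue jobs, granted access to the buckets and the state table
        const glueJobRole = new iam.Role(this, 'GlueJobRole', {
          assumedBy: new iam.ServicePrincipal('glue.amazonaws.com'),
        });
        [rawZoneBucket, processedZoneBucket, curatedZoneBucket].forEach(b => b.grantReadWrite(glueJobRole));
        stateTable.grantReadWriteData(glueJobRole);
      }
    }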

Prerequisites

  • AWS CLI configured with credentials that have appropriate IAM permissions.
  • Node.js and the AWS CDK CLI installed.

Deployment Steps

  1. Clone the repo:

    git clone <your-repo-url>
    cd <repo-directory>
  2. Install dependencies:

    npm install
  3. Bootstrap the CDK environment (if needed):

    cdk bootstrap
  4. Deploy the stack:

    cdk deploy

    This deploys:

    • VPC
    • S3 buckets (raw, processed, curated)
    • RDS MySQL instance
    • Glue Job roles and Glue Jobs
    • DynamoDB state tracking table

Glue ETL Scripts

  • insurance_etl_incremental.py: Extracts new/updated records from RDS incrementally based on a timestamp column, writes to raw S3 bucket partitioned by date.
  • raw_to_processed.py: Reads raw data, performs cleaning/transformation, writes to processed S3 bucket.
  • processed_to_curated.py: Transforms processed data into curated datasets optimized for analytics.

Scripts are uploaded automatically as CDK assets.
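
As a rough sketch of how a script becomes a CDK asset, the stack might bundle the local file with aws-s3-assets and point a Glue job at the uploaded object. The job name, Glue version, and defaultArguments keys below are assumptions, and glueJobRole, stateTable, and rawZoneBucket refer to the constructs sketched earlier.

    // Illustrative excerpt from the stack constructor
    import * as path from 'path';
    import * as assets from 'aws-cdk-lib/aws-s3-assets';
    import * as glue from 'aws-cdk-lib/aws-glue';

    // Package the local script and upload it to the CDK asset bucket at deploy time
    const extractScript = new assets.Asset(this, 'IncrementalExtractScript', {
      path: path.join(__dirname, '../glue/insurance_etl_incremental.py'),
    });
    extractScript.grantRead(glueJobRole);

    new glue.CfnJob(this, 'IncrementalExtractJob', {
      name: 'insurance-etl-incremental',          // assumed job name
      role: glueJobRole.roleArn,
      glueVersion: '4.0',
      command: {
        name: 'glueetl',                          // Spark ETL job
        pythonVersion: '3',
        scriptLocation: extractScript.s3ObjectUrl,
      },
      defaultArguments: {
        '--state_table': stateTable.tableName,    // assumed argument names read by the script
        '--raw_bucket': rawZoneBucket.bucketName,
      },
    });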


Running the ETL

  • Trigger Glue jobs manually or automate them with EventBridge or Step Functions (a scheduling sketch follows this list).
  • Glue jobs read the DynamoDB state table to track incremental processing.
  • Data lands in S3 as partitioned Parquet files for efficient querying.
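
For the EventBridge option, a minimal sketch could schedule the Step Functions workflow (see the orchestration section below) once a day; etlStateMachine is assumed to be the state machine defined in the stack.

    // Illustrative excerpt from the stack constructor
    import * as events from 'aws-cdk-lib/aws-events';
    import * as targets from 'aws-cdk-lib/aws-events-targets';
    import { Duration } from 'aws-cdk-lib';

    // Kick off the ETL workflow on a daily schedule
    new events.Rule(this, 'DailyEtlSchedule', {
      schedule: events.Schedule.rate(Duration.days(1)),
      targets: [new targets.SfnStateMachine(etlStateMachine)],
    });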

Glue Crawlers and Data Catalog

  • Glue Crawlers scan the raw, processed, and curated S3 buckets to update the Glue Data Catalog with current table schemas.
  • This enables seamless querying through Amazon Athena or Redshift Spectrum.
  • Schema changes are handled with Glue's schema change policy (updateBehavior: LOG, deleteBehavior: LOG), so the crawlers log schema changes instead of silently updating or dropping existing catalog tables.
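
A crawler with this policy might be declared like the sketch below; the crawler name, catalog database name, and the reuse of glueJobRole are assumptions.

    // Illustrative excerpt from the stack constructor
    import * as glue from 'aws-cdk-lib/aws-glue';

    new glue.CfnCrawler(this, 'ProcessedZoneCrawler', {
      name: 'insurance-processed-crawler',
      role: glueJobRole.roleArn,              // a role with Glue and S3 read permissions
      databaseName: 'insurance_processed',    // assumed Glue database name
      targets: {
        s3Targets: [{ path: `s3://${processedZoneBucket.bucketName}/` }],
      },
      schemaChangePolicy: {
        updateBehavior: 'LOG',   // log schema changes instead of updating tables in place
        deleteBehavior: 'LOG',   // log removed objects instead of deleting catalog tables
      },
    });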

Orchestration with AWS Step Functions

  • All Glue jobs and crawlers are orchestrated with Step Functions so each stage runs in the correct order.
  • Retry logic with exponential backoff handles transient failures (up to 2 retries, starting from a 10-second interval).
  • Failures are caught and routed to a Lambda function that publishes notifications to an SNS topic for real-time alerting.
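
A condensed sketch of this pattern in CDK, assuming the Glue job name used earlier and a handler named notify_failure.handler (the actual handler name may differ):

    // Illustrative excerpt from the stack constructor
    import * as path from 'path';
    import * as sfn from 'aws-cdk-lib/aws-stepfunctions';
    import * as tasks from 'aws-cdk-lib/aws-stepfunctions-tasks';
    import * as lambda from 'aws-cdk-lib/aws-lambda';
    import { Duration } from 'aws-cdk-lib';

    // Lambda that publishes failure notifications to SNS (wraps lambda/notify_failure.py)
    const notifyFailureFn = new lambda.Function(this, 'NotifyFailureFn', {
      runtime: lambda.Runtime.PYTHON_3_11,
      handler: 'notify_failure.handler',
      code: lambda.Code.fromAsset(path.join(__dirname, '../lambda')),
    });

    // Run the incremental extract job and wait for it to finish
    const runExtract = new tasks.GlueStartJobRun(this, 'RunIncrementalExtract', {
      glueJobName: 'insurance-etl-incremental',
      integrationPattern: sfn.IntegrationPattern.RUN_JOB,
    });
    runExtract.addRetry({ maxAttempts: 2, interval: Duration.seconds(10), backoffRate: 2 });

    // On failure, invoke the notification Lambda
    const notifyFailure = new tasks.LambdaInvoke(this, 'NotifyFailure', {
      lambdaFunction: notifyFailureFn,
    });
    runExtract.addCatch(notifyFailure, { resultPath: '$.error' });

    // Further jobs and crawlers would be chained here with .next()
    const etlStateMachine = new sfn.StateMachine(this, 'EtlStateMachine', {
      definitionBody: sfn.DefinitionBody.fromChainable(runExtract.next(new sfn.Succeed(this, 'EtlSucceeded'))),
    });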

Data Governance with Lake Formation (Optional)

  • Lake Formation can be integrated for fine-grained access control and centralized data governance across the data lake.
  • This project structure supports adding Lake Formation permissions on S3 buckets and Glue Data Catalog tables.
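
If Lake Formation is enabled, a grant could look roughly like the sketch below; the analyst role, database name, and table name are placeholders.

    // Illustrative excerpt from the stack constructor
    import * as lakeformation from 'aws-cdk-lib/aws-lakeformation';

    new lakeformation.CfnPermissions(this, 'AnalystCuratedSelect', {
      dataLakePrincipal: {
        dataLakePrincipalIdentifier: analystRole.roleArn,   // assumed analyst IAM role
      },
      resource: {
        tableResource: {
          databaseName: 'insurance_curated',                // assumed catalog database
          name: 'claims_summary',                           // assumed curated table
        },
      },
      permissions: ['SELECT'],
    });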

BI Dashboard

  • A Streamlit dashboard connects to the curated Athena tables to visualize:
    • Claims summary and customer insights
    • Policy insights by type and premium
  • This provides fast, interactive reporting without provisioning dedicated BI infrastructure.
  • Alternatively, BI tools like Amazon QuickSight or Power BI can be integrated using Athena as a data source.

Cleanup

To delete all resources and avoid ongoing charges:

cdk destroy

Project Structure

├── bin/
│   └── cdk-insurance-app.ts          # CDK entry point
├── lib/
│   └── cdk-insurance-app-stack.ts    # CDK stack definition
├── glue/
│   ├── insurance_etl_incremental.py  # Glue ETL incremental extract script
│   ├── raw_to_processed.py           # Glue ETL transform script
│   └── processed_to_curated.py       # Glue ETL curated transform script
├── lambda/
│   └── notify_failure.py             # Lambda function to handle failure notifications
├── dashboard/
│   └── app.py                        # Streamlit dashboard code
├── package.json
├── cdk.json
└── README.md                         # This file

Notes

  • The RDS instance is publicly accessible for demo convenience. For production, use private subnets and secure connectivity.
  • Secrets like database passwords are stored in plain text here; use AWS Secrets Manager or Parameter Store in real deployments (a sketch follows this list).
  • Customize Glue scripts and schema definitions based on your insurance data specifics.
  • This project provides a foundation for modernizing legacy insurance data pipelines with a scalable, governed, and serverless data lake architecture.
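
For example, CDK can generate the MySQL password and keep it in Secrets Manager instead of hard-coding it. This is a sketch only: vpc refers to the VPC from the earlier excerpt, and the instance size is arbitrary.

    // Illustrative excerpt from the stack constructor
    import * as ec2 from 'aws-cdk-lib/aws-ec2';
    import * as rds from 'aws-cdk-lib/aws-rds';

    const db = new rds.DatabaseInstance(this, 'InsuranceDb', {
      engine: rds.DatabaseInstanceEngine.mysql({ version: rds.MysqlEngineVersion.VER_8_0 }),
      vpc,
      instanceType: ec2.InstanceType.of(ec2.InstanceClass.T3, ec2.InstanceSize.MICRO),
      // Generates a random password and stores it as a Secrets Manager secret
      credentials: rds.Credentials.fromGeneratedSecret('admin'),
      vpcSubnets: { subnetType: ec2.SubnetType.PUBLIC },
      publiclyAccessible: true,   // matches the demo setup; avoid in production
    });
    // Downstream code can then read db.secret instead of a hard-coded password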

License

This project is licensed under the MIT License.

For a detailed walkthrough, read the full project blog here.