
Insurance Data Lake with AWS CDK and Glue

Project Overview

This project builds a data lake architecture for insurance data using AWS CDK for infrastructure as code and AWS Glue for ETL pipelines. It extracts incremental data from an RDS MySQL database, stages it in an S3 raw zone, transforms it into a processed zone, and prepares curated data for analytics.

Architecture Diagram

Components

  • VPC: A public subnet-only VPC (no NAT gateways) to host resources.
  • Amazon RDS (MySQL): Source database with insurance data.
  • S3 Buckets:
    • RawZoneBucket: Stores raw data extracts as Parquet files partitioned by date.
    • ProcessedZoneBucket: Stores cleaned and processed data.
    • CuratedZoneBucket: Stores analytics-ready curated datasets.
  • AWS Glue:
    • Glue Jobs:
      • Incremental extract from RDS → Raw S3 bucket.
      • Transform Raw → Processed S3 bucket.
    • Glue Job Role with permissions on S3, DynamoDB, and Glue.
  • DynamoDB Table: Tracks last processed timestamp per table for incremental ETL.
  • IAM Roles: For secure Glue job access.
  • Glue Script Assets: Python ETL scripts for incremental extraction and transformation.
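
The components above could be declared in the CDK stack roughly as follows. This is a minimal sketch rather than the repository's actual code: the construct IDs, the state table's key name (table_name), and the removal policies are illustrative assumptions.

    // lib/cdk-insurance-app-stack.ts (illustrative excerpt)
    import { Stack, StackProps, RemovalPolicy } from 'aws-cdk-lib';
    import { Construct } from 'constructs';
    import * as ec2 from 'aws-cdk-lib/aws-ec2';
    import * as s3 from 'aws-cdk-lib/aws-s3';
    import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
    import * as iam from 'aws-cdk-lib/aws-iam';

    export class CdkInsuranceAppStack extends Stack {
      constructor(scope: Construct, id: string, props?: StackProps) {
        super(scope, id, props);

        // VPC with public subnets only, so no NAT gateways are created
        const vpc = new ec2.Vpc(this, 'DataLakeVpc', {
          natGateways: 0,
          subnetConfiguration: [{ name: 'public', subnetType: ec2.SubnetType.PUBLIC }],
        });

        // The three data lake zones
        const rawZoneBucket = new s3.Bucket(this, 'RawZoneBucket', { removalPolicy: RemovalPolicy.DESTROY });
        const processedZoneBucket = new s3.Bucket(this, 'ProcessedZoneBucket', { removalPolicy: RemovalPolicy.DESTROY });
        const curatedZoneBucket = new s3.Bucket(this, 'CuratedZoneBucket', { removalPolicy: RemovalPolicy.DESTROY });

        // State table keyed by source table name (assumed key), holding the last processed timestamp
        const stateTable = new dynamodb.Table(this, 'EtlStateTable', {
          partitionKey: { name: 'table_name', type: dynamodb.AttributeType.STRING },
          billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
        });

        // Role assumed by the Glue jobs, granted access to the buckets and the state table
        const glueJobRole = new iam.Role(this, 'GlueJobRole', {
          assumedBy: new iam.ServicePrincipal('glue.amazonaws.com'),
        });
        [rawZoneBucket, processedZoneBucket, curatedZoneBucket].forEach(b => b.grantReadWrite(glueJobRole));
        stateTable.grantReadWriteData(glueJobRole);
      }
    }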

Prerequisites

  • AWS CLI configured with credentials that have appropriate IAM permissions.
  • Node.js and the AWS CDK CLI installed.

Deployment Steps

  1. Clone the repo:

    git clone <your-repo-url>
    cd <repo-directory>
  2. Install dependencies:

    npm install
  3. Bootstrap the CDK environment (if needed):

    cdk bootstrap
  4. Deploy the stack:

    cdk deploy

    This deploys:

    • VPC
    • S3 buckets (raw, processed, curated)
    • RDS MySQL instance
    • Glue Job roles and Glue Jobs
    • DynamoDB state tracking table

Glue ETL Scripts

  • insurance_etl_incremental.py: Extracts new/updated records from RDS incrementally based on a timestamp column, writes to raw S3 bucket partitioned by date.
  • raw_to_processed.py: Reads raw data, performs cleaning/transformation, writes to processed S3 bucket.
  • processed_to_curated.py: Transforms processed data into curated datasets optimized for analytics.

Scripts are uploaded automatically as CDK assets.
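
As a rough sketch of how a script becomes a CDK asset, the stack might bundle the local file with aws-s3-assets and point a Glue job at the uploaded object. The job name, Glue version, and defaultArguments keys below are assumptions, and glueJobRole, stateTable, and rawZoneBucket refer to the constructs sketched earlier.

    // Illustrative excerpt from the stack constructor
    import * as path from 'path';
    import * as assets from 'aws-cdk-lib/aws-s3-assets';
    import * as glue from 'aws-cdk-lib/aws-glue';

    // Package the local script and upload it to the CDK asset bucket at deploy time
    const extractScript = new assets.Asset(this, 'IncrementalExtractScript', {
      path: path.join(__dirname, '../glue/insurance_etl_incremental.py'),
    });
    extractScript.grantRead(glueJobRole);

    new glue.CfnJob(this, 'IncrementalExtractJob', {
      name: 'insurance-etl-incremental',          // assumed job name
      role: glueJobRole.roleArn,
      glueVersion: '4.0',
      command: {
        name: 'glueetl',                          // Spark ETL job
        pythonVersion: '3',
        scriptLocation: extractScript.s3ObjectUrl,
      },
      defaultArguments: {
        '--state_table': stateTable.tableName,    // assumed argument names read by the script
        '--raw_bucket': rawZoneBucket.bucketName,
      },
    });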


Running the ETL

  • Trigger Glue jobs manually or automate them with EventBridge or Step Functions (a scheduling sketch follows this list).
  • Glue jobs read the DynamoDB state table to track incremental processing.
  • Data lands in S3 as partitioned Parquet files for efficient querying.
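
For the EventBridge option, a minimal sketch could schedule the Step Functions workflow (see the orchestration section below) once a day; etlStateMachine is assumed to be the state machine defined in the stack.

    // Illustrative excerpt from the stack constructor
    import * as events from 'aws-cdk-lib/aws-events';
    import * as targets from 'aws-cdk-lib/aws-events-targets';
    import { Duration } from 'aws-cdk-lib';

    // Kick off the ETL workflow on a daily schedule
    new events.Rule(this, 'DailyEtlSchedule', {
      schedule: events.Schedule.rate(Duration.days(1)),
      targets: [new targets.SfnStateMachine(etlStateMachine)],
    });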

Glue Crawlers and Data Catalog

  • Glue Crawlers scan the raw, processed, and curated S3 buckets to update the Glue Data Catalog with current table schemas.
  • This enables seamless querying through Amazon Athena or Redshift Spectrum.
  • Schema changes are handled with Glue's schema change policy (updateBehavior: LOG, deleteBehavior: LOG), so the crawlers log schema changes instead of silently updating or dropping existing catalog tables.
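
A crawler with this policy might be declared like the sketch below; the crawler name, catalog database name, and the reuse of glueJobRole are assumptions.

    // Illustrative excerpt from the stack constructor
    import * as glue from 'aws-cdk-lib/aws-glue';

    new glue.CfnCrawler(this, 'ProcessedZoneCrawler', {
      name: 'insurance-processed-crawler',
      role: glueJobRole.roleArn,              // a role with Glue and S3 read permissions
      databaseName: 'insurance_processed',    // assumed Glue database name
      targets: {
        s3Targets: [{ path: `s3://${processedZoneBucket.bucketName}/` }],
      },
      schemaChangePolicy: {
        updateBehavior: 'LOG',   // log schema changes instead of updating tables in place
        deleteBehavior: 'LOG',   // log removed objects instead of deleting catalog tables
      },
    });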

Orchestration with AWS Step Functions

  • All Glue jobs and crawlers are orchestrated with Step Functions so each stage runs in the correct order.
  • Retry logic with exponential backoff handles transient failures (up to 2 retries, starting from a 10-second interval).
  • Failures are caught and routed to a Lambda function that publishes notifications to an SNS topic for real-time alerting.
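
A condensed sketch of this pattern in CDK, assuming the Glue job name used earlier and a handler named notify_failure.handler (the actual handler name may differ):

    // Illustrative excerpt from the stack constructor
    import * as path from 'path';
    import * as sfn from 'aws-cdk-lib/aws-stepfunctions';
    import * as tasks from 'aws-cdk-lib/aws-stepfunctions-tasks';
    import * as lambda from 'aws-cdk-lib/aws-lambda';
    import { Duration } from 'aws-cdk-lib';

    // Lambda that publishes failure notifications to SNS (wraps lambda/notify_failure.py)
    const notifyFailureFn = new lambda.Function(this, 'NotifyFailureFn', {
      runtime: lambda.Runtime.PYTHON_3_11,
      handler: 'notify_failure.handler',
      code: lambda.Code.fromAsset(path.join(__dirname, '../lambda')),
    });

    // Run the incremental extract job and wait for it to finish
    const runExtract = new tasks.GlueStartJobRun(this, 'RunIncrementalExtract', {
      glueJobName: 'insurance-etl-incremental',
      integrationPattern: sfn.IntegrationPattern.RUN_JOB,
    });
    runExtract.addRetry({ maxAttempts: 2, interval: Duration.seconds(10), backoffRate: 2 });

    // On failure, invoke the notification Lambda
    const notifyFailure = new tasks.LambdaInvoke(this, 'NotifyFailure', {
      lambdaFunction: notifyFailureFn,
    });
    runExtract.addCatch(notifyFailure, { resultPath: '$.error' });

    // Further jobs and crawlers would be chained here with .next()
    const etlStateMachine = new sfn.StateMachine(this, 'EtlStateMachine', {
      definitionBody: sfn.DefinitionBody.fromChainable(runExtract.next(new sfn.Succeed(this, 'EtlSucceeded'))),
    });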

Data Governance with Lake Formation (Optional)

  • Lake Formation can be integrated for fine-grained access control and centralized data governance across the data lake.
  • This project structure supports adding Lake Formation permissions on S3 buckets and Glue Data Catalog tables.
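
If Lake Formation is enabled, a grant could look roughly like the sketch below; the analyst role, database name, and table name are placeholders.

    // Illustrative excerpt from the stack constructor
    import * as lakeformation from 'aws-cdk-lib/aws-lakeformation';

    new lakeformation.CfnPermissions(this, 'AnalystCuratedSelect', {
      dataLakePrincipal: {
        dataLakePrincipalIdentifier: analystRole.roleArn,   // assumed analyst IAM role
      },
      resource: {
        tableResource: {
          databaseName: 'insurance_curated',                // assumed catalog database
          name: 'claims_summary',                           // assumed curated table
        },
      },
      permissions: ['SELECT'],
    });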

BI Dashboard

  • A Streamlit dashboard connects to the curated Athena tables to visualize:
    • Claims summary and customer insights
    • Policy insights by type and premium
  • This provides fast, interactive reporting without provisioning dedicated BI infrastructure.
  • Alternatively, BI tools like Amazon QuickSight or Power BI can be integrated using Athena as a data source.

Cleanup

To delete all resources and avoid ongoing charges:

cdk destroy

Project Structure

├── bin/
│   └── cdk-insurance-app.ts          # CDK entry point
├── lib/
│   └── cdk-insurance-app-stack.ts    # CDK stack definition
├── glue/
│   ├── insurance_etl_incremental.py  # Glue ETL incremental extract script
│   ├── raw_to_processed.py           # Glue ETL transform script
│   └── processed_to_curated.py       # Glue ETL curated transform script
├── lambda/
│   └── notify_failure.py             # Lambda function to handle failure notifications
├── dashboard/
│   └── app.py                        # Streamlit dashboard code
├── package.json
├── cdk.json
└── README.md                         # This file

Notes

  • The RDS instance is publicly accessible for demo convenience. For production, use private subnets and secure connectivity.
  • Secrets like database passwords are stored in plain text here; use AWS Secrets Manager or Parameter Store in real deployments (a sketch follows this list).
  • Customize Glue scripts and schema definitions based on your insurance data specifics.
  • This project provides a foundation for modernizing legacy insurance data pipelines with a scalable, governed, and serverless data lake architecture.
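
For example, CDK can generate the MySQL password and keep it in Secrets Manager instead of hard-coding it. This is a sketch only: vpc refers to the VPC from the earlier excerpt, and the instance size is arbitrary.

    // Illustrative excerpt from the stack constructor
    import * as ec2 from 'aws-cdk-lib/aws-ec2';
    import * as rds from 'aws-cdk-lib/aws-rds';

    const db = new rds.DatabaseInstance(this, 'InsuranceDb', {
      engine: rds.DatabaseInstanceEngine.mysql({ version: rds.MysqlEngineVersion.VER_8_0 }),
      vpc,
      instanceType: ec2.InstanceType.of(ec2.InstanceClass.T3, ec2.InstanceSize.MICRO),
      // Generates a random password and stores it as a Secrets Manager secret
      credentials: rds.Credentials.fromGeneratedSecret('admin'),
      vpcSubnets: { subnetType: ec2.SubnetType.PUBLIC },
      publiclyAccessible: true,   // matches the demo setup; avoid in production
    });
    // Downstream code can then read db.secret instead of a hard-coded password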

License

This project is licensed under the MIT License.

For a detailed walkthrough, read the full project blog here.