Dimensional Transformers: NC Final Project

Overview

This project is a data engineering solution that extracts, transforms, and loads (ETL) data from a PostgreSQL database hosted on AWS RDS into a data warehouse. This project automates data workflows using AWS services and showcases skills in Python, database modeling, CI/CD automation, good operational practices and Agile methodologies. It adheres to the Minimum Viable Product (MVP) requirements, providing a solid foundation for extensions and enhancements.

Features

Minimum Viable Product (MVP)

ETL Workflow

1. Data Extraction:

Python application (Lambda function) extracts data from a source PostgreSQL database hosted on AWS RDS.
Extracted data is saved in Parquet format to an immutable "ingestion" S3 bucket.
Automation through AWS EventBridge ensures regular scheduling.
Logging and monitoring are handled via CloudWatch, with email alerts for failures.

2. Data Transformation:

Python application (Lambda function) processes data from the "ingestion" bucket, transforming it to align with the target warehouse schema.
Trasformed data is saved in Parquet format to a "processed" S3 bucket.
Includes logging, monitoring, and error alerting.

3. Data Loading:

Python application (Lambda function) loads the transformed data from the "processed" S3 bucket into a PostgreSQL data warehouse using SQLAlchemy
Fact and dimension tables are created and updated as per schema requirements.
Includes logging, monitoring, and error alerting.

Visualization

Tableau dashboard connects directly to the data warehouse to provide actionable insights.

Security & Quality:

Credentials managed using AWS Secrets Manager.
Python applications are PEP8-compliant, tested with Pytest, and scanned for vulnerabilities using Bandit and Safety.
Test coverage exceeds 80%.

Automation:

Infrastructure deployed using Terraform.
CI/CD pipelines configured with GitHub Actions for seamless deployments and testing.

Project Structure

Infrastructure

AWS Services:

S3 for data storage.
CloudWatch for logging and alerting.
Lambda for Python-based ETL processes.
PostgreSQL RDS for the source database and the warehouse.
Secrets Manager for secure credential management.

Terraform:

Automates creation of Lambda functions, S3 buckets, IAM roles, and other resources.

Data Buckets

Ingestion S3 Bucket:
- Stores raw data ingested from the totesys database.
- Organized and immutable to maintain data integrity
Processed S3 Bucket:
- Stores transformed data conforming to the data warehouse schema.

Data Warehouse Schema

Fact Table: fact_sales_order
Dimension Tables: dim_staff, dim_location, dim_design, dim_date, dim_currency, dim_counterparty

CI/CD Pipeline

GitHub Actions Workflow

Run Tests: Installs dependencies, lints, and executes tests.
Deploy Infrastructure: Configures Terraform to deploy necessary AWS resources.
Reset Secrets: Sets up secrets and triggers the first ETL job.
Deploy Lambda: Updates Lambda functions for ETL processes.

Setup Steps

Clone the repository
Configure AWS credentials: Add credentials as GitHib Secrets for CI/CD automation.
Create Virtual Environment make create-environment
Install Python dependencies: make dev-setup
Run tests and checks: make run-checks
Perform the initial data extraction and secret setup: make run-reset-secrets
Deploy infrastructure: terraform init terraform apply -auto-approve
Trigger ETL jobs: Jobs run automatically based on scheduled events or S3 bucket triggers.

Testing

To run the full test suite: make run-checks

Future Enhancements

Expand the warehouse to include all totesys tables.
Ingest from external APIs for supplementary data.

Name		Name	Last commit message	Last commit date
Latest commit History 99 Commits
.github/workflows		.github/workflows
TEST		TEST
terraform		terraform
.coverage		.coverage
.gitignore		.gitignore
Makefile		Makefile
README.md		README.md
data_extraction.py		data_extraction.py
data_loading.py		data_loading.py
data_transformation.py		data_transformation.py
process-lambda.zip		process-lambda.zip
requirements.txt		requirements.txt
reset_secrets.py		reset_secrets.py
sandbox-2.py		sandbox-2.py
test_file.txt		test_file.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Dimensional Transformers: NC Final Project

Overview

Features

Minimum Viable Product (MVP)

ETL Workflow

1. Data Extraction:

2. Data Transformation:

3. Data Loading:

Visualization

Security & Quality:

Automation:

Project Structure

Infrastructure

Data Warehouse Schema

CI/CD Pipeline

GitHub Actions Workflow

Setup Steps

Testing

Future Enhancements

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 5

Uh oh!

Languages

hamza8599/nc-final-project

Folders and files

Latest commit

History

Repository files navigation

Dimensional Transformers: NC Final Project

Overview

Features

Minimum Viable Product (MVP)

ETL Workflow

1. Data Extraction:

2. Data Transformation:

3. Data Loading:

Visualization

Security & Quality:

Automation:

Project Structure

Infrastructure

Data Warehouse Schema

CI/CD Pipeline

GitHub Actions Workflow

Setup Steps

Testing

Future Enhancements

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 5

Uh oh!

Languages

Packages