This project is a data engineering solution that extracts, transforms, and loads (ETL) data from a PostgreSQL database hosted on AWS RDS into a data warehouse. This project automates data workflows using AWS services and showcases skills in Python, database modeling, CI/CD automation, good operational practices and Agile methodologies. It adheres to the Minimum Viable Product (MVP) requirements, providing a solid foundation for extensions and enhancements.
- Python application (Lambda function) extracts data from a source PostgreSQL database hosted on AWS RDS.
- Extracted data is saved in Parquet format to an immutable "ingestion" S3 bucket.
- Automation through AWS EventBridge ensures regular scheduling.
- Logging and monitoring are handled via CloudWatch, with email alerts for failures.
- Python application (Lambda function) processes data from the "ingestion" bucket, transforming it to align with the target warehouse schema.
- Trasformed data is saved in Parquet format to a "processed" S3 bucket.
- Includes logging, monitoring, and error alerting.
- Python application (Lambda function) loads the transformed data from the "processed" S3 bucket into a PostgreSQL data warehouse using SQLAlchemy
- Fact and dimension tables are created and updated as per schema requirements.
- Includes logging, monitoring, and error alerting.
- Tableau dashboard connects directly to the data warehouse to provide actionable insights.
- Credentials managed using AWS Secrets Manager.
- Python applications are PEP8-compliant, tested with Pytest, and scanned for vulnerabilities using Bandit and Safety.
- Test coverage exceeds 80%.
- Infrastructure deployed using Terraform.
- CI/CD pipelines configured with GitHub Actions for seamless deployments and testing.
AWS Services:
- S3 for data storage.
- CloudWatch for logging and alerting.
- Lambda for Python-based ETL processes.
- PostgreSQL RDS for the source database and the warehouse.
- Secrets Manager for secure credential management.
Terraform:
- Automates creation of Lambda functions, S3 buckets, IAM roles, and other resources.
Data Buckets
- Ingestion S3 Bucket:
- Stores raw data ingested from the
totesys
database. - Organized and immutable to maintain data integrity
- Stores raw data ingested from the
- Processed S3 Bucket:
- Stores transformed data conforming to the data warehouse schema.
- Fact Table:
fact_sales_order
- Dimension Tables:
dim_staff
,dim_location
,dim_design
,dim_date
,dim_currency
,dim_counterparty
- Run Tests: Installs dependencies, lints, and executes tests.
- Deploy Infrastructure: Configures Terraform to deploy necessary AWS resources.
- Reset Secrets: Sets up secrets and triggers the first ETL job.
- Deploy Lambda: Updates Lambda functions for ETL processes.
- Clone the repository
- Configure AWS credentials: Add credentials as GitHib Secrets for CI/CD automation.
- Create Virtual Environment
make create-environment
- Install Python dependencies:
make dev-setup
- Run tests and checks:
make run-checks
- Perform the initial data extraction and secret setup:
make run-reset-secrets
- Deploy infrastructure:
terraform init
terraform apply -auto-approve
- Trigger ETL jobs: Jobs run automatically based on scheduled events or S3 bucket triggers.
To run the full test suite:
make run-checks
- Expand the warehouse to include all
totesys
tables. - Ingest from external APIs for supplementary data.