Data Pipeline Project - Terrific Totes

Overview

Terrific Totes is a fictional company that operates an OLTP database and a data warehouse used for reporting and visualizations. The goal of this project is to develop applications that Extract, Transform, and Load (ETL) data from the OLTP database into a data lake and warehouse hosted in AWS. This solution is reliable, scalable, and fully managed using Infrastructure-as-Code.

Features

  • Automated Data Processing
    • EventBridge Scheduler: Triggers data ingestion every 5 minutes.
    • Step Functions State Machine: Orchestrates the workflow, passing JSON payloads between steps.
    • Lambda Functions & Layers:
      • One Python application ingests all tables from the totesys database. It implements incremental refresh so that only new and updated data is processed, conserving resources (a sketch of this pattern appears after this list).
      • Another Python application remodels the data into a predefined schema and stores it in the "processed" S3 bucket in Parquet format.
      • A third Python application loads the transformed data into a data warehouse, with new and updated data reflected in the target database within 30 minutes. Fact tables retain a full history of how the data has evolved.
  • CloudWatch Monitoring & SNS Alerts
    • Logs errors, tracks performance, and sends critical failure notifications via SNS.
  • Secure Data Management
    • IAM Policies: Implements the principle of least privilege.
    • Secrets Manager: Manages database credentials securely.
  • Data Storage in S3
    • Raw Data Bucket: Stores ingested data in its original form and acts as a data lake.
    • Processed Data Bucket: Holds transformed data in an immutable, well-structured format.
  • Code Quality & Security
    • Python code is PEP8 compliant, thoroughly tested, and checked for security vulnerabilities using pip-audit and bandit.
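
The incremental-refresh pattern used by the ingestion application can be sketched in a few lines of Python. This is an illustrative sketch only, not the project's actual code: the bucket name, table list, state-file layout, and the assumption that each totesys table exposes a last_updated column are hypothetical stand-ins.

    import json
    from datetime import datetime, timezone

    import boto3
    import pg8000.native

    s3 = boto3.client("s3")
    RAW_BUCKET = "terrific-totes-raw-data"  # hypothetical bucket name

    def get_last_ingested_at(table):
        """Read the high-water mark stored alongside the raw data."""
        try:
            obj = s3.get_object(Bucket=RAW_BUCKET, Key=f"state/{table}.txt")
            return obj["Body"].read().decode()
        except s3.exceptions.NoSuchKey:
            return "1970-01-01 00:00:00"  # first run: ingest everything

    def ingest_table(conn, table):
        """Fetch only rows updated since the last run and write them to S3."""
        since = get_last_ingested_at(table)
        rows = conn.run(
            f"SELECT * FROM {table} WHERE last_updated > :since", since=since
        )
        if not rows:
            return  # nothing new, nothing written
        now = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
        s3.put_object(
            Bucket=RAW_BUCKET,
            Key=f"{table}/{now}.json",
            Body=json.dumps(rows, default=str),
        )
        s3.put_object(Bucket=RAW_BUCKET, Key=f"state/{table}.txt", Body=now)

    def lambda_handler(event, context):
        # Credentials come from Secrets Manager in the real pipeline.
        conn = pg8000.native.Connection(
            user="...", password="...", host="...", database="totesys"
        )
        for table in ["sales_order", "staff", "currency"]:
            ingest_table(conn, table)

In the actual pipeline, a failure at any step is logged to CloudWatch and raised as a critical notification via SNS rather than failing silently.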

Architecture

Architecture diagrams: see architecture.png and ETL_architecture.png in the repository root.

Tech Stack

  • Version Control: GitHub
  • Infrastructure & CI/CD: Terraform, GitHub Actions, Makefile
  • Programming & Libraries: Python, boto3, pandas, numpy, pg8000, freezegun, coverage, bandit, black
  • AWS Services: CloudWatch, Lambda, EventBridge, Step Functions, SNS, S3, Secrets Manager
  • Development Tools: Visual Studio Code, Tableau

Installation & Setup

  1. Create an AWS Account and configure AWS credentials. The AWS user will need administrative privileges to set up the following:
    • S3 Buckets
    • Step Functions
    • Lambda Functions
    • CloudWatch
    • Secrets Manager
    • SNS
  2. Ensure you have Python 3.12 installed
    • Check your version:
      python --version
    • If needed, install it from python.org.
  3. Fork and Clone the Repository
    (Approx. 3.8 MiB of disk space is needed to clone and set up the repository.)
    git clone https://github.com/ys218279/team-09-sperrins.git
    cd team-09-sperrins
  4. Set Up Databases
  5. Store Credentials in GitHub secrets
    • Store AWS credentials, database credentials, and SNS email in GitHub secrets.
  6. Install Terraform (if not already installed)
    brew tap hashicorp/tap
    brew install hashicorp/tap/terraform
  7. Set Up Terraform Backend
    • Create an S3 bucket to store Terraform state files.
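    • A minimal boto3 sketch of this step is shown below; the bucket name and region are hypothetical placeholders. Versioning is optional but lets you recover earlier state files.

      import boto3

      # Hypothetical name and region -- substitute your own.
      STATE_BUCKET = "team-09-sperrins-tf-state"
      REGION = "eu-west-2"

      s3 = boto3.client("s3", region_name=REGION)

      # Outside us-east-1, S3 requires an explicit LocationConstraint.
      s3.create_bucket(
          Bucket=STATE_BUCKET,
          CreateBucketConfiguration={"LocationConstraint": REGION},
      )

      # Versioning lets you roll back to a previous Terraform state file.
      s3.put_bucket_versioning(
          Bucket=STATE_BUCKET,
          VersioningConfiguration={"Status": "Enabled"},
      )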
  8. Run Makefile Commands
    make all  # Creates virtual environment, installs dependencies, formats code, runs security and test coverage checks
  9. Deploy Infrastructure
    cd terraform
    terraform init
    terraform plan
    terraform apply
    • During terraform apply, you will be prompted to enter credentials in the terminal.
  10. (Alternative Method) Configure Sensitive Variables
    • Create a sensitive.tfvars file with database credentials and the SNS email (see the example below).
  11. (Alternative Method) Deploy Infrastructure Using Variables
    cd terraform
    terraform init
    terraform plan -var-file=sensitive.tfvars
    terraform apply -var-file=sensitive.tfvars
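    • A sensitive.tfvars file might look like the following sketch; the variable names here are hypothetical and must match the variables declared in the project's Terraform configuration:

      db_host     = "your-totesys-host.example.com"
      db_name     = "totesys"
      db_user     = "your_user"
      db_password = "your_password"
      sns_email   = "alerts@example.com"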

(If deploying via GitHub Actions, this step can be automated.)

Folder Structure

/team-09-sperrins
|-- .github/workflows/      # Automation files
|   |-- deploy.yml          # Script for CI/CD pipeline
|-- .layer/python/          # Lambda layer dependency files
|   |-- requirements.txt    # Lambda layer dependencies
|-- src/                    # Source code for the pipeline
|   |-- src_ingestion/      # Scripts for data extraction
|   |-- src_transform/      # Scripts for data transformation
|   |-- src_load/           # Scripts for data loading
|-- logs/                   # Logging output
|-- terraform/              # Terraform files
|-- test/                   # Unit and integration tests
|-- architecture.png        # Architecture diagram 
|-- ETL_architecture.png    # ETL Architecture diagram 
|-- mvp.png                 # Project brief diagram
|-- Makefile                # Scripts for the env/installs/tests
|-- requirements.txt        # Project dependencies
|-- README.md               # Documentation for final project
|-- READMEBRIEF.md          # Documentation for project brief

Contact

For any questions or issues, reach out via GitHub Issues.

About

This repository contains the group project I took part in during the Northcoders Data Engineering Bootcamp. We built an end-to-end data pipeline using AWS (S3, CloudWatch, EventBridge, Lambda), Python, and Terraform.
