This is a data engineering project that implements an end-to-end ETL (extract, transform, load) pipeline. It extracts data from a database, transforms it into a star schema, and finally loads it into an AWS-hosted data warehouse.

A detailed description of the task can be found in `TASK.md`.
- Python version 3.8 or higher
- Terraform
- Relevant credentials for the database and warehouse
- Fork this repo and clone it to your device.
- In the terminal, run `make requirements`. This creates a virtual environment and installs all the requirements inside it.
- Run `source venv/bin/activate` to activate the virtual environment.
- Have your credentials in a JSON format like so:

  ```json
  {"user": USER, "password": PASSWORD, "database": DATABASE, "host": HOST, "port": PORT}
  ```

  If you want to run this via Terraform, store them in a local `vars.tf` file. If you want to run this on `git push`, store them as a secret on GitHub.
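As an illustration, the credentials file can be read and validated with a small helper before being handed to the database driver. The file name, helper name, and the commented-out `pg8000` call below are assumptions for this sketch, not part of the repo:

```python
import json

def load_credentials(path):
    """Read database credentials from a JSON file using the key layout
    shown above: user, password, database, host, port."""
    with open(path) as f:
        creds = json.load(f)
    missing = {"user", "password", "database", "host", "port"} - creds.keys()
    if missing:
        raise KeyError(f"credentials file is missing keys: {sorted(missing)}")
    return creds

# creds = load_credentials("db_credentials.json")
# conn = pg8000.native.Connection(**creds)  # pg8000 accepts these keyword names
```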
If you are deploying via Terraform, in your terminal:

- Run `cd terraform` to move to the terraform directory.
- Run `terraform init` to initialise the Terraform configuration.
- Run `terraform plan` to review the planned changes.
- If you are happy with the plan, run `terraform apply` and type `yes` when prompted to continue.
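For reference, a local `vars.tf` holding the credentials might look like the sketch below. The variable names are assumptions for illustration and should match whatever the Terraform configuration in this repo actually expects:

```hcl
# vars.tf (hypothetical, local only -- do not commit)
variable "db_user" {
  default = "USER"
}

variable "db_password" {
  default   = "PASSWORD"
  sensitive = true
}

variable "db_host" {
  default = "HOST"
}

variable "db_port" {
  default = 5432
}

variable "db_database" {
  default = "DATABASE"
}
```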
If you are deploying via `git push`, run `git push` in your terminal (you will need to commit an edit first so there is something to push).
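As an illustration of the GitHub-secret route, a workflow could expose the stored credentials to the deploy step roughly as below. The workflow file, secret name, and `make deploy` target are assumptions, not this repo's actual configuration:

```yaml
# .github/workflows/deploy.yml (hypothetical)
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Deploy pipeline
        env:
          DB_CREDENTIALS: ${{ secrets.DB_CREDENTIALS }}  # the JSON blob shown above
        run: make deploy  # assumed make target
```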
In the terminal:

- Run `make dev-setup` to install the necessary tools for our tests.
- Run `make run-checks` to run the tests.
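For a sense of what `make run-checks` exercises, here is a minimal pytest-style unit test. `build_s3_key` is a hypothetical helper invented for this sketch, not a function in the repo:

```python
from datetime import datetime, timezone

def build_s3_key(table_name, extracted_at):
    """Build a partitioned S3 key for a raw extract, e.g.
    raw/sales_order/2024/05/01/sales_order-120000.csv"""
    ts = extracted_at.astimezone(timezone.utc)
    return f"raw/{table_name}/{ts:%Y/%m/%d}/{table_name}-{ts:%H%M%S}.csv"

def test_build_s3_key():
    ts = datetime(2024, 5, 1, 12, 0, 0, tzinfo=timezone.utc)
    expected = "raw/sales_order/2024/05/01/sales_order-120000.csv"
    assert build_s3_key("sales_order", ts) == expected
```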
```
pytest
boto3
moto
botocore
pg8000
pytest-cov
coverage[toml]==7.6.4
pandas
awswrangler
```
- Data Extraction: Uses a Python application to automatically ingest data from the totesys operational database into an S3 bucket in AWS.
- Data Transformation: Uses a Python application to process raw data to conform to a star schema for the data warehouse. The transformed data is stored in parquet format in a second S3 bucket.
- Data Loading: Loads transformed data into an AWS-hosted data warehouse, populating dimensions and fact tables.
- Automation: End-to-end pipeline triggered by completion of a data job.
- Monitoring and Alerts: Logs to CloudWatch and sends SNS email alerts in case of failures.
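The transformation step above can be sketched in plain Python: raw rows are split into a dimension table and a fact table linked by surrogate keys. The table and column names here are illustrative assumptions, not the actual totesys schema; the real pipeline would express the same idea with pandas and write the results out as parquet:

```python
def to_star_schema(raw_orders):
    """Split raw order rows into a currency dimension and a sales fact table.

    Each distinct currency code is assigned a surrogate currency_id; fact
    rows reference the dimension via that id instead of the raw code.
    """
    dim_currency = {}   # currency_code -> surrogate currency_id
    fact_rows = []
    for row in raw_orders:
        code = row["currency_code"]
        if code not in dim_currency:
            dim_currency[code] = len(dim_currency) + 1
        fact_rows.append({
            "order_id": row["order_id"],
            "currency_id": dim_currency[code],
            "amount": row["amount"],
        })
    dim_rows = [
        {"currency_id": cid, "currency_code": code}
        for code, cid in dim_currency.items()
    ]
    return dim_rows, fact_rows
```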