This is a data engineering project that implements an end-to-end ETL (extract, transform, load) pipeline. It extracts data from a database, transforms it into a star schema, and finally loads it into an AWS-hosted data warehouse.

A detailed description of the task can be found in `TASK.md`.
- Python version 3.8 or higher
- Terraform
- Relevant credentials for the database and warehouse
- Fork this repo and clone it to your device.
- In the terminal, run `make requirements`. This creates a virtual environment and installs all the requirements inside it.
- Run `source venv/bin/activate` to activate the virtual environment.
- Have your credentials in a JSON format like so:

  ```json
  {"user": USER, "password": PASSWORD, "database": DATABASE, "host": HOST, "port": PORT}
  ```

  If you want to run this via Terraform, store them in a local `vars.tf` file. If you want to run this on `git push`, store them as a secret on GitHub.
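As an illustration, the credentials file can be read and validated with a small helper before being handed to the database driver. The file name, helper name, and the commented-out `pg8000` call below are assumptions for this sketch, not part of the repo:

```python
import json

def load_credentials(path):
    """Read database credentials from a JSON file using the key layout
    shown above: user, password, database, host, port."""
    with open(path) as f:
        creds = json.load(f)
    missing = {"user", "password", "database", "host", "port"} - creds.keys()
    if missing:
        raise KeyError(f"credentials file is missing keys: {sorted(missing)}")
    return creds

# creds = load_credentials("db_credentials.json")
# conn = pg8000.native.Connection(**creds)  # pg8000 accepts these keyword names
```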
If you are deploying via Terraform, in your terminal:

- Run `cd terraform` to move to the terraform directory.
- Run `terraform init` to initialise the Terraform configuration.
- Run `terraform plan` to review the planned changes.
- If you are happy with the plan, run `terraform apply` and type `yes` when prompted to continue.
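For reference, a local `vars.tf` holding the credentials might look like the sketch below. The variable names are assumptions for illustration and should match whatever the Terraform configuration in this repo actually expects:

```hcl
# vars.tf (hypothetical, local only -- do not commit)
variable "db_user" {
  default = "USER"
}

variable "db_password" {
  default   = "PASSWORD"
  sensitive = true
}

variable "db_host" {
  default = "HOST"
}

variable "db_port" {
  default = 5432
}

variable "db_database" {
  default = "DATABASE"
}
```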
If you are deploying via `git push`, run `git push` in your terminal (you will need to commit an edit first so there is something to push).
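As an illustration of the GitHub-secret route, a workflow could expose the stored credentials to the deploy step roughly as below. The workflow file, secret name, and `make deploy` target are assumptions, not this repo's actual configuration:

```yaml
# .github/workflows/deploy.yml (hypothetical)
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Deploy pipeline
        env:
          DB_CREDENTIALS: ${{ secrets.DB_CREDENTIALS }}  # the JSON blob shown above
        run: make deploy  # assumed make target
```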
In the terminal:

- Run `make dev-setup` to install the necessary tools for our tests.
- Run `make run-checks` to run the tests.
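For a sense of what `make run-checks` exercises, here is a minimal pytest-style unit test. `build_s3_key` is a hypothetical helper invented for this sketch, not a function in the repo:

```python
from datetime import datetime, timezone

def build_s3_key(table_name, extracted_at):
    """Build a partitioned S3 key for a raw extract, e.g.
    raw/sales_order/2024/05/01/sales_order-120000.csv"""
    ts = extracted_at.astimezone(timezone.utc)
    return f"raw/{table_name}/{ts:%Y/%m/%d}/{table_name}-{ts:%H%M%S}.csv"

def test_build_s3_key():
    ts = datetime(2024, 5, 1, 12, 0, 0, tzinfo=timezone.utc)
    expected = "raw/sales_order/2024/05/01/sales_order-120000.csv"
    assert build_s3_key("sales_order", ts) == expected
```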
```
pytest
boto3
moto
botocore
pg8000
pytest-cov
coverage[toml]==7.6.4
pandas
awswrangler
```
- Data Extraction: Uses a Python application to automatically ingest data from the totesys operational database into an S3 bucket in AWS.
- Data Transformation: Uses a Python application to process raw data to conform to a star schema for the data warehouse. The transformed data is stored in parquet format in a second S3 bucket.
- Data Loading: Loads transformed data into an AWS-hosted data warehouse, populating dimensions and fact tables.
- Automation: End-to-end pipeline triggered by completion of a data job.
- Monitoring and Alerts: Logs to CloudWatch and sends SNS email alerts in case of failures.
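The transformation step above can be sketched in plain Python: raw rows are split into a dimension table and a fact table linked by surrogate keys. The table and column names here are illustrative assumptions, not the actual totesys schema; the real pipeline would express the same idea with pandas and write the results out as parquet:

```python
def to_star_schema(raw_orders):
    """Split raw order rows into a currency dimension and a sales fact table.

    Each distinct currency code is assigned a surrogate currency_id; fact
    rows reference the dimension via that id instead of the raw code.
    """
    dim_currency = {}   # currency_code -> surrogate currency_id
    fact_rows = []
    for row in raw_orders:
        code = row["currency_code"]
        if code not in dim_currency:
            dim_currency[code] = len(dim_currency) + 1
        fact_rows.append({
            "order_id": row["order_id"],
            "currency_id": dim_currency[code],
            "amount": row["amount"],
        })
    dim_rows = [
        {"currency_id": cid, "currency_code": code}
        for code, cid in dim_currency.items()
    ]
    return dim_rows, fact_rows
```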