Lambda Legends Data Processor

This data engineering project implements an end-to-end ETL (extract, transform, load) pipeline. It extracts data from an operational database, transforms it into a star schema, and loads it into an AWS data warehouse. A detailed description of the task can be found in TASK.md.

Prerequisites:

  • Python version 3.8 or higher
  • Terraform
  • Relevant credentials for the database and warehouse

Setup:

  1. Fork this repo and clone it to your device.
  2. In the terminal, run `make requirements`. This creates a virtual environment and installs all the requirements inside it.
  3. Run `source venv/bin/activate` to activate the virtual environment.
  4. Store your credentials in a JSON file, like so:

     ```json
     {
       "user": "<USER>",
       "password": "<PASSWORD>",
       "database": "<DATABASE>",
       "host": "<HOST>",
       "port": "<PORT>"
     }
     ```
 

If you want to deploy via Terraform, store the credentials in a local `vars.tf` file. If you want to deploy on `git push`, store them as a secret on GitHub.
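As a sketch of how the pipeline might load and validate such a credentials file before connecting with pg8000, here is a minimal helper (the `load_credentials` function is illustrative and not part of this repo):

```python
import json

# Keys the credentials file is expected to contain (per the example above).
REQUIRED_KEYS = {"user", "password", "database", "host", "port"}


def load_credentials(path):
    """Load database credentials from a JSON file and check required keys."""
    with open(path) as f:
        creds = json.load(f)
    missing = REQUIRED_KEYS - creds.keys()
    if missing:
        raise KeyError(f"credentials file is missing keys: {sorted(missing)}")
    return creds


# The returned dict could then be passed straight to pg8000, e.g.:
#   conn = pg8000.connect(**load_credentials("credentials.json"))
```

Validating the keys up front gives a clear error message before any connection attempt is made.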

Deployment:

If you are deploying via terraform, in your terminal:

  1. Run `cd terraform` to move into the terraform directory.
  2. Run `terraform init` to initialise Terraform.
  3. Run `terraform plan` and review the plan.
  4. If you are happy with the plan, run `terraform apply` and type `yes` to confirm.

If you are deploying via `git push`, run `git push` in your terminal (you will need a new commit for the push to go through).

Testing:

In the terminal:

  1. Run `make dev-setup` to install the tools needed for the tests.
  2. Run `make run-checks` to run the tests.
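The tests are written with pytest, which discovers plain functions named `test_*`. A minimal pytest-style test of a hypothetical transformation helper might look like this (both `split_timestamp` and the test are illustrative, not the repo's actual code):

```python
import datetime


def split_timestamp(ts):
    """Split a datetime into the separate date and time strings that a
    date-dimension row might hold. Illustrative only."""
    return ts.date().isoformat(), ts.time().isoformat()


# pytest picks this up automatically; no class or test runner is needed.
def test_split_timestamp():
    date_part, time_part = split_timestamp(datetime.datetime(2024, 11, 5, 9, 30, 0))
    assert date_part == "2024-11-05"
    assert time_part == "09:30:00"
```

Tests in this style run under `pytest`, and `pytest-cov` (listed below) reports how much of the pipeline code they cover.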

Technologies used:

  • pytest
  • boto3
  • moto
  • botocore
  • pg8000
  • pytest-cov
  • coverage[toml]==7.6.4
  • pandas
  • awswrangler

Current features:

  • Data Extraction: Uses a Python application to automatically ingest data from the totesys operational database into an S3 bucket in AWS.
  • Data Transformation: Uses a Python application to process raw data to conform to a star schema for the data warehouse. The transformed data is stored in parquet format in a second S3 bucket.
  • Data Loading: Loads transformed data into an AWS-hosted data warehouse, populating dimensions and fact tables.
  • Automation: End-to-end pipeline triggered by completion of a data job.
  • Monitoring and Alerts: Logs to CloudWatch and sends SNS email alerts in case of failures.
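As an illustration of the transformation step, here is a pure-Python sketch of splitting denormalised rows into a dimension table and a fact table. The column names (`staff_id`, `units_sold`, etc.) are hypothetical; the real pipeline builds several dimensions with pandas and writes parquet via awswrangler:

```python
def to_star_schema(raw_rows):
    """Split denormalised sales rows into one dimension and one fact table.

    Column names are illustrative; the actual project's schema differs.
    """
    dim_staff = {}  # keyed by staff_id, so repeated staff collapse to one row
    fact_sales = []
    for row in raw_rows:
        dim_staff[row["staff_id"]] = {
            "staff_id": row["staff_id"],
            "staff_name": row["staff_name"],
        }
        fact_sales.append({
            "sales_record_id": row["sales_record_id"],
            "staff_id": row["staff_id"],  # foreign key into dim_staff
            "units_sold": row["units_sold"],
        })
    return list(dim_staff.values()), fact_sales
```

Deduplicating descriptive attributes into dimension tables while keeping one fact row per event is the core idea of the star schema the pipeline targets.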
