- Overview
- Data Visualization
- Data Architecture
- Prerequisites
- How to Run This Project
- Lessons Learned
- Contact
## Overview

This project is an ELT data pipeline built on a modern data stack. It retrieves data from the Marvel API and ingests it into AWS RDS PostgreSQL with a Python script that is orchestrated and scheduled by Prefect, then transforms the loaded data with dbt to build a dimensional model for analytics. Additionally, it employs a GitHub Actions CI/CD workflow that rebuilds the dbt models each time changes are pushed to the main branch of the Git repository.
I chose the following tools because they are a good combination for learning ELT, workflow orchestration, and CI/CD, and together they cover the data extraction, automation, transformation, modeling, and deployment needs of a data pipeline project.
## Data Architecture

- Data Source: Marvel API.
- Data Ingestion: Python script for extracting data from the Marvel API and loading it into AWS RDS PostgreSQL (a sketch follows this list).
- Database: AWS RDS PostgreSQL for structured data storage.
- Workflow Orchestration: Prefect for automation, scheduling, and error handling.
- Data Transformation: dbt (data build tool) for transforming and modeling data.
- Data Modeling: Creation of a dimensional model for analytics.
- CI/CD: GitHub Actions to automate testing and deployment of data models on changes to the main branch.
- Data Analytics and Visualization: Qlik Sense for interactive dashboards, data exploration, and reporting.
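The extract-and-load step is a plain Python script wrapped in Prefect tasks and a flow. Below is a minimal sketch of what that might look like; the function names, the `characters` endpoint, the environment-variable names (`MARVEL_PUBLIC_KEY`, `MARVEL_PRIVATE_KEY`, `RDS_CONNECTION_STRING`), and the target table are illustrative assumptions rather than the exact contents of `extract_load.py`.

```python
import hashlib
import os
import time

import pandas as pd
import requests
from dotenv import load_dotenv
from prefect import flow, task
from sqlalchemy import create_engine

load_dotenv()  # read the Marvel and RDS credentials from the .env file

MARVEL_BASE_URL = "https://gateway.marvel.com/v1/public"


@task(retries=3, retry_delay_seconds=10)
def extract_characters(limit: int = 100) -> pd.DataFrame:
    """Pull one page of characters from the Marvel API."""
    ts = str(int(time.time()))
    public_key = os.environ["MARVEL_PUBLIC_KEY"]
    private_key = os.environ["MARVEL_PRIVATE_KEY"]
    params = {
        "ts": ts,
        "apikey": public_key,
        # Marvel signs server-side calls with md5(ts + private_key + public_key)
        "hash": hashlib.md5((ts + private_key + public_key).encode()).hexdigest(),
        "limit": limit,
    }
    response = requests.get(f"{MARVEL_BASE_URL}/characters", params=params, timeout=30)
    response.raise_for_status()
    df = pd.json_normalize(response.json()["data"]["results"])
    # keep a few scalar columns for the raw layer; dbt models the rest downstream
    return df[["id", "name", "description", "modified"]]


@task
def load_to_postgres(df: pd.DataFrame, table: str = "raw_characters") -> None:
    """Append the raw records to AWS RDS PostgreSQL for dbt to transform in place."""
    engine = create_engine(os.environ["RDS_CONNECTION_STRING"])
    df.to_sql(table, engine, if_exists="append", index=False)


@flow(name="marvel-extract-load")
def extract_load() -> None:
    load_to_postgres(extract_characters())


if __name__ == "__main__":
    extract_load()
```

Wrapping the two steps in `@task` gives retries and logging for free, and the `@flow` is what Prefect schedules and monitors.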
## Prerequisites

- Marvel API keys
- Python
- Prefect CLI
- AWS account with an RDS PostgreSQL instance
- dbt Core
- GitHub Actions
- Docker
- Qlik Sense account
## How to Run This Project

To run this project step by step:

- Run `pip install pipenv` to install Pipenv, which manages the project's virtual environment.
- Create the environment and install the dependencies by running `pipenv install -r requirements.txt`.
- Enter your Marvel API credentials (public and private keys) in the `.env` file.
- Run the `extract_load.py` file to start the EL (Extract & Load) phase with Prefect. Please refer to this tutorial on how to configure Prefect (a small scheduling sketch follows these steps).
- Install dbt Core on your local machine. Please refer to this tutorial to install and run dbt Core.
- Set up GitHub Actions by creating the `.github/workflows` directory in the main branch and adding the `workflow.yml` file to that directory.
- Connect Qlik Sense to the AWS RDS PostgreSQL database and create your visualizations.
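For the scheduling side of the Prefect setup, one lightweight option is to serve the flow on a cron schedule. This is a minimal sketch, assuming Prefect 2.12 or newer (where flows expose a `.serve()` method) and the `extract_load` flow from the sketch above; the deployment name and schedule are placeholders.

```python
from extract_load import extract_load  # the Prefect flow defined in extract_load.py

if __name__ == "__main__":
    # Registers a deployment and keeps a local process polling for scheduled runs,
    # triggering the extract-and-load flow every day at 06:00.
    extract_load.serve(name="marvel-extract-load-daily", cron="0 6 * * *")
```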
## Lessons Learned

One of the key takeaways from this project is the importance of scalability and flexibility in designing data pipelines. By utilizing cloud-based solutions like AWS RDS PostgreSQL, the project ensures scalability to handle varying data volumes. Additionally, the ability to adapt to changing requirements, data sources, and business needs is crucial. The choice of modular tools and technologies facilitates easy modifications and enhancements to the pipeline, allowing it to evolve alongside the organization's requirements.
## Contact

You can reach me on LinkedIn to learn more about this project, and I'm open to collaboration.