2024-06-10-15-48-38 #27

Merged: 14 commits, Jun 12, 2024

6 changes: 3 additions & 3 deletions .gitignore
@@ -28,9 +28,6 @@ __pycache__
# policy
trust-policy.json

# data
data*

# logs
logs/*
*.log
@@ -80,3 +77,6 @@ override.tf.json
terraform.rc

dashboard_files/

data/*.csv
1 change: 1 addition & 0 deletions .tool-versions
@@ -0,0 +1 @@
python 3.11.1
22 changes: 6 additions & 16 deletions Makefile
@@ -2,15 +2,17 @@
# Setup containers to run Airflow

docker-spin-up:
docker compose --env-file env up airflow-init && docker compose --env-file env up --build -d
docker compose up airflow-init && docker compose up --build -d

perms:
sudo mkdir -p logs plugins temp dags tests migrations && sudo chmod -R u=rwx,g=rwx,o=rwx logs plugins temp dags tests migrations
sudo mkdir -p logs plugins temp dags tests migrations data visualization && sudo chmod -R u=rwx,g=rwx,o=rwx logs plugins temp dags tests migrations data visualization

up: perms docker-spin-up warehouse-migration
up: perms docker-spin-up

down:
docker compose down
docker compose down --volumes --rmi all

restart: down up

sh:
docker exec -ti webserver bash
@@ -50,18 +52,6 @@ infra-down:
infra-config:
terraform -chdir=./terraform output

####################################################################################################################
# Create tables in Warehouse

db-migration:
@read -p "Enter migration name:" migration_name; docker exec webserver yoyo new ./migrations -m "$$migration_name"

warehouse-migration:
docker exec webserver yoyo develop --no-config-file --database postgres://sdeuser:sdepassword1234@warehouse:5432/finance ./migrations

warehouse-rollback:
docker exec webserver yoyo rollback --no-config-file --database postgres://sdeuser:sdepassword1234@warehouse:5432/finance ./migrations

####################################################################################################################
# Port forwarding to local machine

109 changes: 80 additions & 29 deletions README.md
@@ -1,21 +1,90 @@


* [Data engineering project template](#data-engineering-project-template)
* [Prerequisites](#prerequisites)
* [Run code](#run-code)
* [Codespaces](#codespaces)
* [Your machine](#your-machine)
* [Infrastructure](#infrastructure)
* [Using template](#using-template)
* [Writing pipelines](#writing-pipelines)
* [(Optional) Advanced cloud setup](#optional-advanced-cloud-setup)
* [Prerequisites:](#prerequisites-1)
* [Tear down infra](#tear-down-infra)

# Data engineering project template

A detailed explanation can be found **[`in this post`](https://www.startdataengineering.com/post/data-engineering-projects-with-free-template/)**

## Prerequisites

To use the template, please install the following.
## Prerequisites

1. [git](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)
2. [Github account](https://github.com/)
3. [Terraform](https://learn.hashicorp.com/tutorials/terraform/install-cli)
4. [AWS account](https://aws.amazon.com/)
5. [AWS CLI installed](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html) and [configured](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html)
6. [Docker](https://docs.docker.com/engine/install/) with at least 4GB of RAM and [Docker Compose](https://docs.docker.com/compose/install/) v1.27.0 or later
3. [Docker](https://docs.docker.com/engine/install/) with at least 4GB of RAM and [Docker Compose](https://docs.docker.com/compose/install/) v1.27.0 or later

## Run code

### Codespaces

Start a Codespace, run `make up`, wait until it's ready, and click the link in the Ports tab to open the Airflow UI.

![CodeSpace start](./assets/images/cs1.png)
![Codespace make up](./assets/images/cs2.png)
![Codespace open Airflow UI](./assets/images/cs3.png)

**Note**: Make sure to turn off your Codespace when you are done; you only have a limited amount of free Codespace usage.

### Your machine

Clone the repo and run the `make up` command as shown here:

```bash
git clone https://github.com/josephmachado/data_engineering_project_template.git
cd data_engineering_project_template
make up
make ci # run checks and tests
sleep 30 # wait for Airflow to start
```
**Windows users**: please set up WSL and a local Ubuntu virtual machine following **[the instructions here](https://ubuntu.com/tutorials/install-ubuntu-on-wsl2-on-windows-10#1-overview)**. Install the above prerequisites in your Ubuntu terminal; if you have trouble installing Docker, follow **[the steps here](https://www.digitalocean.com/community/tutorials/how-to-install-and-use-docker-on-ubuntu-22-04#step-1-installing-docker)** (only Step 1 is necessary). Please install the **make** command with `sudo apt install make -y` (if it's not already present).

Go to [http://localhost:8080](http://localhost:8080) to see the Airflow UI. Username and password are both `airflow`.
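
Since the containers enable Airflow's basic-auth API backend (see `AIRFLOW__API__AUTH_BACKEND` in `docker-compose.yml`), you can also confirm the webserver is up programmatically. The snippet below is a minimal sketch (not part of this repo) that lists registered DAGs via Airflow's stable REST API, assuming the default `airflow`/`airflow` credentials:

```python
import requests

# Query Airflow's stable REST API; basic auth is enabled via
# AIRFLOW__API__AUTH_BACKEND in docker-compose.yml.
resp = requests.get(
    "http://localhost:8080/api/v1/dags",
    auth=("airflow", "airflow"),
    timeout=10,
)
resp.raise_for_status()
for dag in resp.json()["dags"]:
    print(dag["dag_id"], "is_paused:", dag["is_paused"])
```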

## Infrastructure

This data engineering project template includes the following:

1. **`Airflow`**: To schedule and orchestrate DAGs.
2. **`Postgres`**: To store Airflow's metadata (which you can see via the Airflow UI); it also has a schema that represents upstream databases.
3. **`DuckDB`**: To act as our warehouse.
4. **`Quarto with Plotly`**: To convert code in `markdown` format to HTML files that can be embedded in your app or served as is.
5. **`minio`**: To provide an S3-compatible, open-source storage system.

For simplicity, services 1-4 above are installed and run in a single container, defined [here](./containers/airflow/Dockerfile).

If you are using windows please setup WSL and a local Ubuntu Virtual machine following **[the instructions here](https://ubuntu.com/tutorials/install-ubuntu-on-wsl2-on-windows-10#1-overview)**. Install the above prerequisites on your ubuntu terminal, if you have trouble installing docker follow **[the steps here](https://www.digitalocean.com/community/tutorials/how-to-install-and-use-docker-on-ubuntu-22-04#step-1-installing-docker)**.
![File structure](./assets/images/fs.png)
![DET](./assets/images/det.png)

### Setup infra
## Using template

You can use this repo as a template and create your own: click on the `Use this template` button.

![Template](./assets/images/template.png)

## Writing pipelines

We have a sample pipeline at [coincap_elt.py](./dags/coincap_elt.py) that you can use as a starting point for your own DAGs. The tests are available in the [./tests](./tests) folder.

Once the `coincap_elt` DAG runs, the dashboard HTML is available at [./visualization/dashboard.html](./visualization/dashboard.html); it will look like this: ![Dashboard](./assets/images/dash.png).
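
A common pattern for the tests folder is a DAG-integrity check that fails fast on import errors. The following is a minimal sketch of such a test (an illustration, not necessarily the exact contents of `./tests`):

```python
import os

import pytest
from airflow.models import DagBag


@pytest.fixture(scope="session")
def dagbag():
    # Load DAGs from the project's dags/ folder, skipping Airflow's bundled examples.
    dag_folder = os.path.join(os.getenv("AIRFLOW_HOME", "."), "dags")
    return DagBag(dag_folder=dag_folder, include_examples=False)


def test_no_import_errors(dagbag):
    assert dagbag.import_errors == {}, f"DAG import failures: {dagbag.import_errors}"


def test_coincap_elt_loaded(dagbag):
    dag = dagbag.get_dag("coincap_elt")
    assert dag is not None
    assert len(dag.tasks) == 2  # fetch task + dashboard render task
```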

## (Optional) Advanced cloud setup

If you want to run your code on an EC2 instance with Terraform, follow the steps below.

### Prerequisites:

1. [Terraform](https://learn.hashicorp.com/tutorials/terraform/install-cli)
2. [AWS account](https://aws.amazon.com/)
3. [AWS CLI installed](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html) and [configured](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html)

You can create your GitHub repository based on this template by clicking on the `Use this template` button in the **[data_engineering_project_template](https://github.com/josephmachado/data_engineering_project_template)** repository. Clone your repository and replace content in the following files

@@ -26,10 +95,6 @@ You can create your GitHub repository based on this template by clicking on the
Run the following commands in your project directory.

```shell
# Local run & test
make up # start the docker containers on your computer & runs migrations under ./migrations
make ci # Runs auto formatting, lint checks, & all the test files under ./tests

# Create AWS services with Terraform
make tf-init # Only needed on your first terraform run (or if you add new providers)
make infra-up # type in yes after verifying the changes TF will make
@@ -45,21 +110,6 @@ make cloud-metabase # this command will forward Metabase port from EC2 to your m
# use https://github.com/josephmachado/data_engineering_project_template/blob/main/env file to connect to the warehouse from metabase
```

**Data infrastructure**
![DE Infra](/assets/images/infra.png)

**Project structure**
![Project structure](/assets/images/proj_1.png)
![Project structure - GH actions](/assets/images/proj_2.png)

Database migrations can be created as shown below.

```shell
make db-migration # enter a description, e.g. create some schema
# make your changes to the newly created file under ./migrations
make warehouse-migration # to run the new migration on your warehouse
```

For the [continuous delivery](https://github.com/josephmachado/data_engineering_project_template/blob/main/.github/workflows/cd.yml) to work, set up the infrastructure with Terraform and define the following repository secrets. You can set up the repository secrets by going to `Settings > Secrets > Actions > New repository secret`.

1. **`SERVER_SSH_KEY`**: We can get this by running `terraform -chdir=./terraform output -raw private_key` in the project directory and pasting the entire content into a new Action secret called SERVER_SSH_KEY.
@@ -73,4 +123,5 @@ After you are done, make sure to destroy your cloud infrastructure.
```shell
make down # Stop docker containers on your computer
make infra-down # type in yes after verifying the changes TF will make
```
```

Binary file added assets/images/cs1.png
Binary file added assets/images/cs2.png
Binary file added assets/images/cs3.png
Binary file added assets/images/dash.png
Binary file added assets/images/det.png
Binary file added assets/images/fs.png
Binary file added assets/images/template.png
5 changes: 4 additions & 1 deletion containers/airflow/Dockerfile
@@ -1,3 +1,6 @@
FROM apache/airflow:2.2.0
FROM apache/airflow:2.9.2
COPY requirements.txt /
RUN pip install --no-cache-dir -r /requirements.txt

COPY quarto.sh /
RUN cd / && bash /quarto.sh
12 changes: 12 additions & 0 deletions containers/airflow/quarto.sh
@@ -0,0 +1,12 @@
#!/bin/bash

curl -L -o ~/quarto-1.5.43-linux-amd64.tar.gz https://github.com/quarto-dev/quarto-cli/releases/download/v1.5.43/quarto-1.5.43-linux-amd64.tar.gz
mkdir ~/opt
tar -C ~/opt -xvzf ~/quarto-1.5.43-linux-amd64.tar.gz

mkdir ~/.local/bin
ln -s ~/opt/quarto-1.5.43/bin/quarto ~/.local/bin/quarto

( echo ""; echo 'export PATH=$PATH:~/.local/bin\n' ; echo "" ) >> ~/.profile
source ~/.profile

22 changes: 13 additions & 9 deletions containers/airflow/requirements.txt
@@ -1,9 +1,13 @@
black==22.8.0
flake8==5.0.4
mypy==0.971
isort==5.10.1
moto[all]==4.0.6
pytest==7.0.1
pytest-mock==3.6.1
apache-airflow-client==2.3.0
yoyo-migrations==8.0.0
black==24.4.2
flake8==7.0.0
mypy==1.10.0
isort==5.13.2
moto[all]==5.0.9
pytest==8.2.2
pytest-mock==3.14.0
apache-airflow-client==2.9.0
yoyo-migrations==8.2.0
duckdb==1.0.0
plotly==5.22.0
jupyter==1.0.0
types-requests==2.32.0.20240602
4 changes: 0 additions & 4 deletions dags/.gitignore

This file was deleted.

43 changes: 43 additions & 0 deletions dags/coincap_elt.py
@@ -0,0 +1,43 @@
import csv
import os
from datetime import datetime, timedelta

import requests

from airflow import DAG
from airflow.decorators import task
from airflow.operators.bash import BashOperator

with DAG(
    'coincap_elt',
    description='A simple DAG to fetch data \
        from CoinCap Exchanges API and write to a file',
    schedule_interval=timedelta(days=1),
    start_date=datetime(2023, 1, 1),
    catchup=False,
) as dag:

    url = "https://api.coincap.io/v2/exchanges"
    file_path = f'{os.getenv("AIRFLOW_HOME")}/data/coincap_exchanges.csv'

    @task
    def fetch_coincap_exchanges(url, file_path):
        response = requests.get(url)
        data = response.json()
        exchanges = data['data']
        if exchanges:
            keys = exchanges[0].keys()
            with open(file_path, 'w') as f:
                dict_writer = csv.DictWriter(f, fieldnames=keys)
                dict_writer.writeheader()
                dict_writer.writerows(exchanges)

    markdown_path = f'{os.getenv("AIRFLOW_HOME")}/visualization/'
    q_cmd = (
        f'cd {markdown_path} && quarto render {markdown_path}/dashboard.qmd'
    )
    gen_dashboard = BashOperator(
        task_id="generate_dashboard", bash_command=q_cmd
    )

    fetch_coincap_exchanges(url, file_path) >> gen_dashboard
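
The `dashboard.qmd` rendered by the `generate_dashboard` task is not part of this diff. As a rough sketch only, its Python cell might read the generated CSV with DuckDB and chart it with Plotly (both pinned in `requirements.txt`); column names such as `name` and `volumeUsd` are assumptions about the CoinCap payload, not verified against the repo:

```python
import os

import duckdb
import plotly.express as px

# Assumed location of the CSV written by the fetch_coincap_exchanges task.
csv_path = f'{os.getenv("AIRFLOW_HOME", ".")}/data/coincap_exchanges.csv'

# DuckDB can query a CSV file directly by path.
df = duckdb.sql(
    f"SELECT name, CAST(volumeUsd AS DOUBLE) AS volume_usd "
    f"FROM '{csv_path}' ORDER BY volume_usd DESC LIMIT 10"
).df()

fig = px.bar(df, x="name", y="volume_usd", title="Top 10 exchanges by USD volume")
fig.show()
```
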
Empty file added data/.gitkeep
Empty file.
68 changes: 3 additions & 65 deletions docker-compose.yml
@@ -1,41 +1,3 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#

# Basic Airflow cluster configuration for CeleryExecutor with Redis and PostgreSQL.
#
# WARNING: This configuration is for local development. Do not use it in a production deployment.
#
# This configuration supports basic configuration using environment variables or an .env file
# The following variables are supported:
#
# AIRFLOW_IMAGE_NAME - Docker image name used to run Airflow.
# Default: apache/airflow:master-python3.8
# AIRFLOW_UID - User ID in Airflow containers
# Default: 50000
# AIRFLOW_GID - Group ID in Airflow containers
# Default: 50000
# _AIRFLOW_WWW_USER_USERNAME - Username for the administrator account.
# Default: airflow
# _AIRFLOW_WWW_USER_PASSWORD - Password for the administrator account.
# Default: airflow
#
# Feel free to modify this file to suit your needs.
---
version: '3'
x-airflow-common:
&airflow-common
@@ -50,14 +12,11 @@ x-airflow-common:
AIRFLOW__CORE__LOAD_EXAMPLES: 'false'
AIRFLOW__API__AUTH_BACKEND: 'airflow.api.auth.backend.basic_auth'
AIRFLOW_CONN_POSTGRES_DEFAULT: postgres://airflow:airflow@postgres:5432/airflow
WAREHOUSE_USER: ${POSTGRES_USER}
WAREHOUSE_PASSWORD: ${POSTGRES_PASSWORD}
WAREHOUSE_DB: ${POSTGRES_DB}
WAREHOUSE_HOST: ${POSTGRES_HOST}
WARREHOUSE_PORT: ${POSTGRES_PORT}

volumes:
- ./dags:/opt/airflow/dags
- ./data:/opt/airflow/data
- ./visualization:/opt/airflow/visualization
- ./logs:/opt/airflow/logs
- ./plugins:/opt/airflow/plugins
- ./tests:/opt/airflow/tests
@@ -71,7 +30,7 @@ x-airflow-common:
services:
postgres:
container_name: postgres
image: postgres:13
image: postgres:16
environment:
POSTGRES_USER: airflow
POSTGRES_PASSWORD: airflow
@@ -127,24 +86,3 @@ services:
_AIRFLOW_WWW_USER_CREATE: 'true'
_AIRFLOW_WWW_USER_USERNAME: ${_AIRFLOW_WWW_USER_USERNAME:-airflow}
_AIRFLOW_WWW_USER_PASSWORD: ${_AIRFLOW_WWW_USER_PASSWORD:-airflow}

dashboard:
image: metabase/metabase
container_name: dashboard
ports:
- "3000:3000"

warehouse:
image: postgres:13
container_name: warehouse
environment:
POSTGRES_USER: ${POSTGRES_USER}
POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
POSTGRES_DB: ${POSTGRES_DB}
healthcheck:
test: [ "CMD", "pg_isready", "-U", "${POSTGRES_USER}" ]
interval: 5s
retries: 5
restart: always
ports:
- "5439:5432"
5 changes: 0 additions & 5 deletions env

This file was deleted.
