MVP ETL Module #90
Draft: dehume wants to merge 15 commits into main from dennis/ce-814-etl-code-example
Changes from 8 commits
- 20fe2c4 Check (dehume)
- d8efbd6 Merge branch 'main' into dennis/ce-814-etl-code-example (dehume)
- 0f7f135 gitkeep staging dir (dehume)
- 83a81ac dg (dehume)
- bb0e5cb Check (dehume)
- bb1091a Check (dehume)
- 94ea2e2 Check (dehume)
- 6d1b59d Check (dehume)
- ecd21d0 Check (dehume)
- 9b80262 Check (dehume)
- 37262ef Check (dehume)
- 73afebc Christian Suggestions (dehume)
- 105212e Colton review part 1 (dehume)
- 218169b Run through (dehume)
- c8b53b6 Merge branch 'main' into dennis/ce-814-etl-code-example (dehume)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
FROM mcr.microsoft.com/devcontainers/python:0-3.11-bullseye | ||
ENV PYTHONUNBUFFERED=1 | ||
|
||
COPY --from=ghcr.io/astral-sh/uv:0.6.10 /uv /bin/uv | ||
|
||
COPY dagster_university/dagster_and_etl/pyproject.toml . | ||
RUN uv pip install -r pyproject.toml --system |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
{ | ||
"name": "Dagster & ETL", | ||
"build": { | ||
"dockerfile": "Dockerfile", | ||
"context": "../.." | ||
}, | ||
"forwardPorts": [ | ||
3000 | ||
], | ||
"portsAttributes": { | ||
"3000": { | ||
"label": "Dagster" | ||
} | ||
} | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
name: quality-check-dagster-and-etl | ||
|
||
on: | ||
schedule: | ||
- cron: "0 0 * * 0" | ||
|
||
pull_request: | ||
types: [opened, synchronize, reopened] | ||
paths: | ||
- dagster_university/dagster_and_etl/** | ||
|
||
jobs: | ||
check: | ||
if: github.event.pull_request.draft == false | ||
uses: ./.github/workflows/template-quality-check.yml | ||
with: | ||
working_directory: ./dagster_university/dagster_and_etl | ||
# TODO: Disable integration tests from GHA | ||
# postgres image has no windows/amd64 | ||
windows_pytest_cmd: uv run pytest dagster_and_etl/completed -v -m "not integration" | ||
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -186,4 +186,5 @@ tmp*/ | |
# dbt | ||
.user.yml | ||
|
||
postgres_data/ | ||
postgres_data/ | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,52 @@ | ||
--- | ||
title: Dagster ETL | ||
--- | ||
|
||
- Lesson 1: Introduction to ETL with Dagster | ||
- [About this course](/dagster-etl/lesson-1/0-about-this-course) | ||
- [What is ETL?](/dagster-etl/lesson-1/1-what-is-etl) | ||
- [ETL and Dagster](/dagster-etl/lesson-1/2-etl-and-dagster) | ||
- [Project preview](/dagster-etl/lesson-1/3-project-preview) | ||
|
||
- Lesson 2: Installation & Setup | ||
- [Requirements](/dagster-etl/lesson-2/0-requirements) | ||
- [Set up local](/dagster-etl/lesson-2/1-set-up-local) | ||
- [Set up Codespace](/dagster-etl/lesson-2/2-set-up-codespace) | ||
|
||
- Lesson 3: Loading Static Data into DuckDB | ||
- [Overview](/dagster-etl/lesson-3/0-overview) | ||
- [File import](/dagster-etl/lesson-3/1-file-import) | ||
- [Data integrity](/dagster-etl/lesson-3/2-data-integrity) | ||
- [Partitions](/dagster-etl/lesson-3/3-partitions) | ||
- [Complex partitions](/dagster-etl/lesson-3/4-complex-partitions) | ||
- [Triggering partitions](/dagster-etl/lesson-3/5-triggering-partitions) | ||
|
||
- Lesson 4: ETL with APIs | ||
- [Overview](/dagster-etl/lesson-4/0-overview) | ||
- [APIs](/dagster-etl/lesson-4/1-apis) | ||
- [API resource](/dagster-etl/lesson-4/2-api-resource) | ||
- [ETL with API](/dagster-etl/lesson-4/3-etl-with-api) | ||
- [API Dagster](/dagster-etl/lesson-4/4-api-dagster-assets) | ||
- [Triggering API jobs](/dagster-etl/lesson-4/5-triggering-api-jobs) | ||
- [Backfilling from APIs](/dagster-etl/lesson-4/6-backfilling-from-apis) | ||
|
||
- Lesson 5: Embedded ETL | ||
- [Overview](/dagster-etl/lesson-5/0-overview) | ||
- [dlt](/dagster-etl/lesson-5/1-dlt) | ||
- [Basic dlt](/dagster-etl/lesson-5/2-basic-dlt) | ||
- [Dagster and dlt](/dagster-etl/lesson-5/3-dagster-and-dlt) | ||
|
||
- [Refactoring static data with dlt](/dagster-etl/lesson-5/4-refactoring-static-data-with-dlt) | ||
- [Refactoring APIs with dlt](/dagster-etl/lesson-5/5-refactoring-apis-with-dlt) | ||
|
||
- Lesson 6: Database replication | ||
- [Overview](/dagster-etl/lesson-6/0-overview) | ||
- [Database replication](/dagster-etl/lesson-6/1-database-replication) | ||
- [dlt database replication set up](/dagster-etl/lesson-6/2-dlt-database-replication-set-up) | ||
- [dlt database assets](/dagster-etl/lesson-6/3-dlt-database-assets) | ||
- [Executing pipeline](/dagster-etl/lesson-6/4-executing-pipeline) | ||
|
||
- Lesson 7: ETL with Components | ||
- [Overview](/dagster-etl/lesson-7/0-overview) | ||
- [Dagster Components](/dagster-etl/lesson-7/1-dagster-components) | ||
- [dlt with components](/dagster-etl/lesson-7/2-dlt-with-components) | ||
- [Using components](/dagster-etl/lesson-7/3-using-components) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,27 @@ | ||
--- | ||
title: "Lesson 1: About this course & getting help" | ||
module: 'dagster_etl' | ||
lesson: '1' | ||
--- | ||
|
||
# About this course | ||
|
||
This course is geared towards those who have some familiarity with Dagster and want to learn more about ETL. You don't need to be an expert, but you should know your way around a Dagster project. | ||
|
||
In this course, you’ll learn how to orchestrate your ETL layer using Dagster. You’ll load static files into a data warehouse using schedules, sensors, and partitions. You'll explore the nuances of extracting data from APIs, streamline your workflows with ETL frameworks, and replicate data across databases. Finally, you'll see how Dagster Components can help you build production-quality ETL solutions with just a few lines of code. | ||
|
||
--- | ||
|
||
## Required experience | ||
|
||
To successfully complete this course, you’ll need: | ||
|
||
- **Dagster familiarity** - You'll need to know the basics of Dagster to complete this course. **If you've never used Dagster before or want a refresher before getting started**, check out the [Dagster Essentials course](https://courses.dagster.io/courses/dagster-essentials). | ||
|
||
- **Docker knowledge** - We provide everything you need to run the Docker-based parts of this course, but familiarity with the basics of Docker will be helpful. | ||
|
||
--- | ||
|
||
## Getting help | ||
|
||
If you'd like some assistance while working through this course, reach out to the Dagster community on [Slack](https://dagster.io/slack) in the `#dagster-university` channel. **Note**: The Dagster team is generally available during US business hours. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,23 @@ | ||
--- | ||
title: "Lesson 1: What is ETL?" | ||
module: 'dagster_etl' | ||
lesson: '1' | ||
--- | ||
|
||
# What is ETL? | ||
|
||
ETL stands for Extract, Transform, Load. It is the process of consolidating data from various upstream sources into a single storage layer. These upstream sources often span multiple systems and data formats, including application databases, third-party services, and raw files. To fully leverage this data, it’s typically best to bring everything into one centralized location, traditionally a data warehouse or data lake, where it can be standardized and made usable across the organization. | ||
|
||
 | ||
|
||
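To make the three steps concrete, here is a minimal sketch that consolidates two differently formatted sources into a single table. Everything in it is an assumption for illustration: the inline CSV and JSON strings stand in for a raw file and a third-party service, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import json
import sqlite3

# Hypothetical upstream sources: a raw CSV file and a JSON API response.
CSV_SOURCE = "id,name\n1,Ada\n2,Grace\n"
JSON_SOURCE = '[{"id": 3, "name": "Edsger"}]'


def extract() -> list[dict]:
    """Extract: read rows from both sources into one list of dictionaries."""
    rows = list(csv.DictReader(io.StringIO(CSV_SOURCE)))
    rows.extend(json.loads(JSON_SOURCE))
    return rows


def transform(rows: list[dict]) -> list[tuple[int, str]]:
    """Transform: standardize types so every row is (integer id, string name)."""
    return [(int(row["id"]), str(row["name"])) for row in rows]


def load(conn: sqlite3.Connection, rows: list[tuple[int, str]]) -> None:
    """Load: write the standardized rows into a single destination table."""
    conn.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", rows)
    conn.commit()


# SQLite in memory stands in for the centralized storage layer.
conn = sqlite3.connect(":memory:")
load(conn, transform(extract()))
print(conn.execute("SELECT COUNT(*) FROM users").fetchone()[0])
```

The CSV source yields `id` as a string while the JSON source yields an integer, which is why the transform step casts every row to one shape before loading.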
## ETL vs ELT | ||
|
||
A quick note on definitions. If you're familiar with ETL, you may have also encountered ELT. The two approaches are very similar, but as the acronym suggests, the key difference is when the transformation happens. In ELT, data is loaded first into the destination system, and transformed afterward. | ||
|
||
With the rise of modern data warehouses and lakes that support semi-structured and unstructured data, it's become less critical to transform data into a strict schema before loading. As a result, ETL and ELT are increasingly used interchangeably. Throughout this course, we’ll refer to the process as ETL, even if some examples technically follow the ELT pattern. | ||
|
||
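The ordering difference can be sketched with the same raw records processed both ways. This is an illustrative assumption rather than code from the course project: the records, table names, and the in-memory SQLite database are all hypothetical.

```python
import sqlite3

# Hypothetical raw records with string ids and inconsistently formatted names.
RAW_ROWS = [("1", " Ada "), ("2", "GRACE")]


def run_etl(conn: sqlite3.Connection) -> None:
    # ETL: clean the rows in Python first, then load only the cleaned result.
    cleaned = [(int(raw_id), name.strip().lower()) for raw_id, name in RAW_ROWS]
    conn.execute("CREATE TABLE users_etl (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO users_etl VALUES (?, ?)", cleaned)


def run_elt(conn: sqlite3.Connection) -> None:
    # ELT: load the raw rows untouched, then transform inside the database.
    conn.execute("CREATE TABLE raw_users (id TEXT, name TEXT)")
    conn.executemany("INSERT INTO raw_users VALUES (?, ?)", RAW_ROWS)
    conn.execute(
        "CREATE TABLE users_elt AS "
        "SELECT CAST(id AS INTEGER) AS id, LOWER(TRIM(name)) AS name "
        "FROM raw_users"
    )


conn = sqlite3.connect(":memory:")
run_etl(conn)
run_elt(conn)
etl_rows = conn.execute("SELECT id, name FROM users_etl ORDER BY id").fetchall()
elt_rows = conn.execute("SELECT id, name FROM users_elt ORDER BY id").fetchall()
print(etl_rows == elt_rows)
```

Both paths end with the same cleaned table. The ELT path also keeps `raw_users` in the destination, so the transformation can be rerun later without extracting again.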
|
||
## The Importance of ETL | ||
|
||
No matter the industry, ETL is foundational to data systems and applications. When implemented effectively, your data becomes a strategic moat that powers everything from operational dashboards to machine learning pipelines. Whether you're building classic BI reports or cutting-edge AI products, the value lies less in the tools and more in the quality and structure of your data. | ||
|
||
Even in emerging areas like large language models (LLMs), it's not the model itself that defines success, but the clean, curated datasets used to generate embeddings and provide meaningful context. In short, great data makes great systems. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,29 @@ | ||
--- | ||
title: "Lesson 1: ETL and Dagster" | ||
module: 'dagster_etl' | ||
lesson: '1' | ||
--- | ||
|
||
# ETL and Dagster | ||
|
||
One of Dagster’s core strengths is providing full lineage visibility. This often starts by consolidating disparate datasets. Think about your own data stack. You likely have many different systems collecting and storing data that you want to use. Managing this complexity at scale becomes overwhelming without a dedicated system in place to account for the work being done across your pipelines. | ||
|
||
When visualizing your data stack, it’s often helpful to think of data flowing from left to right. On the far left, you typically find your raw data sources and the ETL processes that bring that data into your platform. This is a logical starting point for building out a data platform: focusing on ETL assets helps you concentrate on the most important datasets and avoid duplicating effort. | ||
|
||
 | ||
|
||
Dagster is particularly well-suited for managing ETL pipelines because source assets are often reused across multiple downstream projects. For example, if you're ingesting data from an application database, that data may feed into both analytics dashboards and machine learning workflows. This is where Dagster’s asset-based perspective shines by helping you reason about data dependencies and usage across your organization. | ||
|
||
 | ||
|
||
## ETL and Dagster assets | ||
|
||
Consider a pipeline that ingests data from your application database. You're likely pulling in multiple tables or objects, each destined for a specific schema and table in your data warehouse. | ||
|
||
 | ||
|
||
Each of these entities should be tracked as its own asset, so you can associate downstream processes with each one individually. That granularity gives you the ability to monitor, reason about, and recover from failures more effectively. | ||
|
||
For example, if one source table fails to ingest, Dagster allows you to quickly understand which downstream assets and applications are impacted. This level of observability and control is what makes asset-based orchestration so powerful, especially when managing critical ETL pipelines. | ||
|
||
 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
--- | ||
title: "Lesson 1: Project preview" | ||
module: 'dagster_etl' | ||
lesson: '1' | ||
--- | ||
|
||
# Project preview | ||
|
||
In this course, we’ll focus on ETL and how to manage data ingestion using Dagster. All of the examples will walk through real-world ETL workflows you're likely to encounter, covering a variety of data sources and the unique challenges they present. | ||
|
||
By the end of the course, you will: | ||
|
||
- Create scheduled and event-driven pipelines to ingest files | ||
- Build a custom resource to pull data from an external API | ||
- Use Embedded ETL (with dlt) to build more resilient applications | ||
- Replicate data across databases | ||
- Refactor your code using Dagster Components for better modularity and reuse | ||
|
||
If you get stuck or want to jump ahead, check out the [finished project here on GitHub](https://github.com/dagster-io/project-dagster-university/tree/main/dagster_university/dagster_and_etl/dagster_and_etl/completed). |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
--- | ||
title: "Lesson 2: Set up requirements" | ||
module: 'dagster_etl' | ||
lesson: '2' | ||
--- | ||
|
||
# Set up requirements | ||
|
||
This is an interactive class where you will be coding. To follow along, you can either run the project locally on your own machine or work with the code in GitHub Codespaces, where all requirements are already set up. | ||
|
||
|
||
You only need to follow the setup for one of these options; skip the other: | ||
|
||
- [Local Development](/dagster-etl/lesson-2/1-set-up-local) | ||
- [GitHub Codespaces](/dagster-etl/lesson-2/2-set-up-codespace) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,63 @@ | ||
--- | ||
title: "Lesson 2: Set up local" | ||
module: 'dagster_etl' | ||
lesson: '2' | ||
--- | ||
|
||
# Set up local | ||
|
||
To follow along locally, you'll need: | ||
- **To install git.** Refer to the [Git documentation](https://github.com/git-guides/install-git) if you don’t have this installed. | ||
- **To have Python installed.** Dagster supports Python 3.9 - 3.12. | ||
- **To install a package manager**. To manage the Python packages, we recommend [`uv`](https://docs.astral.sh/uv/), which Dagster uses internally. | ||
|
||
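If you are unsure whether your interpreter falls in the supported range, a quick check like the following can help. It is a small sketch, not part of the course project: the bounds are copied from the requirement above, and `is_supported` is a hypothetical helper name.

```python
import sys

# Supported range taken from the requirement above: Python 3.9 through 3.12.
MIN_VERSION = (3, 9)
MAX_VERSION = (3, 12)


def is_supported(version: tuple[int, int]) -> bool:
    """Return True when a (major, minor) version is inside the supported range."""
    return MIN_VERSION <= version <= MAX_VERSION


# Compare only major and minor; the patch release does not matter here.
current = (sys.version_info[0], sys.version_info[1])
print(f"Python {current[0]}.{current[1]} supported: {is_supported(current)}")
```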
--- | ||
|
||
## Clone the Dagster University project | ||
|
||
Run the following to clone the project. | ||
|
||
```bash | ||
git clone git@github.com:dagster-io/project-dagster-university.git | ||
``` | ||
|
||
After cloning the Dagster University project, you’ll want to navigate to the specific course within the repository. | ||
|
||
```bash | ||
cd dagster_university/dagster_and_etl | ||
``` | ||
|
||
## Install the dependencies | ||
|
||
**uv** | ||
|
||
Install the Python dependencies with [uv](https://docs.astral.sh/uv/). | ||
|
||
```bash | ||
uv sync | ||
``` | ||
|
||
This will create a virtual environment that you can now use. | ||
|
||
```bash | ||
source .venv/bin/activate | ||
``` | ||
|
||
**pip** | ||
|
||
Create the virtual environment. | ||
|
||
```bash | ||
python3 -m venv .venv | ||
``` | ||
|
||
Enter the virtual environment. | ||
|
||
```bash | ||
source .venv/bin/activate | ||
``` | ||
|
||
Install the packages. | ||
|
||
```bash | ||
pip install -e ".[dev]" | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,55 @@ | ||
--- | ||
title: "Lesson 2: Set up with GitHub Codespaces" | ||
module: 'dagster_etl' | ||
lesson: '2' | ||
--- | ||
|
||
# Set up with GitHub Codespaces | ||
|
||
Instead of setting up a local environment, you can use [GitHub Codespaces](https://github.com/features/codespaces). This allows you to work through this course and edit code in this repository in a cloud-based environment. | ||
|
||
## Creating a GitHub Codespace | ||
|
||
There are unique Codespaces for the different courses in Dagster University. Be sure to select the correct one when creating a Codespace. | ||
|
||
1. While logged into GitHub, go to the [Codespaces page](https://github.com/codespaces). | ||
2. In the top right, select "New Codespace". | ||
3. Create a Codespace using the following settings: | ||
|
||
| Field | Value | | ||
|--- | --- | | ||
| Repository | dagster-io/project-dagster-university | | ||
| Branch | main | | ||
| Dev container configuration | Dagster & ETL | | ||
| Region | US East | | ||
| Machine type | 2-core | | ||
|
||
 | ||
|
||
4. Click "Create codespace". | ||
|
||
The first time you create a Codespace, it may take a minute for everything to start. You will then be dropped into an interactive editor containing the code for the entire Dagster University repository. | ||
|
||
## Working in the Codespace | ||
|
||
In the terminal at the bottom of the Codespace IDE, navigate to the specific course. | ||
|
||
```bash | ||
cd dagster_university/dagster_and_etl | ||
``` | ||
|
||
To ensure everything is working you can launch the Dagster UI. | ||
|
||
```bash | ||
dagster dev | ||
``` | ||
|
||
After Dagster starts running you will be prompted to open the Dagster UI within your browser. Click "Open in Browser". | ||
|
||
 | ||
|
||
## Stopping your GitHub Codespace | ||
|
||
Be sure to stop your Codespace when you are not using it. GitHub provides personal accounts with [120 core hours per month](https://docs.github.com/en/billing/managing-billing-for-your-products/managing-billing-for-github-codespaces/about-billing-for-github-codespaces#monthly-included-storage-and-core-hours-for-personal-accounts). | ||
|
||
 |