MVP ETL Module #90
Draft: dehume wants to merge 9 commits into main from dennis/ce-814-etl-code-example
Commits (9):

- 20fe2c4 Check (dehume)
- d8efbd6 Merge branch 'main' into dennis/ce-814-etl-code-example (dehume)
- 0f7f135 gitkeep staging dir (dehume)
- 83a81ac dg (dehume)
- bb0e5cb Check (dehume)
- bb1091a Check (dehume)
- 94ea2e2 Check (dehume)
- 6d1b59d Check (dehume)
- ecd21d0 Check (dehume)
@@ -0,0 +1,7 @@
FROM mcr.microsoft.com/devcontainers/python:0-3.11-bullseye
ENV PYTHONUNBUFFERED 1

COPY --from=ghcr.io/astral-sh/uv:0.6.10 /uv /bin/uv

COPY dagster_university/dagster_and_etl/pyproject.toml .
RUN uv pip install -r pyproject.toml --system
@@ -0,0 +1,15 @@
{
  "name": "Dagster & ETL",
  "build": {
    "dockerfile": "Dockerfile",
    "context": "../.."
  },
  "forwardPorts": [
    3000
  ],
  "portsAttributes": {
    "3000": {
      "label": "Dagster"
    }
  }
}
@@ -0,0 +1,20 @@
name: quality-check-dagster-and-etl

on:
  schedule:
    - cron: "0 0 * * 0"

  pull_request:
    types: [opened, synchronize, reopened]
    paths:
      - dagster_university/dagster_and_etl/**

jobs:
  check:
    if: github.event.pull_request.draft == false
    uses: ./.github/workflows/template-quality-check.yml
    with:
      working_directory: ./dagster_university/dagster_and_etl
      # TODO: Disable integration tests from GHA
      # postgres image has no windows/amd64
      windows_pytest_cmd: uv run pytest dagster_and_etl/completed -v -m "not integration"
@@ -186,4 +186,5 @@ tmp*/
# dbt
.user.yml

postgres_data/
postgres_data/
@@ -0,0 +1,53 @@
---
title: Dagster ETL
---

- Lesson 1: Introduction to ETL with Dagster
  - [About this course](/dagster-etl/lesson-1/0-about-this-course)
  - [What is ETL?](/dagster-etl/lesson-1/1-what-is-etl)
  - [ETL and Dagster](/dagster-etl/lesson-1/2-etl-and-dagster)
  - [Project preview](/dagster-etl/lesson-1/3-project-preview)

- Lesson 2: Installation & Setup
  - [Requirements](/dagster-etl/lesson-2/0-requirements)
  - [Set up local](/dagster-etl/lesson-2/1-set-up-local)
  - [Set up Codespace](/dagster-etl/lesson-2/2-set-up-codespace)

- Lesson 3: Loading Static Data into DuckDB
  - [Overview](/dagster-etl/lesson-3/0-overview)
  - [File import](/dagster-etl/lesson-3/1-file-import)
  - [Data integrity](/dagster-etl/lesson-3/2-data-integrity)
  - [Partitions](/dagster-etl/lesson-3/3-partitions)
  - [Complex partitions](/dagster-etl/lesson-3/4-complex-partitions)
  - [Triggering partitions](/dagster-etl/lesson-3/5-triggering-partitions)
  - [Cloud storage](/dagster-etl/lesson-3/6-cloud-storage)

- Lesson 4: ETL with APIs
  - [Overview](/dagster-etl/lesson-4/0-overview)
  - [APIs](/dagster-etl/lesson-4/1-apis)
  - [API resource](/dagster-etl/lesson-4/2-api-resource)
  - [ETL with API](/dagster-etl/lesson-4/3-etl-with-api)
  - [API Dagster](/dagster-etl/lesson-4/4-api-dagster-assets)
  - [Triggering API jobs](/dagster-etl/lesson-4/5-triggering-api-jobs)
  - [Backfilling from APIs](/dagster-etl/lesson-4/6-backfilling-from-apis)

- Lesson 5: Embedded ETL
  - [Overview](/dagster-etl/lesson-5/0-overview)
  - [dlt](/dagster-etl/lesson-5/1-dlt)
  - [Basic dlt](/dagster-etl/lesson-5/2-basic-dlt)
  - [Dagster and dlt](/dagster-etl/lesson-5/3-dagster-and-dlt)
  - [Refactoring static data with dlt](/dagster-etl/lesson-5/4-refactoring-static-data-with-dlt)
  - [Refactoring APIs with dlt](/dagster-etl/lesson-5/5-refactoring-apis-with-dlt)

- Lesson 6: Database replication
  - [Overview](/dagster-etl/lesson-6/0-overview)
  - [Database replication](/dagster-etl/lesson-6/1-database-replication)
  - [dlt database replication set up](/dagster-etl/lesson-6/2-dlt-database-replication-set-up)
  - [dlt database assets](/dagster-etl/lesson-6/3-dlt-database-assets)
  - [Executing the pipeline](/dagster-etl/lesson-6/4-executing-the-pipeline)

- Lesson 7: ETL with Components
  - [Overview](/dagster-etl/lesson-7/0-overview)
  - [Dagster Components](/dagster-etl/lesson-7/1-dagster-components)
  - [dlt with components](/dagster-etl/lesson-7/2-dlt-with-components)
  - [Using components](/dagster-etl/lesson-7/3-using-components)
@@ -0,0 +1,29 @@
---
title: "Lesson 1: About this course & getting help"
module: 'dagster_etl'
lesson: '1'
---

# About this course

This course is geared toward those who have some familiarity with Dagster and want to learn more about ETL. You don't need to be an expert, but you should know your way around a Dagster project.

In this course, you'll learn how to orchestrate ETL pipelines. We will discuss common ETL situations, some of their pitfalls, and how to layer in Dagster to make managing these pipelines easier.

You'll load static files into a data warehouse using schedules, sensors, and partitions. You'll explore the nuances of extracting data from APIs, streamline your workflows with ETL frameworks (dlt), and replicate data across databases. Finally, you'll see how Dagster Components can help you build production-quality ETL solutions with just a few lines of code.

---

## Required experience

To successfully complete this course, you'll need:

- **Dagster familiarity** - You'll need to know the basics of Dagster to complete this course. **If you've never used Dagster before or want a refresher before getting started**, check out the [Dagster Essentials course](https://courses.dagster.io/courses/dagster-essentials).

- **Docker knowledge** - We will provide everything you need to run the project with Docker, but being comfortable with the basics of Docker will be helpful.

---

## Getting help

If you'd like some assistance while working through this course, reach out to the Dagster community on [Slack](https://dagster.io/slack) in the `#dagster-university` channel. **Note**: The Dagster team is generally available during US business hours.
@@ -0,0 +1,23 @@
---
title: "Lesson 1: What is ETL?"
module: 'dagster_etl'
lesson: '1'
---

# What is ETL?

ETL stands for Extract, Transform, Load, and it is the process of consolidating data from various upstream sources into a single storage layer. These upstream sources often span multiple systems and data formats, including application databases, third-party services, and raw files. To fully leverage this data, it's typically best to bring everything into one centralized location, traditionally a data warehouse or data lake, where it can be standardized and made usable.



## ETL vs ELT

A quick note on definitions: if you're familiar with ETL, you may have also encountered ELT. The two approaches are very similar, but as the acronym suggests, the key difference is when the transformation happens. In ELT, data is loaded into the destination system first and transformed afterward.

With the rise of modern data warehouses and lakes that support semi-structured and unstructured data, it's become less critical to transform data into a strict schema before loading. As a result, ETL and ELT are increasingly used interchangeably. Throughout this course, we'll refer to the process as ETL, even if some examples technically follow the ELT pattern.
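
To make the difference concrete, here is a minimal, illustrative sketch (not taken from the course project; the file, table, and database names are hypothetical) showing both orderings with plain Python and DuckDB:

```python
import csv
import duckdb

# Extract: read raw rows from a hypothetical CSV export.
with open("orders.csv") as f:
    rows = list(csv.DictReader(f))

# ETL: transform in Python *before* loading into the warehouse.
cleaned = [
    {"order_id": int(r["order_id"]), "amount": float(r["amount"])}
    for r in rows
    if r["amount"]  # drop rows with a missing amount
]

conn = duckdb.connect("example.duckdb")
conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount DOUBLE)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(r["order_id"], r["amount"]) for r in cleaned],
)

# ELT: load the raw file first, then transform inside the warehouse with SQL.
conn.execute(
    "CREATE OR REPLACE TABLE raw_orders AS SELECT * FROM read_csv_auto('orders.csv')"
)
conn.execute("""
    CREATE OR REPLACE TABLE orders_clean AS
    SELECT CAST(order_id AS INTEGER) AS order_id, CAST(amount AS DOUBLE) AS amount
    FROM raw_orders
    WHERE amount IS NOT NULL
""")
```

In the first half, the transformation happens in Python before the load (ETL); in the second half, the raw file is loaded first and transformed with SQL inside the warehouse (ELT).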

## The Importance of ETL

No matter the industry, ETL is foundational to data systems and applications. When implemented effectively, your data becomes a strategic moat that powers everything from operational dashboards to machine learning pipelines. Whether you're building classic BI reports or cutting-edge AI products, the value lies less in the tools and more in the quality and structure of your data.

Even in emerging areas like large language models (LLMs), it's not the model itself that defines success, but the clean, curated datasets used to generate embeddings and provide meaningful context. In short, great data makes great systems.
@@ -0,0 +1,31 @@
---
title: "Lesson 1: ETL and Dagster"
module: 'dagster_etl'
lesson: '1'
---

# ETL and Dagster

One of Dagster's core strengths is breaking data pipelines down into their individual assets. This provides full lineage and visibility across different systems.

When visualizing your data stack, it's often helpful to think of data flowing from left to right. On the far left, you typically find your raw data sources and the ETL processes that bring that data into your platform. This is a logical starting point for building out a data platform: focusing on ETL assets helps you concentrate on the most important datasets and avoid duplicating effort.



This asset-based approach is what makes Dagster particularly well suited for managing ETL pipelines. Because source assets are so fundamental to building with data, they tend to be reused across multiple projects. For example, if you're ingesting data from an application database, that data may feed into both analytics dashboards and machine learning workflows.

Without an asset-based approach, it's easy to lose track of the fact that multiple processes rely on the same sources.



## Source asset granularity

Another reason an asset-based approach works well for ETL is that data sources tend to represent multiple individual entities. Consider a pipeline that ingests data from your application database: you're likely pulling in multiple schemas or tables, each used by specific data applications.



Each of these entities should be tracked as its own asset, so you can associate downstream processes with each one individually. That granularity gives you the ability to monitor, reason about, and recover from failures more effectively.

For example, if one source table fails to ingest, an asset-based approach allows you to quickly understand which downstream assets and applications are impacted. This level of observability and control is what makes asset-based orchestration so powerful, especially in the context of managing critical ETL pipelines.


@@ -0,0 +1,19 @@
---
title: "Lesson 1: Project preview"
module: 'dagster_etl'
lesson: '1'
---

# Project preview

In this course, we'll focus on ETL and how to manage data ingestion using Dagster. All of the examples walk through real-world ETL workflows you're likely to encounter, covering a variety of data sources and the unique challenges they present.

By the end of the course, you will:

- Create scheduled and event-driven pipelines to ingest files
- Build a custom resource to pull data from an external API
- Use embedded ETL (with dlt) to build more resilient applications
- Replicate data across databases
- Refactor your code using Dagster Components for better modularity and reuse

If you get stuck or want to jump ahead, check out the [finished project here on GitHub](https://github.com/dagster-io/project-dagster-university/tree/main/dagster_university/dagster_and_etl/dagster_and_etl/completed).
@@ -0,0 +1,14 @@
---
title: "Lesson 2: Set up requirements"
module: 'dagster_etl'
lesson: '2'
---

# Set up requirements

This is an interactive class where you will be coding. To follow along, you can either run the project locally on your own machine or work with the code in GitHub Codespaces, where all requirements are already set up.

You only need to follow the setup for one of these options; please skip the other:

- [Local Development](/dagster-etl/lesson-2/1-set-up-local)
- [GitHub Codespaces](/dagster-etl/lesson-2/2-set-up-codespace)
@@ -0,0 +1,63 @@
---
title: "Lesson 2: Set up local"
module: 'dagster_etl'
lesson: '2'
---

# Set up local

- **Install git.** Refer to the [Git documentation](https://github.com/git-guides/install-git) if you don't have it installed.
- **Have Python installed.** Dagster supports Python 3.9 - 3.12.
- **Install a package manager.** To manage the Python packages, we recommend [`uv`](https://docs.astral.sh/uv/), which Dagster uses internally.

---

## Clone the Dagster University project

Run the following to clone the project.

```bash
git clone git@github.com:dagster-io/project-dagster-university.git
```

After cloning the Dagster University project, navigate to the specific course within the repository.

```bash
cd dagster_university/dagster_and_etl
```

## Install the dependencies

**uv**

To install the Python dependencies with [uv](https://docs.astral.sh/uv/), run:

```bash
uv sync
```

This will create a virtual environment that you can activate with:

```bash
source .venv/bin/activate
```

**pip**

Create the virtual environment.

```bash
python3 -m venv .venv
```

Enter the virtual environment.

```bash
source .venv/bin/activate
```

Install the packages.

```bash
pip install -e ".[dev]"
```
@@ -0,0 +1,55 @@
---
title: "Lesson 2: Set up with GitHub Codespaces"
module: 'dagster_etl'
lesson: '2'
---

# Set up with GitHub Codespaces

Instead of setting up a local environment, you can use [GitHub Codespaces](https://github.com/features/codespaces). This will allow you to work through this course and edit code in this repository in a cloud-based environment.

## Creating a GitHub Codespace

There are unique Codespaces for the different courses in Dagster University. Be sure to select the correct one when creating a Codespace.

1. While logged into GitHub, go to the [Codespaces page](https://github.com/codespaces).
2. In the top right, select "New Codespace".
3. Create a Codespace using the following settings.

| Field | Value |
| --- | --- |
| Repository | dagster-io/project-dagster-university |
| Branch | main |
| Dev container configuration | Dagster & ETL |
| Region | US East |
| Machine type | 2-core |



4. Click "Create codespace".

The first time you create a Codespace, it may take a minute for everything to start. You will then be dropped into an interactive editor containing the code for the entire Dagster University repository.

## Working in the Codespace

In the terminal at the bottom of the Codespace IDE, navigate to the specific course.

```bash
cd dagster_university/dagster_and_etl
```

To ensure everything is working, you can launch the Dagster UI.

```bash
dagster dev
```

After Dagster starts running, you will be prompted to open the Dagster UI in your browser. Click "Open in Browser".



## Stopping your GitHub Codespace

Be sure to stop your Codespace when you are not using it. GitHub provides personal accounts with [120 core hours per month](https://docs.github.com/en/billing/managing-billing-for-your-products/managing-billing-for-github-codespaces/about-billing-for-github-codespaces#monthly-included-storage-and-core-hours-for-personal-accounts).


Check warning - Code scanning / CodeQL: Workflow does not contain permissions (Medium)

Copilot Autofix (AI, 1 day ago): To fix the issue, we will add a `permissions` block at the root level of the workflow. Since this workflow appears to perform quality checks and does not seem to require write access, we will set the permissions to `contents: read`. This ensures that the `GITHUB_TOKEN` has only the minimal permissions necessary for the workflow to execute.