
MVP ETL Module #90


Draft: wants to merge 9 commits into main
7 changes: 7 additions & 0 deletions .devcontainer/dagster-and-etl/Dockerfile
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
FROM mcr.microsoft.com/devcontainers/python:0-3.11-bullseye
ENV PYTHONUNBUFFERED 1

COPY --from=ghcr.io/astral-sh/uv:0.6.10 /uv /bin/uv

COPY dagster_university/dagster_and_etl/pyproject.toml .
RUN uv pip install -r pyproject.toml --system
15 changes: 15 additions & 0 deletions .devcontainer/dagster-and-etl/devcontainer.json
@@ -0,0 +1,15 @@
{
"name": "Dagster & ETL",
"build": {
"dockerfile": "Dockerfile",
"context": "../.."
},
"forwardPorts": [
3000
],
"portsAttributes": {
"3000": {
"label": "Dagster"
}
}
}
1 change: 1 addition & 0 deletions .github/workflows/quality-check-dagster-and-dbt.yml
@@ -11,6 +11,7 @@

jobs:
check:
if: github.event.pull_request.draft == false
uses: ./.github/workflows/template-quality-check.yml
with:
working_directory: ./dagster_university/dagster_and_dbt

**Check warning (Code scanning / CodeQL):** Workflow does not contain permissions (Medium). Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: `{}`
20 changes: 20 additions & 0 deletions .github/workflows/quality-check-dagster-and-etl.yml
@@ -0,0 +1,20 @@
name: quality-check-dagster-and-etl

on:
schedule:
- cron: "0 0 * * 0"

pull_request:
types: [opened, synchronize, reopened]
paths:
- dagster_university/dagster_and_etl/**

jobs:
check:
if: github.event.pull_request.draft == false
uses: ./.github/workflows/template-quality-check.yml
with:
working_directory: ./dagster_university/dagster_and_etl
# TODO: Disable integration tests from GHA
# postgres image has no windows/amd64
windows_pytest_cmd: uv run pytest dagster_and_etl/completed -v -m "not integration"
Comment on lines +14 to +20

**Check warning (Code scanning / CodeQL):** Workflow does not contain permissions (Medium). Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: `{}`

**Copilot Autofix** (AI, 1 day ago)

To fix the issue, we will add a permissions block at the root level of the workflow. Since this workflow appears to perform quality checks and does not seem to require write access, we will set the permissions to `contents: read`. This ensures that the GITHUB_TOKEN has only the minimal permissions necessary for the workflow to execute.

Suggested changeset 1: `.github/workflows/quality-check-dagster-and-etl.yml`. Run the following command in your local git repository to apply this patch:

```bash
cat << 'EOF' | git apply
diff --git a/.github/workflows/quality-check-dagster-and-etl.yml b/.github/workflows/quality-check-dagster-and-etl.yml
--- a/.github/workflows/quality-check-dagster-and-etl.yml
+++ b/.github/workflows/quality-check-dagster-and-etl.yml
@@ -2,2 +2,5 @@
 
+permissions:
+  contents: read
+
 on:
EOF
```
1 change: 1 addition & 0 deletions .github/workflows/quality-check-dagster-essentials.yml
@@ -11,6 +11,7 @@

jobs:
check:
if: github.event.pull_request.draft == false
uses: ./.github/workflows/template-quality-check.yml
with:
working_directory: ./dagster_university/dagster_essentials

**Check warning (Code scanning / CodeQL):** Workflow does not contain permissions (Medium). Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: `{}`
1 change: 1 addition & 0 deletions .github/workflows/quality-check-dagster-testing.yml
@@ -11,8 +11,9 @@

jobs:
check:
if: github.event.pull_request.draft == false
uses: ./.github/workflows/template-quality-check.yml
with:
working_directory: ./dagster_university/dagster_testing
# TODO: Disable integration tests from GHA
# postgres image has no windows/amd64
3 changes: 2 additions & 1 deletion .gitignore
@@ -186,4 +186,5 @@ tmp*/
# dbt
.user.yml

postgres_data/
postgres_data/

3 changes: 2 additions & 1 deletion README.md
@@ -6,4 +6,5 @@ Welcome to [Dagster University](https://courses.dagster.io/). This contains all
|-------------|-------------|
| [`Dagster Essentials`](dagster_university/dagster_essentials/README.md) | [Dagster Essentials Course](https://courses.dagster.io/courses/dagster-essentials) |
| [`Dagster & dbt`](dagster_university/dagster_and_dbt/README.md) | [Dagster + dbt Course](https://courses.dagster.io/courses/dagster-dbt) |
| [`Testing with Dagster`](dagster_university/dagster_testing/README.md) | [Testing with Dagster Course](https://courses.dagster.io/courses/dagster-testing) |
| [`Dagster & ETL`](dagster_university/dagster_and_etl/README.md) | |
| [`Testing with Dagster`](dagster_university/dagster_testing/README.md) | [Testing with Dagster Course](https://courses.dagster.io/courses/dagster-testing) |
@@ -31,7 +31,7 @@ dbt_resource = DbtCliResource(

The code above:

1. Imports the `DbtCliResource` from the `dagster_dbt` package that we installed earlier
2. Imports the `dbt_project` representation we just defined
3. Instantiates a new `DbtCliResource` under the variable name `dbt_resource`
4. Tells the resource that the dbt project to execute is the `dbt_project`
1. Imports the `DbtCliResource` from the `dagster_dbt` package that we installed earlier.
2. Imports the `dbt_project` representation we just defined.
3. Instantiates a new `DbtCliResource` under the variable name `dbt_resource`.
4. Tells the resource that the dbt project to execute is the `dbt_project`.
53 changes: 53 additions & 0 deletions course/pages/dagster-etl.md
@@ -0,0 +1,53 @@
---
title: Dagster ETL
---

- Lesson 1: Introduction to ETL with Dagster
- [About this course](/dagster-etl/lesson-1/0-about-this-course)
- [What is ETL?](/dagster-etl/lesson-1/1-what-is-etl)
- [ETL and Dagster](/dagster-etl/lesson-1/2-etl-and-dagster)
- [Project preview](/dagster-etl/lesson-1/3-project-preview)

- Lesson 2: Installation & Setup
- [Requirements](/dagster-etl/lesson-2/0-requirements)
- [Set up local](/dagster-etl/lesson-2/1-set-up-local)
- [Set up Codespace](/dagster-etl/lesson-2/2-set-up-codespace)

- Lesson 3: Loading Static Data into DuckDB
- [Overview](/dagster-etl/lesson-3/0-overview)
- [File import](/dagster-etl/lesson-3/1-file-import)
- [Data integrity](/dagster-etl/lesson-3/2-data-integrity)
- [Partitions](/dagster-etl/lesson-3/3-partitions)
- [Complex partitions](/dagster-etl/lesson-3/4-complex-partitions)
- [Triggering partitions](/dagster-etl/lesson-3/5-triggering-partitions)
- [Cloud storage](/dagster-etl/lesson-3/6-cloud-storage)

- Lesson 4: ETL with APIs
- [Overview](/dagster-etl/lesson-4/0-overview)
- [APIs](/dagster-etl/lesson-4/1-apis)
- [API resource](/dagster-etl/lesson-4/2-api-resource)
- [ETL with API](/dagster-etl/lesson-4/3-etl-with-api)
- [API Dagster](/dagster-etl/lesson-4/4-api-dagster-assets)
- [Triggering API jobs](/dagster-etl/lesson-4/5-triggering-api-jobs)
- [Backfilling from APIs](/dagster-etl/lesson-4/6-backfilling-from-apis)

- Lesson 5: Embedded ETL
- [Overview](/dagster-etl/lesson-5/0-overview)
- [dlt](/dagster-etl/lesson-5/1-dlt)
- [Basic dlt](/dagster-etl/lesson-5/2-basic-dlt)
- [Dagster and dlt](/dagster-etl/lesson-5/3-dagster-and-dlt)
- [Refactoring static data with dlt](/dagster-etl/lesson-5/4-refactoring-static-data-with-dlt)
- [Refactoring APIs with dlt](/dagster-etl/lesson-5/5-refactoring-apis-with-dlt)

- Lesson 6: Database replication
- [Overview](/dagster-etl/lesson-6/0-overview)
- [Database replication](/dagster-etl/lesson-6/1-database-replication)
- [dlt database replication set up](/dagster-etl/lesson-6/2-dlt-database-replication-set-up)
- [dlt database assets](/dagster-etl/lesson-6/3-dlt-database-assets)
- [Executing the pipeline](/dagster-etl/lesson-6/4-executing-the-pipeline)

- Lesson 7: ETL with Components
- [Overview](/dagster-etl/lesson-7/0-overview)
- [Dagster Components](/dagster-etl/lesson-7/1-dagster-components)
- [dlt with components](/dagster-etl/lesson-7/2-dlt-with-components)
- [Using components](/dagster-etl/lesson-7/3-using-components)
29 changes: 29 additions & 0 deletions course/pages/dagster-etl/lesson-1/0-about-this-course.md
@@ -0,0 +1,29 @@
---
title: "Lesson 1: About this course & getting help"
module: 'dagster_etl'
lesson: '1'
---

# About this course

This course is geared toward those who have some familiarity with Dagster and want to learn more about ETL. You don't need to be an expert, but you should know your way around a Dagster project.

In this course, you’ll learn how to orchestrate ETL pipelines. We will discuss common ETL situations, some of their pitfalls, and how to effectively layer in Dagster to make managing these pipelines easier.

You’ll load static files into a data warehouse using schedules, sensors, and partitions. You'll explore the nuances of extracting data from APIs, streamline your workflows with ETL frameworks (dlt), and replicate data across databases. Finally, you'll see how Dagster Components can help you build production-quality ETL solutions with just a few lines of code.

---

## Required experience

To successfully complete this course, you’ll need:

- **Dagster familiarity** - You'll need to know the basics of Dagster to complete this course. **If you've never used Dagster before or want a refresher before getting started**, check out the [Dagster Essentials course](https://courses.dagster.io/courses/dagster-essentials).

- **Docker knowledge** - We will provide everything you need to run the course's Docker services, but being able to navigate the basics of Docker will be helpful.

---

## Getting help

If you'd like some assistance while working through this course, reach out to the Dagster community on [Slack](https://dagster.io/slack) in the `#dagster-university` channel. **Note**: The Dagster team is generally available during US business hours.
23 changes: 23 additions & 0 deletions course/pages/dagster-etl/lesson-1/1-what-is-etl.md
@@ -0,0 +1,23 @@
---
title: "Lesson 1: What is ETL?"
module: 'dagster_etl'
lesson: '1'
---

# What is ETL?

ETL stands for Extract, Transform, Load and is the process of consolidating data from various upstream sources into a single storage layer. These upstream sources often span multiple systems and data formats, including application databases, third-party services, and raw files. To fully leverage this data, it’s typically best to bring everything into one centralized location, traditionally a data warehouse or data lake, where it can be standardized and made usable.

![ETL](/images/dagster-etl/lesson-1/what-is-etl.png)
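As a minimal, self-contained sketch (the data and table names here are illustrative, not from the course project), the three stages can be expressed with Python's standard library:

```python
import csv
import io
import sqlite3

# Extract: pull raw rows out of an upstream source (here, an in-memory CSV).
raw = "order_id,amount\n1,19.99\n2,5.00\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: enforce types and standardize units before loading.
cleaned = [(int(r["order_id"]), round(float(r["amount"]) * 100)) for r in rows]

# Load: write the standardized rows into the storage layer (here, SQLite).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount_cents INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", cleaned)

total = conn.execute("SELECT SUM(amount_cents) FROM orders").fetchone()[0]
```

Real pipelines swap the in-memory pieces for files, APIs, and a warehouse, but the shape of the three stages stays the same.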

## ETL vs ELT

A quick note on definitions. If you're familiar with ETL, you may have also encountered ELT. The two approaches are very similar, but as the acronym suggests, the key difference is when the transformation happens. In ELT, data is loaded first into the destination system, and transformed afterward.

With the rise of modern data warehouses and lakes that support semi-structured and unstructured data, it's become less critical to transform data into a strict schema before loading. As a result, ETL and ELT are increasingly used interchangeably. Throughout this course, we’ll refer to the process as ETL, even if some examples technically follow the ELT pattern.
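For comparison, here is a minimal ELT sketch with SQLite standing in for the destination (again with illustrative names, not from the course project): the raw records are loaded first, and the transformation happens afterward inside the destination, via SQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Load first: land the raw records as-is, with no upfront schema enforcement.
conn.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?)",
    [("1", "19.99"), ("2", "5.00")],
)

# Transform afterward, inside the destination, using SQL.
conn.execute(
    """
    CREATE TABLE orders AS
    SELECT CAST(order_id AS INTEGER) AS order_id,
           CAST(amount AS REAL) AS amount
    FROM raw_orders
    """
)

total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
```

Keeping an untouched raw table like this is a common reason teams prefer the ELT ordering: the original records remain available for re-transformation later.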

## The Importance of ETL

No matter the industry, ETL is foundational to data systems and applications. When implemented effectively, your data becomes a strategic moat that powers everything from operational dashboards to machine learning pipelines. Whether you're building classic BI reports or cutting-edge AI products, the value lies less in the tools and more in the quality and structure of your data.

Even in emerging areas like large language models (LLMs), it's not the model itself that defines success, but the clean, curated datasets used to generate embeddings and provide meaningful context. In short, great data makes great systems.
31 changes: 31 additions & 0 deletions course/pages/dagster-etl/lesson-1/2-etl-and-dagster.md
@@ -0,0 +1,31 @@
---
title: "Lesson 1: ETL and Dagster"
module: 'dagster_etl'
lesson: '1'
---

# ETL and Dagster

One of Dagster’s core strengths is breaking down data pipelines into their individual assets. This provides full lineage and visibility between different systems.

When visualizing your data stack, it’s often helpful to think of data flowing from left to right. On the far left, you typically find your raw data sources and the ETL processes that bring that data into your platform. This is a logical starting point for building out a data platform: focusing on ETL assets helps you concentrate on the most important datasets and avoid duplicating effort.

![ETL assets 1](/images/dagster-etl/lesson-1/etl-assets-1.png)

This asset-based approach is what makes Dagster particularly well-suited for managing ETL pipelines. Because source assets are so fundamental to building with data, they tend to be reused across multiple projects. For example, if you're ingesting data from an application database, that data may feed into both analytics dashboards and machine learning workflows.

Without an asset-based approach, it's easy to lose sight of the fact that multiple processes rely on the same sources.

![ETL as assets 2](/images/dagster-etl/lesson-1/etl-assets-2.png)

## Source asset granularity

Another reason an asset-based approach tends to work well for ETL is that data sources tend to represent multiple individual entities. Consider a pipeline that ingests data from your application database: you're likely pulling in multiple schemas or tables, each used by specific data applications.

![ETL as assets 1](/images/dagster-etl/lesson-1/etl-as-assets-1.png)

Each of these entities should be tracked as its own asset, so you can associate downstream processes with each one individually. That granularity gives you the ability to monitor, reason about, and recover from failures more effectively.

For example, if one source table fails to ingest, an asset-based approach allows you to quickly understand which downstream assets and applications are impacted. This level of observability and control is what makes asset-based orchestration so powerful, especially in the context of managing critical ETL pipelines.

![ETL as assets 2](/images/dagster-etl/lesson-1/etl-as-assets-2.png)
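This kind of impact analysis can be sketched independently of any orchestrator. Assuming a hypothetical asset graph (the asset names below are illustrative, not from the course project), a breadth-first traversal from the failed source yields every downstream asset affected:

```python
from collections import deque

# Hypothetical asset graph: each asset maps to the assets that depend on it.
downstream = {
    "raw_orders": ["orders_cleaned"],
    "raw_users": ["users_cleaned"],
    "orders_cleaned": ["revenue_dashboard", "churn_model"],
    "users_cleaned": ["churn_model"],
}

def impacted_by(failed: str) -> set[str]:
    """Return every asset downstream of a failed source asset."""
    seen: set[str] = set()
    queue = deque(downstream.get(failed, []))
    while queue:
        asset = queue.popleft()
        if asset not in seen:
            seen.add(asset)
            queue.extend(downstream.get(asset, []))
    return seen
```

In Dagster this graph comes for free from asset definitions, which is exactly why granular, per-table assets make failure triage so much faster.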
19 changes: 19 additions & 0 deletions course/pages/dagster-etl/lesson-1/3-project-preview.md
@@ -0,0 +1,19 @@
---
title: "Lesson 1: Project preview"
module: 'dagster_etl'
lesson: '1'
---

# Project preview

In this course, we’ll focus on ETL and how to manage data ingestion using Dagster. All of the examples will walk through real-world ETL workflows you're likely to encounter, covering a variety of data sources and the unique challenges they present.

By the end of the course, you will:

- Create scheduled and event-driven pipelines to ingest files
- Build a custom resource to pull data from an external API
- Use Embedded ETL (with dlt) to build more resilient applications
- Replicate data across databases
- Refactor your code using Dagster Components for better modularity and reuse

If you get stuck or want to jump ahead, check out the [finished project here on GitHub](https://github.com/dagster-io/project-dagster-university/tree/main/dagster_university/dagster_and_etl/dagster_and_etl/completed).
14 changes: 14 additions & 0 deletions course/pages/dagster-etl/lesson-2/0-requirements.md
@@ -0,0 +1,14 @@
---
title: "Lesson 2: Set up requirements"
module: 'dagster_etl'
lesson: '2'
---

# Set up requirements

This is an interactive class where you will be coding. To follow along, you can either run the project locally on your own machine or work with the code in GitHub Codespaces, where all requirements are already set up.

You only need to follow the setup for one of these options; please skip the other:

- [Local Development](/dagster-etl/lesson-2/1-set-up-local)
- [Github Codespaces](/dagster-etl/lesson-2/2-set-up-codespace)
63 changes: 63 additions & 0 deletions course/pages/dagster-etl/lesson-2/1-set-up-local.md
@@ -0,0 +1,63 @@
---
title: "Lesson 2: Set up local"
module: 'dagster_etl'
lesson: '2'
---

# Set up local

- **Install git.** Refer to the [Git documentation](https://github.com/git-guides/install-git) if you don’t have it installed.
- **Install Python.** Dagster supports Python 3.9 - 3.12.
- **Install a package manager.** To manage the Python packages, we recommend [`uv`](https://docs.astral.sh/uv/), which Dagster uses internally.

---

## Clone the Dagster University project

Run the following to clone the project.

```bash
git clone git@github.com:dagster-io/project-dagster-university.git
```

After cloning the Dagster University project, navigate to the specific course within the repository.

```bash
cd dagster_university/dagster_and_etl
```

## Install the dependencies

**uv**

Install the Python dependencies with [uv](https://docs.astral.sh/uv/):

```bash
uv sync
```

This will create a virtual environment that you can now use.

```bash
source .venv/bin/activate
```

**pip**

Create the virtual environment.

```bash
python3 -m venv .venv
```

Enter the virtual environment.

```bash
source .venv/bin/activate
```

Install the packages.

```bash
pip install -e ".[dev]"
```
55 changes: 55 additions & 0 deletions course/pages/dagster-etl/lesson-2/2-set-up-codespace.md
@@ -0,0 +1,55 @@
---
title: "Lesson 2: Set up with Github Codespaces"
module: 'dagster_etl'
lesson: '2'
---

# Set up with GitHub Codespaces

Instead of setting up a local environment, you can use [GitHub Codespaces](https://github.com/features/codespaces). This allows you to work through this course and edit code in this repository in a cloud-based environment.

## Creating a Github Codespace

There are unique Codespace configurations for the different courses in Dagster University. Be sure to select the correct one when creating a Codespace.

1. While logged into Github, go to the [Codespaces page](https://github.com/codespaces).
2. In the top right, select "New Codespace"
3. Create a Codespace using the following.

| Field | Value |
|--- | --- |
| Repository | dagster-io/project-dagster-university |
| Branch | main |
| Dev container configuration | Dagster & ETL |
| Region | US East |
| Machine type | 2-core |

![Codespace Create](/images/shared/codespaces/codespaces-create.png)

4. Click "Create codespace"

The first time you create a Codespace, it may take a minute for everything to start. You will then be dropped into an interactive editor containing the code for the entire Dagster University repository.

## Working in the Codespace

In the terminal at the bottom of the Codespace IDE, navigate to the specific course.

```bash
cd dagster_university/dagster_and_etl
```

To ensure everything is working, you can launch the Dagster UI.

```bash
dagster dev
```

After Dagster starts running, you will be prompted to open the Dagster UI within your browser. Click "Open in Browser".

![Codespace Launch](/images/shared/codespaces/codespaces-launch.png)

## Stopping your GitHub Codespace

Be sure to stop your Codespace when you are not using it. GitHub provides personal accounts with [120 core hours per month](https://docs.github.com/en/billing/managing-billing-for-your-products/managing-billing-for-github-codespaces/about-billing-for-github-codespaces#monthly-included-storage-and-core-hours-for-personal-accounts).

![Stop Codespace](/images/shared/codespaces/codespaces-stop.png)