Dagster ML Multi-Code-Location Template

This repository is a GitHub Template for building modular, production-grade ML/data pipelines with Dagster using a multi-code-location architecture.

Mono-repo default: This template assumes a mono-repo structure, but you can adapt it for a multi-repo setup (see below)
To use: Click "Use this template" on GitHub, then customize code location names, project name, and assets as needed

Note: this is a work-in-progress template as I learn the Dagster framework. I've open-sourced this to gain feedback. Please feel free to create Issues and Pull Requests to improve this template. Thank you.

Overview

This template demonstrates a modular Dagster project with each pipeline stage isolated in its own Python environment. The structure enables clear dependency management and scalable development. The example model is a decode-only Transformer, but you can adapt it for any ML/data workflow.

For more on Dagster concepts, see the Dagster Documentation.

Project Structure

/dagster_cloud.yaml: Defines Dagster code locations for deployment (docs)
/workspace.yaml: Configures local Dagster workspace and code location loading (docs)
/shared_code_location/: Shared Python code (utilities, tokenization, I/O) importable by all code locations
/example_resources/: Example input and output data for the pipeline
/1_etl_code_location/: Data ingestion, tokenization, vocabulary extraction, and splitting (ETL pipeline)
- 1_etl/1_ingest/: Data ingestion assets (uses Dagster resource for raw data)
- 1_etl/2_tokenize/: Tokenization assets
- 1_etl/3_split_data/: Data splitting assets
- 1_etl/4_vocab_from_train_data/: Vocabulary extraction from train split
/2_model_code_location/: Model definition and training
/3_evaluate_code_location/: Model evaluation and metrics
/4_deploy_code_location/: Model deployment, packaging, and serving (outputs to example_resources via resource)

Each code location contains:

assets.py: Dagster assets for the location
definitions.py: Dagster Definitions object for repository registration
requirements.txt, setup.py, pyproject.toml, setup.cfg: Python packaging and dependencies

Using This Template

Click "Use this template" on GitHub.
Rename code locations and packages as needed (e.g., 1_etl_code_location → my_etl_code_location, 1_etl → my_etl).
Edit assets and definitions in each code location to fit your workflow.
Update workspace.yaml and dagster_cloud.yaml to match your code location names and structure.
(Optional) Split into multiple repos: See below for multi-repo setup instructions.

Quickstart with Python CLI (Recommended)

A Python-based CLI is provided in scripts/cli.py to automate setup and common project tasks. The most common commands are:

python scripts/cli.py setup — Set up all Python environments and install dependencies (Dagster installation docs)
python scripts/cli.py clean — Remove all virtual environments and Python caches
python scripts/cli.py dev — Start the Dagster dev webserver for the whole project
python scripts/cli.py dev <location> — Start the dev webserver for a single code location (e.g., python scripts/cli.py dev 1_etl_code_location)
python scripts/cli.py test — Run tests in all code locations
python scripts/cli.py --help — Show help and list available commands and options

Example usage

python scripts/cli.py setup
source .venv/bin/activate
python scripts/cli.py dev  # Loads all code locations as defined in workspace.yaml

See scripts/cli.py for more commands and details.

Mono-Repo vs Multi-Repo

Mono-repo (default): All code locations live in one repository. workspace.yaml and dagster_cloud.yaml reference local directories
Multi-repo: Each code location can be in a separate repository. Update workspace.yaml and dagster_cloud.yaml to point to remote Python packages or Docker images

See README-CONVERT-TO-MULTI-REPO.md for a step-by-step guide to splitting this template into multiple repositories.

Code Location Purposes

1_etl_code_location: Data ingestion, preprocessing, tokenization, and splitting
2_model_code_location: Defines and trains an example model
3_evaluate_code_location: Evaluates model performance and computes metrics
4_deploy_code_location: Packages and serves the trained model for production
shared_code_location: Utilities and code shared across locations

Example Pipeline Flow

Ingest: Loads example data from a Dagster resource in example_resources
Tokenize: Splits text into tokens using a shared tokenizer
Split: Splits data into train/test sets
Vocab Extraction: Extracts vocabulary from the train set for use by the model
Model: Trains a simple model using the extracted vocabulary
Evaluate: Evaluates the model on the test set
Deploy: Saves the model as a mock ONNX file using a Dagster resource for output location in example_resources

Dagster Concepts Demoed

Resources: Used for raw data input and output directory (see example_resources)
Assets: Each pipeline step is a Dagster asset
Definitions: Assets and resources are registered in Definitions objects

Development Notes

.egg-info folders: These are auto-generated by Python packaging tools (setuptools/pip) when installing code locations in editable mode. They contain metadata and can be safely ignored. Add *.egg-info/ to your .gitignore to avoid checking them into version control.
Test folders: Each code location has a corresponding test folder (e.g., 1_etl_tests/, 2_model_tests/, etc.) containing unit tests for that code location. Update or add tests in these folders as you develop your pipeline.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Dagster ML Multi-Code-Location Template

Table of Contents

Overview

Project Structure

Using This Template

Quickstart with Python CLI (Recommended)

Example usage

Mono-Repo vs Multi-Repo

Code Location Purposes

Example Pipeline Flow

Dagster Concepts Demoed

Development Notes

References

About

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
.github		.github
1_etl_code_location		1_etl_code_location
2_model_code_location		2_model_code_location
3_evaluate_code_location		3_evaluate_code_location
4_deploy_code_location		4_deploy_code_location
screenshots		screenshots
scripts		scripts
shared_code_location		shared_code_location
.gitignore		.gitignore
LICENSE		LICENSE
README-CONVERT-TO-MULTI-REPO.md		README-CONVERT-TO-MULTI-REPO.md
README.md		README.md
dagster_cloud.yaml		dagster_cloud.yaml
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
workspace.yaml		workspace.yaml

License

dominicfallows/dagster-ml-multi-code-location-template

Folders and files

Latest commit

History

Repository files navigation

Dagster ML Multi-Code-Location Template

Table of Contents

Overview

Project Structure

Using This Template

Quickstart with Python CLI (Recommended)

Example usage

Mono-Repo vs Multi-Repo

Code Location Purposes

Example Pipeline Flow

Dagster Concepts Demoed

Development Notes

References

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Languages