This repository is a GitHub Template for building modular, production-grade ML/data pipelines with Dagster using a multi-code-location architecture.
- Mono-repo default: This template assumes a mono-repo structure, but you can adapt it for a multi-repo setup (see below)
- To use: Click "Use this template" on GitHub, then customize code location names, project name, and assets as needed
Note: this is a work-in-progress template, built as I learn the Dagster framework. I've open-sourced it to gather feedback, so please feel free to open Issues and Pull Requests to improve it. Thank you.
- Overview
- Project Structure
- Using This Template
- Quickstart with Python CLI (Recommended)
- Mono-Repo vs Multi-Repo
- Code Location Purposes
- Example Pipeline Flow
- Dagster Concepts Demoed
- Notes
## Overview

This template demonstrates a modular Dagster project with each pipeline stage isolated in its own Python environment. The structure enables clear dependency management and scalable development. The example model is a decoder-only Transformer, but you can adapt the pipeline for any ML/data workflow.
For more on Dagster concepts, see the Dagster Documentation.
## Project Structure

- `/dagster_cloud.yaml`: Defines Dagster code locations for deployment (docs)
- `/workspace.yaml`: Configures the local Dagster workspace and code location loading (docs)
- `/shared_code_location/`: Shared Python code (utilities, tokenization, I/O) importable by all code locations
- `/example_resources/`: Example input and output data for the pipeline
- `/1_etl_code_location/`: Data ingestion, tokenization, vocabulary extraction, and splitting (ETL pipeline)
  - `1_etl/1_ingest/`: Data ingestion assets (uses a Dagster resource for raw data)
  - `1_etl/2_tokenize/`: Tokenization assets
  - `1_etl/3_split_data/`: Data splitting assets
  - `1_etl/4_vocab_from_train_data/`: Vocabulary extraction from the train split
- `/2_model_code_location/`: Model definition and training
- `/3_evaluate_code_location/`: Model evaluation and metrics
- `/4_deploy_code_location/`: Model deployment, packaging, and serving (outputs to `example_resources` via a resource)
Each code location contains:

- `assets.py`: Dagster assets for the location
- `definitions.py`: Dagster `Definitions` object for repository registration (a minimal sketch follows this list)
- `requirements.txt`, `setup.py`, `pyproject.toml`, `setup.cfg`: Python packaging and dependencies
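As a minimal sketch, a location's `definitions.py` usually just collects that location's assets (and any resources) into a `Definitions` object. The asset below is a hypothetical stand-in, not one of the template's real assets:

```python
# definitions.py (illustrative sketch; the asset is a hypothetical stand-in)
from dagster import Definitions, asset


@asset
def raw_documents() -> list[str]:
    """Example asset; in this template, real assets live in assets.py."""
    return ["hello world", "dagster templates"]


defs = Definitions(assets=[raw_documents])
```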
- Click "Use this template" on GitHub.
- Rename code locations and packages as needed (e.g.,
1_etl_code_location
→my_etl_code_location
,1_etl
→my_etl
). - Edit assets and definitions in each code location to fit your workflow.
- Update
workspace.yaml
anddagster_cloud.yaml
to match your code location names and structure. - (Optional) Split into multiple repos: See below for multi-repo setup instructions.
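For example, after the renames above, a mono-repo `workspace.yaml` might look like the sketch below. The location names, paths, and per-location virtualenv layout are illustrative assumptions, not the template's exact contents:

```yaml
# workspace.yaml (illustrative; names and paths are assumptions)
load_from:
  - python_file:
      relative_path: my_etl_code_location/my_etl/definitions.py
      location_name: my_etl
      executable_path: my_etl_code_location/.venv/bin/python
  - python_file:
      relative_path: 2_model_code_location/2_model/definitions.py
      location_name: 2_model
      executable_path: 2_model_code_location/.venv/bin/python
```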
## Quickstart with Python CLI (Recommended)

A Python-based CLI is provided in `scripts/cli.py` to automate setup and common project tasks. The most common commands are:

- `python scripts/cli.py setup`: Set up all Python environments and install dependencies (Dagster installation docs)
- `python scripts/cli.py clean`: Remove all virtual environments and Python caches
- `python scripts/cli.py dev`: Start the Dagster dev webserver for the whole project
- `python scripts/cli.py dev <location>`: Start the dev webserver for a single code location (e.g., `python scripts/cli.py dev 1_etl_code_location`)
- `python scripts/cli.py test`: Run tests in all code locations
- `python scripts/cli.py --help`: Show help and list available commands and options

```bash
python scripts/cli.py setup
source .venv/bin/activate
python scripts/cli.py dev  # Loads all code locations as defined in workspace.yaml
```

See `scripts/cli.py` for more commands and details.
## Mono-Repo vs Multi-Repo

- Mono-repo (default): All code locations live in one repository; `workspace.yaml` and `dagster_cloud.yaml` reference local directories.
- Multi-repo: Each code location can live in a separate repository; update `workspace.yaml` and `dagster_cloud.yaml` to point to remote Python packages or Docker images (see the sketch below).

See `README-CONVERT-TO-MULTI-REPO.md` for a step-by-step guide to splitting this template into multiple repositories.
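For a multi-repo deployment to Dagster+ (Cloud), each repository's `dagster_cloud.yaml` can declare its own location and build settings. The sketch below follows the documented `locations` schema, but the location name, package, and registry are placeholders:

```yaml
# dagster_cloud.yaml (illustrative; name, package, and registry are placeholders)
locations:
  - location_name: my_etl
    code_source:
      package_name: my_etl
    build:
      directory: .
      registry: your-registry.example.com/my_etl
```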
## Code Location Purposes

- `1_etl_code_location`: Data ingestion, preprocessing, tokenization, and splitting
- `2_model_code_location`: Defines and trains an example model
- `3_evaluate_code_location`: Evaluates model performance and computes metrics
- `4_deploy_code_location`: Packages and serves the trained model for production
- `shared_code_location`: Utilities and code shared across locations
## Example Pipeline Flow

- Ingest: Loads example data from a Dagster resource in `example_resources`
- Tokenize: Splits text into tokens using a shared tokenizer
- Split: Splits the data into train/test sets
- Vocab Extraction: Extracts a vocabulary from the train split for use by the model
- Model: Trains a simple model using the extracted vocabulary
- Evaluate: Evaluates the model on the test set
- Deploy: Saves the model as a mock ONNX file, using a Dagster resource in `example_resources` for the output location
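To make the wiring concrete, here is a hedged sketch of how the first two steps could be expressed as dependent assets. The asset names and the whitespace tokenizer are stand-ins for the template's actual code:

```python
# Illustrative ingest -> tokenize hand-off; asset names are hypothetical.
from dagster import asset


@asset
def raw_text() -> list[str]:
    # In the template, ingestion reads from a resource in example_resources.
    return ["the quick brown fox", "jumps over the lazy dog"]


@asset
def tokenized_text(raw_text: list[str]) -> list[list[str]]:
    # Dagster infers the dependency on raw_text from the parameter name.
    return [line.split() for line in raw_text]
```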
## Dagster Concepts Demoed

- Resources: Used for the raw data input and the output directory (see `example_resources`); a sketch follows this list
- Assets: Each pipeline step is a Dagster asset
- Definitions: Assets and resources are registered in `Definitions` objects
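As a minimal sketch of the resource pattern, using Dagster's `ConfigurableResource` API (the resource and asset names and the output path are illustrative):

```python
from pathlib import Path

from dagster import ConfigurableResource, Definitions, asset


class OutputDir(ConfigurableResource):
    """Holds the directory where the deploy step writes artifacts."""

    path: str


@asset
def deployed_model(output_dir: OutputDir) -> None:
    # Write a mock artifact to the configured directory.
    target = Path(output_dir.path)
    target.mkdir(parents=True, exist_ok=True)
    (target / "model.onnx").write_text("mock model")


defs = Definitions(
    assets=[deployed_model],
    resources={"output_dir": OutputDir(path="example_resources/output")},
)
```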
## Notes

- `.egg-info` folders: These are auto-generated by Python packaging tools (setuptools/pip) when code locations are installed in editable mode. They contain metadata and can be safely ignored. Add `*.egg-info/` to your `.gitignore` to avoid checking them into version control.
- Test folders: Each code location has a corresponding test folder (e.g., `1_etl_tests/`, `2_model_tests/`) containing unit tests for that code location. Update or add tests in these folders as you develop your pipeline; a sketch of a minimal asset test follows.
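For instance, a unit test can materialize an asset in-process with `dagster.materialize`. The asset here is a placeholder; in a real test you would import one from the location's `assets.py`:

```python
# test_assets.py (illustrative; replace the placeholder with a real asset)
from dagster import asset, materialize


@asset
def tokenized_text() -> list[str]:
    return ["hello", "world"]


def test_tokenized_text() -> None:
    result = materialize([tokenized_text])
    assert result.success
```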