Releases: ai-dynamo/dynamo
Dynamo Release v0.3.2
Dynamo is a high-performance, low-latency inference framework designed to serve generative AI models—across any framework, architecture, or deployment scale. It's an open source project under the Apache 2.0 license. Dynamo is available for installation via pip wheels and containers from NVIDIA NGC.
As a vendor-neutral serving framework, Dynamo supports multiple large language model (LLM) inference engines to varying degrees:
- NVIDIA TensorRT-LLM
- vLLM
- SGLang
Major Features and Improvements
Engine Support and Routing
- An example standalone router for use outside of Dynamo was added (#1409).
- The new SLA-based planner dynamically manages resource allocation based on service-level objectives (#1420).
- Data-parallel vLLM worker setups are now supported (#1513).
- SGLang support was extended for DeepEP deployments (#1120).
- Clean shutdown is now available for vllm_v1 and SGLang engines (#1562, #1764).
- Experimental support for WideEP with EPLB aggregation and disaggregation is now available for TRTLLM (#1652, #1690).
- Approximate KV cache residency and predicted active KV blocks are now available for improved routing efficiency (#1636, #1638, #1731).
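To illustrate how approximate KV cache residency and predicted active KV blocks can inform a routing decision, here is a toy sketch. It is not Dynamo's router: the `Worker` fields, the `pick_worker` function, the `load_weight` parameter, and the cost formula are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    cached_blocks: set  # approximate set of KV block hashes resident on the worker
    active_blocks: int  # predicted number of active KV blocks (a proxy for load)


def pick_worker(workers, request_blocks, load_weight=1.0):
    """Pick the worker with the lowest estimated cost for a request.

    Hypothetical cost: blocks that must be newly computed
    plus load_weight times the predicted active blocks.
    """
    def cost(worker):
        # Count the leading request blocks already resident on this worker.
        overlap = 0
        for block in request_blocks:
            if block not in worker.cached_blocks:
                break
            overlap += 1
        new_blocks = len(request_blocks) - overlap
        return new_blocks + load_weight * worker.active_blocks

    return min(workers, key=cost)


workers = [
    Worker("a", {"h1", "h2", "h3"}, active_blocks=10),
    Worker("b", set(), active_blocks=2),
]
request = ["h1", "h2", "h3", "h4"]
print(pick_worker(workers, request).name)
```

In this sketch a heavily loaded worker can lose to an idle one even when it holds most of the prefix; lowering `load_weight` shifts the balance toward cache reuse.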
Observability and Metrics
- Native DCGM and Prometheus integration enables hardware metrics collection and export. Optional Grafana dashboards are provided (#1488, #1701, #1788).
- New Grafana dashboards offer composite software and hardware system visibility (#1788).
- Batch `/completions` endpoint and speculative decoding metrics are now supported for vLLM (#1626, #1549).
Deployment, Kubernetes, and CLI
- The Kubernetes operator now supports custom entrypoints, command overrides, and simplified graph deployments (#1396, #1708, #1877, #1893).
- Example manifests for multimodal and minimal deployments were added (#1836, #1872).
- Graph Helm chart logic, resource requests, and health probes were improved (#1877, #1888).
- Two new Helm charts, dynamo-platform and dynamo-crds, are introduced in this release. They enable modular and robust Kubernetes deployments for a variety of topologies and operational requirements.
- The `dynamo-run` command line interface now supports the `--version` flag and improved error handling and validation (#1596, #1674, #1623).
- Docker and Kubernetes deployment workflows were streamlined. Helm charts and container images were improved (#1742, #1796, #1840, #1841).
Developer Experience
- Embedding request handling was improved with frontend tokenization (#1494).
- OpenAI API request validation is now available (#1674).
- Batch embedding and parallel tokenization improve efficiency for batch inference and embedding (#1657).
- The `/responses` endpoint and additional API features were added (#1694).
Bug Fixes
- Issues related to GPU resource specifications in deployments, container builds, and runtime were fixed (#1826, #1792, #1546).
- Helm chart logic, resource requests, and health probes were corrected (#1877, #1893).
- Error handling and model loading were improved for multimodal and distributed deployments (#1545).
- Metrics publishing and logging were fixed for vLLM, SGLang, and OpenAI endpoints (#1864, #1649, #1639).
- Process cleanup issues were resolved in tests (#1801).
Documentation
- Documentation updates include new guides for Ray setup, architecture diagrams, and deployment modes (#1947, #1697).
- Benchmarking, troubleshooting, and advanced usage scenario documentation was enhanced.
- Notes were added to deprecate outdated connectors (#1964, #1959).
Build, CI, and Test
- Dependency upgrades include protobuf, nats, and etcd (#1876, #1744).
- CI coverage now includes GPU-based and multi-engine tests.
- Container builds now use distroless images for improved security and efficiency (#1570, #1569).
- Fault tolerance tests were added (#1444).
Known Issues
- KVBM is supported only with Python 3.12.
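Because KVBM is supported only with Python 3.12, a setup script may want to check the interpreter version before enabling it. A minimal sketch, assuming a hypothetical helper named `kvbm_python_supported` (it is not part of Dynamo):

```python
import sys

# (major, minor) Python version that the release notes list as supported for KVBM.
KVBM_SUPPORTED_PYTHON = (3, 12)


def kvbm_python_supported(version_info=None):
    """Return True if the (major, minor) version matches the supported one.

    Hypothetical helper for illustration; it is not a Dynamo API.
    """
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == KVBM_SUPPORTED_PYTHON


if __name__ == "__main__":
    if kvbm_python_supported():
        print("Python version is supported for KVBM")
    else:
        print("KVBM requires Python 3.12; found %d.%d" % sys.version_info[:2])
```

The check compares only the major and minor version, so any 3.12 patch release passes.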
Release Assets
Python Wheels:
Rust Crates:
Containers:
- nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.3.2
- nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.3.2
- nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.3.2
- nvcr.io/nvidia/ai-dynamo/kubernetes-operator:0.3.2
Helm Charts:
Contributors
Thank you to all contributors for this release. For a full list, refer to the changelog.
Dynamo Release v0.3.1
Dynamo is an open source project under the Apache 2.0 license. The primary distribution is done through pip wheels with minimal binary size. The ai-dynamo GitHub organization hosts two repositories: Dynamo and NIXL. Dynamo is designed as the next-generation inference server, building upon the foundation of NVIDIA® Triton Inference Server™. While Triton focuses on single-node inference deployments, we're integrating its robust capabilities into Dynamo over the next several months. We'll maintain support for Triton while providing a clear migration path for existing users once Dynamo achieves feature parity.
As a vendor-neutral serving framework, Dynamo supports multiple large language model (LLM) inference engines to varying degrees:
- NVIDIA TensorRT-LLM
- vLLM
- SGLang
Dynamo v0.3.1 features:
- Functional DeepSeek R1 disaggregated serving with wide EP using SGLang
- Functional EPD disaggregation with video model (Llava video 7B)
- Proof of concept inference gateway support
- Prebuilt Dynamo + vLLM container (we plan to release these prebuilt containers in the coming days)
- Amazon Linux support
Future plans
Dynamo Roadmap
Known Issues
- KVBM is supported only with Python 3.12.
What's Changed
🚀 Features & Improvements
- feat: expose estimated kv cache hit in dynamo-run by @tedzhouhk in #1246
- feat: KVBM async Python bindings and Layer class by @kthui in #1141
- feat: add critical task execution handle by @ryanolson in #1268
- feat: Initial Granite support by @grahamking in #1271
- feat: Restructure kv manager block registration by @jthomson04 in #1093
- feat: Publish events and metrics when using kv routing by @tanmayv25 in #1262
- feat(dynamo-run): Use llama.cpp as the default engine for GGUF by @grahamking in #1276
- feat: populate default image name by @biswapanda in #1255
- feat: flatten out dynamo cloud helm chart by @julienmancuso in #1258
- refactor: Refactor kv event publishers by @jthomson04 in #1287
- refactor: rename KvMetricsPublisher to WorkerMetricsPublisher by @alec-flowers in #1284
- feat: all blocks cleared event by @jain-ria in #1279
- perf: Create default sampling params only once during initialization by @krishung5 in #1294
- feat: expose router configurations to dynamo-run by @tedzhouhk in #1259
- feat: Make llama.cpp Gnu OpenMP dependency optional by @grahamking in #1331
- feat: set env variables in Dynamo deployments from secrets by @hhzhang16 in #1325
- feat: Add DSR1 configurations by @ptarasiewiczNV in #1298
- feat: add more metrics to rust frontend by @tedzhouhk in #1315
- feat: Enable disagg support in trtllm standalone script by @tanmayv25 in #1355
- feat: Integrate KVBM with `CriticalTaskHandle` by @jthomson04 in #1321
- feat: add implementation for embeddings by @t-ob in #1290
- feat: refactor docker registry secret management in operator by @julienmancuso in #1337
- feat: set model specific prompt templates in the multimodal config files, add documentation for multimodal example deployment by @hhzhang16 in #1366
- feat: add result of fluid experiment by @julienmancuso in #1379
- feat: Update container with better EFA/RDMA support by @aranadive in #1333
- feat: Support larger Gemma 3 models by @grahamking in #1359
- refactor: Rename CompletionRequest to NvCreateCompletionRequest by @paulhendricks in #1383
- feat: decouple bento dependency by @biswapanda in #1266
- feat: data synthesizer based on prefix statistics by @PeaBrane in #1087
- feat: introduce abstract classes to dynamo services by @mohammedabdulwahhab in #924
- feat: KVBM dynamo runtime + event manger by @oandreeva-nv in #1195
- feat: Utilities for distributed leader-worker barriers by @jthomson04 in #1429
- feat: Restructure the KVBM WriteTo trait by @jthomson04 in #1363
- feat: KVBM prometheus monitoring by @jthomson04 in #1211
- feat: Improved offload queueing and block eviction ordering by @jthomson04 in #1425
- feat: generate random texts from hashes using lorem ipsum by @PeaBrane in #1458
- refactor: use comment filed in annotated to pass metric-related information by @tedzhouhk in #1385
- feat: generalize VLM embedding extraction by @hhzhang16 in #1388
- refactor: move kv store to runtime by @ryanolson in #1459
- feat: add endpoint to clear all kv blocks in vllm v1 by @jain-ria in #1384
- feat: Video support with Dynamo by @indrajit96 in #1443
- feat: add build --push command by @hhzhang16 in #1485
- feat: FT downed worker instance tracking and skipping by @kthui in #1424
- feat: add dynamo pipeline example using inf-gw by @biswapanda in #1512
- refactor: Log subprocess stderr as WARN (#1563) by @rmccorm4 in #1574
🐛 Bug Fixes
- fix: cherry-pick of attributions from 0.2.1 release branch by @saturley-hall in #1267
- fix: resolve local dev container build issues by @t-ob in #1269
- fix: Renamed event publisher classes and configuration by @alec-flowers in #1273
- fix: Only check model name on etcd-registered endpoints by @jthomson04 in #1263
- fix: Fix mypy errors on trtllm examples by @tanmayv25 in #1277
- fix: remove sglang hash for pyproject by @ishandhanani in #1281
- fix: copy workspace as part of ci-min stage by @nv-anants in #1291
- fix: resources naming by @biswapanda in #1302
- fix: wait until probing on vllm examples to prevent timeouts by @mohammedabdulwahhab in #1293
- fix: Fix vllm v0 None*int error when not using kv aware router by @tedzhouhk in #1304
- fix: Update breaking change to enable_overlap_scheduler field from TRTLLM commit b4e5df0e by @rmccorm4 in #1310
- fix: make imagePullSecrets optional when installing dynamo cloud by @julienmancuso in #1324
- fix: Properly set VLLM_NIXL_SIDE_CHANNEL_HOST in multi-node by @ptarasiewiczNV in #1327
- fix: Allow building only llamacpp or only mistralrs engine. by @grahamking in #1328
- fix: allow custom annotations in api-store service by @julienmancuso in #1329
- fix: Flatten pytorch_backend_config section to address breaking change to trtllm config by @rmccorm4 in #1326
- fix: update profile script by @tedzhouhk in #1336
- fix: Use min of max tokens or context length by @abrarshivani in #1322
- fix: add ingress to llm example by @hhzhang16 in #1349
- fix(dynamo-run): For internal comms use a random endpoint instead of hard coded by @grahamking in #1335
- fix: dockerhub registry issues in dynamo operator by @mohammedabdulwahhab in #1350
- fix: add speculative decoding config to dynamo serve + trtllm by @richardhuo-nv in #1356
- fix: prefillqueue stream name in load-planner by @tedzhouhk in #1377
- fix: take into account number of workers from config by @julienmancuso in #1365
- fix: Fix link path for dynamo_run doc by @krishung5 in #1382
- fix: fix dynamo cloud helm chart by @julienmancuso in #1376
- fix: mismatch GAP and PA version by @tedzhouhk in #1386
- fix: remove unused arg in planner by @tedzhouhk in #1390
- fix: Use Ru...
Dynamo Release v0.3.0
Dynamo is an open source project under the Apache 2.0 license. The primary distribution is done through pip wheels with minimal binary size. The ai-dynamo GitHub organization hosts two repositories: Dynamo and NIXL. Dynamo is designed as the next-generation inference server, building upon the foundation of NVIDIA® Triton Inference Server™. While Triton focuses on single-node inference deployments, we're integrating its robust capabilities into Dynamo over the next several months. We'll maintain support for Triton while providing a clear migration path for existing users once Dynamo achieves feature parity.
As a vendor-neutral serving framework, Dynamo supports multiple large language model (LLM) inference engines to varying degrees:
- NVIDIA TensorRT-LLM
- vLLM
- SGLang
Dynamo v0.3.0 features:
- Dynamo run with KV routing and multiple model support! (guide)
- vLLM v1 engine support! (example)
- SGLang with DP attention! (example)
- SLA-based planner! (guide)
- Optimized embedding transfer for multi-modal! (example)
- Dynamo deploy update command! (guide)
- Model caching using Fluid! (guide)
- Fluxcd guide to managing custom resources (guide)
Future plans
Dynamo Roadmap
Known Issues
- KVBM is supported only with Python 3.12.
What's Changed
🚀 Features & Improvements
- feat: kv block manager by @ryanolson in #965
- feat(sglang): disaggregated support by @ishandhanani in #976
- feat(dynamo-run): Print HTTP routes on startup by @grahamking in #1010
- feat(dynamo-run): KV-aware routing by @grahamking in #1064
- feat: KV Cache Manager block offloading by @jthomson04 in #1030
- feat: Add ignore_eos/nvext support for legacy completions by @rmccorm4 in #1080
- feat: Use existing Tokio runtime in components by @abrarshivani in #941
- feat: add vLLM V1 PD disagg example by @ptarasiewiczNV in #1013
- feat: Add OpenAI Embeddings interface in rust lib by @t-ob in #1110
- feat: add update deployment to dynamo deploy API and CLI by @hhzhang16 in #1048
- feat: KV Block Manager Python bindings by @kthui in #1022
- feat: Add LWS to Dynamo Operator by @nvrohanv in #998
- feat: Add support for SSD offloading in block manager by @jthomson04 in #1115
- feat: Support multiple models on single ingress node by @grahamking in #1127
- feat: adding outer dimension to isolate k/v blocks by @ryanolson in #1126
- feat: SLA Profiling and Recommending Parallelization Mapping by @tedzhouhk in #1114
- feat: vllm mock workers, Rusty skeleton by @PeaBrane in #1033
- feat: rename dynamo decorator by @biswapanda in #1133
- feat(dynamo-run): Allow setting context-length by @grahamking in #1157
- feat: Various KVBM improvements by @jthomson04 in #1134
- feat: Add TTFT and ITL Interpolation to Profiling Script by @tedzhouhk in #1159
- feat(dynamo-run): Allow setting KV cache block size by @grahamking in #1175
- feat: Add standalone script for TRTLLM integration into dynamo-run by @tanmayv25 in #1162
- feat: adding arena allocator for storage objects by @ryanolson in #1178
- feat: support k8s target in dynamo deploy command by @hhzhang16 in #1104
- feat: add dynamo operator overview doc by @julienmancuso in #688
- feat: add dynamo-run example for vllm v0 by @tedzhouhk in #1186
- feat: kvbm offload fixes and tests by @jthomson04 in #1191
- feat: Add metrics and event publishers by @tanmayv25 in #1192
- feat: NIXL Based RDMA Support w/ Multimodal Example by @whoisj in #1060
- feat: Add Hello World Multinode example by @kylehh in #624
- feat(sglang): add dockerfile/pyproject toml entry + steps to run dsr1 disagg by @ishandhanani in #1193
- feat(http): add health check endpoint by @ishandhanani in #1037
- feat: document model caching using Fluid by @julienmancuso in #1218
- feat: portable dynamo build by @biswapanda in #1215
- feat: fluxcd guide to managing custom resources by @mohammedabdulwahhab in #1220
- feat: Enable dynamo-run out=trtllm by @tanmayv25 in #1223
- feat(dynamo-llm): Remove bring-your-own-engine by @grahamking in #1216
- feat: remove bento cloud deploy target, set deployment target to kubernetes by default by @hhzhang16 in #1247
- feat: Support OAI frontend format and add async image handing by @krishung5 in #1214
- feat: add KV Event Publishing to vLLM v1 by @alec-flowers in #1181
🐛 Bug Fixes
- fix(bindings): serve_endpoint no longer takes a lease by @grahamking in #1014
- fix(deps): sglang install must be done manually by @ishandhanani in #1019
- fix: dynamo_serve and scv config inject/get by @tedzhouhk in #1017
- fix: pin click dependency to old releases by @nv-anants in #1042
- fix: use correct lease id for kv router by @tedzhouhk in #1035
- fix: update nixl setup for arm builds by @nv-anants in #1061
- fix: downgrade CUDA image use to work around PyNccl timeout in vLLM Ray use case by @GuanLuo in #1065
- fix: read 'workers' to set deployments 'replicas' by @julienmancuso in #1040
- fix: add maxage to nats stream by @wxsms in #1053
- fix: fix broken links in deployment docs by @biswapanda in #1084
- fix: Fix default RouterMode value by @grahamking in #1092
- fix: planner fixes by @mohammedabdulwahhab in #1055
- fix: use resource and workers hints from decorators and service args by @biswapanda in #1044
- fix: add planner path in devcontainer by @biswapanda in #1113
- fix: remove lib.real from LD_LIBRARY_PATH by @alec-flowers in #1117
- fix(sglang): allow for `disaggregation_bootstrap_port` for multinode deployment by @ishandhanani in #1119
- fix: Disable block manager by default in Python bindings by @kthui in #1128
- fix: Incrementally decode token to reduce the overhead from Processor by @tanmayv25 in #1129
- fix: set gpus as strings in config files by @julienmancuso in #1123
- fix: Fix the protocol in the example by @tanmayv25 in #1146
- fix: register model after engine load by @nnshah1 in #1145
- fix: make component type a simple string by @mohammedabdulwahhab in #1144
- fix(llmctl): Use ModelWatcher instead of direct etcd operations by @grahamking in #1150
- fix(dynamo-run): Don't exit interactive chat on error by @grahamking in #1155
- fix(llmctl): Add back the model_type in remove by @grahamking in #1158
- fix: Enable Dynamo HTTP servers to run on IPv6-only hosts by @jmswen in #1166
- fix: typo in planner doc and log by @tedzhouhk in #1165
- fix: Fix race condition in kv_router unit test by @grahamk...
Dynamo Release v0.2.1
Dynamo is an open source project under the Apache 2.0 license. The primary distribution is through pip wheels with minimal binary size. The ai-dynamo GitHub organization hosts two repositories: Dynamo and NIXL. Dynamo is designed as the next-generation inference server, building upon the foundations of the Triton Inference Server. While Triton focuses on single-node inference deployments, we are committed to integrating its robust single-node capabilities into Dynamo within the next several months. We will maintain ongoing support for Triton while ensuring a seamless migration path for existing users to Dynamo once feature parity is achieved. As a vendor-agnostic serving framework, Dynamo supports multiple LLM inference engines, including TensorRT-LLM, vLLM, and SGLang, with varying degrees of maturity and support.
Dynamo v0.2.1 features:
- KV Block Manager! (intro)
- Improved vLLM performance by avoiding re-initializing sampling params
- SGLang support! (README.md)
- Multi-Modal E/P/D Disaggregation! (README.md)
- Leader Worker Set K8s!
- Qwen3, Gemma3 and Llama4 in Dynamo Run!
Future plans
Known Issues
- Benchmark guides are still being validated on public cloud instances (GCP / AWS)
What's Changed
🚀 Features & Improvements
- feat: Qwen3, Gemma3 and Llama4 support by @grahamking in #1002
- feat: Remove vllm and sglang from cargo build command by @hhzhang16 in #1003
- feat: deploy planner in operator by @julienmancuso in #921
- refactor: use primary lease + self-contained graceful shutdown trigged by SIGINT/SIGTERM by @tedzhouhk in #1001
- feat: Add AWS EFA support by @aranadive in #999
- feat(sglang): aggregated support by @ishandhanani in #937
- feat: decoupling dynamo serve by @biswapanda in #905
- feat: allow adding auth to etcd by @wxsms in #980
🐛 Bug Fixes
- fix: Extract tokenizer from GGUF for Qwen3 and Gemma3 arch by @grahamking in #1011
Other Changes
- docs: add docs for dynamo build by @mohammedabdulwahhab in #714
- docs: fix typo in disagg perf tuning guide by @tedzhouhk in #859
- feat: Adding completions endpoint support to `dynamo run in=http` by @oandreeva-nv in #777
- docs: update editable install to include planner by @nv-anants in #860
- chore: add docs around how runtime reconfiguration works by @ishandhanani in #861
- feat: replace async queue with async iter and double decorator by @biswapanda in #858
- docs: fix typo in planner documentation by @AndyDai-nv in #864
- feat: Add unified x86 / aarch64 (ARM) build for VLLM image by @rmccorm4 in #839
- refactor: move logging config to runtime by @ishandhanani in #863
- feat: support multiple endpoints by @biswapanda in #857
- build: Add Olga as a Rust reviewer by @grahamking in #872
- fix: change the processor number to 5 to reduce the tokenization bottleneck by @richardhuo-nv in #865
- refactor: change trtllm example kv routing use python bindings | deal with trtllm partial blocks | trtllm event change by @ziqif-nv in #866
- fix: change environment variable to support local mount by @nnshah1 in #885
- fix: manylinux tag in ai-dynamo-vllm wheel by @nv-anants in #884
- chore: Split PushRouter from Client by @grahamking in #817
- chore: add fastapi depenedncy in pyproject.toml by @biswapanda in #888
- docs: update pythonpath for starting planner by @tedzhouhk in #890
- fix(http): Make ModelDeploymentCard optional by @grahamking in #891
- feat: Add request template support for default inference parameters by @abrarshivani in #841
- fix: endless map in nixl.py by @wxsms in #852
- feat: remove dynamoComponentRequest CRD by @julienmancuso in #856
- docs: Fixes to dynamo deploy docs by @mohammedabdulwahhab in #902
- feat: label component CR for planner by @julienmancuso in #901
- feat: allow users to add env vars to dynamo deployment by @hhzhang16 in #862
- chore: unified logging, added informative warnings for KV router example by @PeaBrane in #912
- docs: add an example on how to use `--service-name` flag to spin up a standalone service by @ishandhanani in #915
- fix: trtllm example by @biswapanda in #909
- fix: add dedicated llmapi config for trtllm disagg kv routing example by @ziqif-nv in #916
- chore: reduce code repetition in processor by @PeaBrane in #919
- feat: Support hf:// URLs in dynamo run by @abrarshivani in #917
- feat: Add check for version info in container build script by @abrarshivani in #774
- docs: update examples in document by @biswapanda in #897
- chore(dynamo-llm): Move the pre-processor to ingress side by @grahamking in #903
- fix: default docker username and password are empty by @hhzhang16 in #926
- feat: Add multimodal example with aggregated serving by @krishung5 in #709
- docs: Add multi-node TRTLLM steps to README by @rmccorm4 in #930
- feat: Update to support completion endpoint in TRTLLM by @tanmayv25 in #837
- fix: use primary lease for NixlMetadataStore by @tedzhouhk in #928
- chore: merge in support matrix and nixl commit hash by @saturley-hall in #944
- feat: allow to set http port by @julienmancuso in #931
- feat: automatically reserve port for assigning port number to endpoint and pubsub by @richardhuo-nv in #946
- feat: multi-thread (via asyncio.task) in processor by @tedzhouhk in #904
- fix: remove requirement for istio in doc by @julienmancuso in #950
- feat: dynamo-run <-> python interop by @grahamking in #934
- refactor: refactor dynamo deploy subfolder by @hhzhang16 in #927
- ci: lock cuda at 12.8 by @hhzhang16 in #957
- chore: Two-line copyright check by @grahamking in #958
- chore: Add John as Codeowner by @jthomson04 in #962
- feat(dynamo-run): vllm and sglang subprocess engines by @grahamking in #954
- docs: add drt doc by @tedzhouhk in #951
- feat: Migrate NATS Queue to Rust (#669) by @jthomson04 in #961
- fix: create k8s service for main component only by @julienmancuso in #953
- fix: fix missing num_remote_prefill_groups in vLLM patch by @ptarasiewiczNV in #981
- fix: Create default sampling params only once during initialization by @ptarasiewiczNV in #982
- chore: Remove embedded Python vllm and sglang engines by @grahamking in #966
- fix: increase ulimit nofile for container/run.sh by @ajcasagrande in #969
- docs: add fix for Zsh globbing error with `pip install .[all]` by @Chasing1020 in #945
- build: Cleans the TensorRTLLM + Dynamo container build by @tanmayv25 in #968
- feat: add interface for deployment manager by @biswapanda in #987
- fix: Check nvext for ignore_eos and set min_tokens for benchmark consistency by @rmccorm4 in #988
- fix: Fix vllm/sglang engine model name if using HF repo by @grahamking in #986
- feat: Add multimodal example with disaggregated serving by @krishung5 in #811
- feat: clea...
Dynamo Release v0.2.0
Dynamo is an open source project under the Apache 2.0 license. The primary distribution is through pip wheels with minimal binary size. The ai-dynamo GitHub organization hosts two repositories: Dynamo and NIXL. Dynamo is designed as the next-generation inference server, building upon the foundations of the Triton Inference Server. While Triton focuses on single-node inference deployments, we are committed to integrating its robust single-node capabilities into Dynamo within the next several months. We will maintain ongoing support for Triton while ensuring a seamless migration path for existing users to Dynamo once feature parity is achieved. As a vendor-agnostic serving framework, Dynamo supports multiple LLM inference engines, including TensorRT-LLM, vLLM, and SGLang, with varying degrees of maturity and support.
Dynamo v0.2.0 features:
- GB200 support with ARM builds (Note: currently requires a container build)
- Planner - new experimental support for spinning workers up and down based on load
- Improved K8s deployment workflow
- Installation wizard to enable easy configuration of Dynamo on your Kubernetes cluster
- CLI to manage your operator-based deployments
- Consolidate Custom Resources for Dynamo Deployments
- Documentation improvements (including Minikube guide to installing Dynamo Platform)
Future plans
Known Issues
- Benchmark guides are still being validated on public cloud instances (GCP / AWS)
- Benchmarks on internal clusters show a 15% degradation from results displayed in summary graphs for multi-node 70B and are being investigated.
- TensorRT-LLM examples do not currently work in this release; they are being fixed in main.
What's Changed
- fix: fix max_local_prefill_length not being printed out in disagg router log by @tedzhouhk in #628
- docs: Add instructions to install git lfs by @tanmayv25 in #627
- fix: add DYNAMO_HOME env var to vLLM docker image by @nv-anants in #629
- fix: Account for Metrics.decode() changes by @rmccorm4 in #619
- fix: Update test_report by @pvijayakrish in #641
- fix: serviceArgs in config was not getting set for workers by @mohammedabdulwahhab in #640
- fix: adding conversion to string for notif id comparison by @nnshah1 in #638
- docs: Add documentation for UCX KV cache transfer in TRTLLM by @tanmayv25 in #639
- build: Define UCX env var to use NVLink when available by @tanmayv25 in #631
- feat: ETCD prefix watcher + python binding + runtime reconfiguration for router and disagg router by @tedzhouhk in #581
- fix: dynamo build should work with link syntax by @mohammedabdulwahhab in #646
- fix: change trtllm kv_router default block_size to 32 by @ziqif-nv in #642
- fix: signal handlers to clean up zombie vllm processes by @ishandhanani in #545
- feat: add .devcontainer based off images in container/ by @alec-flowers in #497
- fix: devcontainer mounts and vllm c api by @alec-flowers in #663
- fix: deploy command should support passing config by @mohammedabdulwahhab in #626
- feat(dynamo-run): improve available engines list in --help by @XueSongTap in #664
- feat: add dynamoDeployment CR finalizer by @julienmancuso in #623
- fix: set correct parent_hash for each kv block when publish kv events by @ziqif-nv in #671
- docs: Use the same term for dynamo base image across code snippets and text by @hutm in #670
- docs: move deploy docs to docs/guides by @hhzhang16 in #674
- fix: frontend and http server signal handling by @alec-flowers in #677
- fix: check for resource in pipeline helm chart by @julienmancuso in #687
- fix: ensure `VLLM_LOGGING_LEVEL=xyz` follows `DYN_LOG=xyz` by @ishandhanani in #692
- feat: replace dynamo server with dynamo cloud by @hhzhang16 in #696
- feat: base Dynamo docker image improvements and fixes by @hhzhang16 in #658
- fix: fix pipeline helm chart by @julienmancuso in #698
- docs: Benchmarking guide updates by @kthui in #678
- feat: bump vLLM version to v0.8.4 by @ptarasiewiczNV in #690
- chore: Replace TRD->Dynamo in llmctl help output by @rmccorm4 in #710
- fix: allow for an empty dynamo config file by @hhzhang16 in #712
- fix: cli version by @ishandhanani in #716
- docs: Remove outdated python-wheels directory reference by @rmccorm4 in #719
- fix: direct clients vs dependancies by @ishandhanani in #704
- feat: adding dynamo-tokens crate by @ryanolson in #718
- fix: bump GAP to r25.03 by @tedzhouhk in #724
- feat: make ingress configurable in operator by @julienmancuso in #717
- feat: configure logger with detail info by @tlipoca9 in #654
- feat: Add disagg skeleton example by @kylehh in #683
- fix: dynamo deploy helm chart cleanup by @mohammedabdulwahhab in #727
- docs: add dedicated minikube guide by @mohammedabdulwahhab in #735
- feat(dynamo-engine-vllm): vllm 0.8.X support by @grahamking in #728
- feat: gracefully shutdown endpoint by revoking etcd lease + python binding by @tedzhouhk in #730
- fix: Add missing deps for '--framework none' build by @rmccorm4 in #738
- chore: Remove TRT-LLM C++ engine in favor of Python one by @grahamking in #747
- docs: Support matrix post release. by @pvijayakrish in #736
- docs: add aggregated deployment guide for multi-node sized model by @GuanLuo in #713
- feat: make the model name to be the same as the HF repo name for dynamo-run by @AndyDai-nv in #749
- feat: add additional packages to log filters by @abrarshivani in #752
- chore(dynamo-run): Fix echo_core for EOS tokens by @grahamking in #759
- feat: add custom lease to worker components by @ishandhanani in #748
- chore: Add roadmap to main README.md by @harryskim in #763
- feat: MLA disaggregation support to vLLM patch by @ptarasiewiczNV in #745
- fix: Fix cancellation flow in python component graph by @pankajroark in #765
- fix: give the user ownership permissions of /opt/dynamo/venv by @hhzhang16 in #767
- docs: deployment docs improvements by @hhzhang16 in #753
- feat: add option to configure separate docker registry for pipelines docker images by @julienmancuso in #744
- chore: Update bug report to use dynamo env for collecting environment information by @nv-tusharma in #558
- docs: R1 disaggregation guide by @GuanLuo in #720
- feat: allow to CRUD dynamo pipelines by @julienmancuso in #761
- docs: Custom Backend/Worker Guide by @rmccorm4 in #608
- chore: fix arg name in example by @CormickKneey in #770
- build: add rust binaries in manylinux image by @nv-anants in #783
- feat: remove bento/yatai references by @julienmancuso in #782
- docs: add note to use release branch examples by @nv-anants in #793
- feat: Add log verbosity level flag to dynamo-run cli by @abrarshivani in #780
- feat: rename operator CRDs by @julienmancuso in #795
- feat: Add linux aarch64 support to dynamo-run build by @rmccorm4 in #802
- fix: Update TRTLLM version and fix disagg workflow by @tanmayv25 in #804
- chore: Increase sleep tim...
Dynamo Release v0.1.1
Dynamo is an open source project under the Apache 2.0 license. The primary distribution is through pip wheels with minimal binary size. The ai-dynamo GitHub organization hosts two repositories: Dynamo and NIXL. Dynamo is designed as the next-generation inference server, building upon the foundations of the Triton Inference Server. While Triton focuses on single-node inference deployments, we are committed to integrating its robust single-node capabilities into Dynamo within the next several months. We will maintain ongoing support for Triton while ensuring a seamless migration path for existing users to Dynamo once feature parity is achieved. As a vendor-agnostic serving framework, Dynamo supports multiple LLM inference engines, including TensorRT-LLM, vLLM, and SGLang, with varying degrees of maturity and support.
Dynamo v0.1.1 features:
- Benchmarking guides for Single and Multi-Node Disaggregation on H100 (vLLM)
- TensorRT-LLM support for KV Aware Routing
- TensorRT-LLM support for Disaggregation
- ManyLinux and Ubuntu 22.04 Support for wheels and crates
- Unified logging for Python and Rust
Future plans
- Instructions for reproducing benchmark guides on GCP and AWS
- KV Cache Manager as a standalone repository under the ai-dynamo organization. This release will provide functionality for storing and evicting KV cache across multiple memory tiers, including GPU, system memory, local SSD, and object storage.
- Searchable user guides and documentation
- Multi-node instances for large models
- Initial Planner version supporting dynamic scaling of prefill and decode (P/D) workers. We will include an early version of the Dynamo Planner, another core component. This initial release will feature heuristic-based dynamic allocation of GPU workers between prefill and decode tasks, as well as model and fleet configuration adjustments based on user traffic patterns. Our vision is to evolve the Planner into a reinforcement learning platform, which will allow users to define objectives and then tune and optimize performance policies automatically based on system feedback.
- vLLM 1.0 support with NIXL and KV Cache Events
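The multi-tier KV cache item in the list above (storing and evicting blocks across GPU, system memory, and slower tiers) can be illustrated with a small sketch. This is not the Dynamo KV Cache Manager: the class `TieredKVCache`, its methods, and the least-recently-used demotion policy are all assumptions made for illustration.

```python
from collections import OrderedDict


class TieredKVCache:
    """Hypothetical tiered cache: one LRU map per tier, fastest tier first."""

    def __init__(self, capacities):
        # capacities[i] is the number of blocks tier i can hold,
        # e.g. [gpu_blocks, host_blocks, ssd_blocks] (illustrative only).
        self.capacities = list(capacities)
        self.tiers = [OrderedDict() for _ in self.capacities]

    def put(self, block_hash, block, tier=0):
        if tier >= len(self.tiers):
            return  # fell off the slowest tier: the block is dropped
        store = self.tiers[tier]
        store[block_hash] = block
        store.move_to_end(block_hash)  # mark as most recently used
        while len(store) > self.capacities[tier]:
            # Demote the least recently used block to the next tier.
            old_hash, old_block = store.popitem(last=False)
            self.put(old_hash, old_block, tier + 1)

    def get(self, block_hash):
        for store in self.tiers:
            if block_hash in store:
                block = store.pop(block_hash)
                self.put(block_hash, block, 0)  # promote to the fastest tier
                return block
        return None  # cache miss


cache = TieredKVCache([2, 2])
for name in ["a", "b", "c", "d", "e"]:
    cache.put(name, f"kv-{name}")
```

After the five inserts, the two most recent blocks sit in the first tier, the two before them in the second, and the oldest has been evicted entirely. A real manager would also have to account for transfer cost between tiers, which this sketch ignores.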
Known Issues
- Benchmark guides are still being validated on public cloud instances (GCP / AWS)
- Benchmarks on internal clusters show a 15% degradation from the results displayed in the summary graphs for multi-node 70B. This is being investigated.
What's Changed
- docs: Benchmarking guide updates (#678) by @kthui in #699
- docs: Update support matrix by @pvijayakrish in #691
- fix: change trtllm kv_router default block_size to 32 (#642) by @tanmayv25 in #694
- fix: set correct parent_hash for each kv block when publish kv events by @tanmayv25 in #693
- fix: Remove kv connector from agg config by @ptarasiewiczNV in #655
- fix: Account for Metrics.decode() changes (#619) by @rmccorm4 in #619
- fix: update to match latest nixl notifications as bytes by @nnshah1 in #645
- docs: Update support matrix by @pvijayakrish in #633
- docs: Add instructions to install git lfs (#627) by @tanmayv25 in #627
- fix: add DYNAMO_HOME env var to vLLM docker image (#629) by @nv-anants in #629
- feat: TRT-LLM disaggregated serving using UCX (#562) by @tanmayv25 in #562
- docs: Update support matrix by @pvijayakrish in #604
- docs: Guide for multi-node benchmarking (#561) by @kthui in #561
- fix: remove api-store from container by @mohammedabdulwahhab in #617
- docs: Guides for single node benchmarking (#509) by @kthui in #509
- fix: set worker env before worker process spawn by @ishandhanani in #614
- docs: Move trtllm dynamo run doc from example to dynamo run guide (#578) by @tanmayv25 in #578
- chore: update ai-dynamo-vllm wheel version (#598) by @nv-anants in #598
- fix: bump bento to 1.4.8 (#579) by @mohammedabdulwahhab in #579
- fix: update yum install in wheel-builder image (#605) by @nv-anants in #605
- docs: update dynamo serve trtllm agg example yaml files (#600) by @ziqif-nv in #600
- chore: use latest nixl for docker builds by @nv-anants in #596
- chore: update versions to 0.1.1 by @nv-anants in #552
- docs: Updated dynamo run instructions by @cdgamarose-nv in #555
- feat: Add manylinux support for Dynamo by @pvijayakrish in #536
- docs: Clarify the --max-local-prefill-length help description by @kthui in #554
- feat: Add dynamo env CLI option to provide information about user environment by @nv-tusharma in #533
- docs: add disagg tuning guide by @tedzhouhk in #413
- fix: let dynamo run pass --help to dynamo-run by @ziqif-nv in #547
- chore: Update TRTLLM version. Fix router. by @tanmayv25 in #527
- fix: unify and enable dynamo logging by @ishandhanani in #520
- feat(dynamo-run): Basic routing choice by @grahamking in #524
- fix: clean unused bento pieces from serve.py and serving.py by @ishandhanani in #532
- docs: update close-deployment in dynamo_serve.md by @tlipoca9 in #535
- feat: update operator README by @julienmancuso in #544
- fix: mypy error by @ishandhanani in #543
- feat: cleanup operator code by @julienmancuso in #529
- chore: Fixed file headers. Added attributions. by @dmitry-tokarev-nv in #530
- fix: Remove api-server code by @mohammedabdulwahhab in #526
- docs: hello world and vllm process docs by @ishandhanani in #525
- feat: KV recorder for dumping router events into a jsonl by @PeaBrane in #505
- chore: cleaner required workers check (don't spam print) by @PeaBrane in #521
- docs: dynamo-run clarify engine list by @grahamking in #522
- chore: Upgrade Rust to 1.86 by @grahamking in #518
- chore: Add devops in more CODEOWNERS by @grahamking in #512
- feat: Python decorator dynamo_worker takes optional static parameter without etcd by @grahamking in #494
- fix: broken link to dynamo run by @lkm2835 in #517
- docs: add 405b disaggregated serving documentation by @ishandhanani in #496
- refactor: migrate engines to standalone crates by @ryanolson in #453
- feat: Add TensorRT-LLM example for dynamo serve/run by @tanmayv25 in #456
- docs: Remove invalid link by @grahamking in #506
- docs: add instruction to copy dynamo-run in container setup by @hanweisen in #508
- chore: Add libclang-dev to CI for llamacpp by @grahamking in #507
- chore: rename duration to timeout by @tlipoca9 in #503
- fix: adding missing file by @ryanolson in #501
- feat: allow replicas to be set in DynamoDeployment CR by @julienmancuso in #486
- chore: Disable blank issue creation for default issues template by @nv-tusharma in #492
- chore: Remove <> from title + add labels for default issues template. by @nv-tusharma in #491
- feat: Sets the code of conduct for the repository by @saturley-hall in #454
- fix: Consolidate dynamo start and dynamo serve commands by @mohammedabdulwahhab in #405
- feat: improve serve commands and expose DYNAMO_HOME env var by @jon-chuang in #436
- feat: kv aware router executable by @ryanolson in #399
- feat: deploy and use buildkit to build dynamo images by @julienmancuso in #450
- feat(serve): Enhance multi-node deployment and worker configuration by @ishandhanani in #457
- chore: Add default issue template for bug & feature requests by @nv-tusharma in #471
- feat: unified logging by @ryanolson in #472
- feat: add devcontainer to dynamo for Ubuntu 24.04 use by @h...
Dynamo Release v0.1.0
Dynamo v0.1.0 will be released following Jensen Huang’s GTC keynote, and the product will be hosted on github.com/ai-dynamo. It’s an open source project under the Apache 2.0 license, and public continuous integration will be available from the start to enable industry-wide collaboration. The primary distribution will be through pip wheels with minimal binary size. The ai-dynamo GitHub organization will host two repositories: dynamo and NIXL.
Initial Dynamo release features:
- Disaggregated serving with X prefill and Y decode nodes
- KV aware routing
- KV cache manager to offload KV cache to system memory
- NIXL support for RDMA (InfiniBand, Ethernet) and TCP
- Support for K8s deployment
As a vendor-agnostic serving framework, Dynamo supports multiple LLM inference engines including TRT-LLM, vLLM, and SGLang at launch, with varying degrees of maturity and support. Dynamo supports the vLLM engine with all the capabilities mentioned above, and the plan is to bring the other inference engines to feature parity as soon as possible.
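To make the KV aware routing feature listed above more concrete, here is a minimal sketch of the general idea: hash the request's token blocks so that each hash identifies its whole prefix, then send the request to the worker whose cache already covers the longest prefix. The function names, the block size, and the use of SHA-256 are assumptions for illustration and are not taken from the Dynamo router.

```python
import hashlib


def block_hashes(tokens, block_size=4):
    """Chain-hash full token blocks; each hash depends on all earlier blocks."""
    hashes, parent = [], b""
    full = len(tokens) - len(tokens) % block_size  # ignore the partial tail block
    for i in range(0, full, block_size):
        payload = parent + ",".join(map(str, tokens[i:i + block_size])).encode()
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent)
    return hashes


def pick_worker(request_tokens, worker_blocks, block_size=4):
    """Return the worker whose cached blocks cover the longest request prefix.

    worker_blocks maps a (hypothetical) worker name to the set of block
    hashes it currently holds.
    """
    hashes = block_hashes(request_tokens, block_size)

    def overlap(cached):
        matched = 0
        for h in hashes:
            if h not in cached:
                break
            matched += 1
        return matched

    return max(worker_blocks, key=lambda w: overlap(worker_blocks[w]))


workers = {
    "w0": set(block_hashes([1, 2, 3, 4, 5, 6, 7, 8])),
    "w1": set(block_hashes([1, 2, 3, 4, 9, 9, 9, 9])),
}
choice = pick_worker([1, 2, 3, 4, 5, 6, 7, 8, 10], workers)
```

Here `choice` is the worker that already holds both leading blocks of the request. A production router would typically also weigh worker load, which this sketch leaves out.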
Future plans
The next release of Dynamo plans to open-source the KV cache manager as a standalone repository under the ai-dynamo organization. This release will provide functionality for storing and evicting KV cache across multiple memory tiers, including GPU, system memory, local SSD, and object storage.
In that release, we will include an early version of the Dynamo Planner, another core component. This initial release will feature heuristic-based dynamic allocation of GPU workers between prefill and decode tasks, as well as model and fleet configuration adjustments based on user traffic patterns. Our vision is to evolve the Planner into a reinforcement learning platform, which will allow users to define objectives and then tune and optimize performance policies automatically based on system feedback.
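As a rough illustration of what heuristic-based allocation of GPU workers between prefill and decode could look like, the sketch below moves one worker at a time toward whichever pool is under pressure. The function `rebalance`, its signals (prefill queue depth, decode KV cache utilization), and the thresholds are hypothetical and are not the Planner's actual policy.

```python
def rebalance(prefill_workers, decode_workers,
              prefill_queue_depth, decode_kv_utilization,
              queue_high=8, kv_high=0.9):
    """Shift at most one worker per call between the prefill and decode pools.

    All inputs and thresholds are illustrative assumptions:
    - prefill_queue_depth: requests waiting for prefill
    - decode_kv_utilization: fraction of decode KV cache in use (0.0 to 1.0)
    """
    prefill_hot = prefill_queue_depth > queue_high
    decode_hot = decode_kv_utilization > kv_high

    # Act only when exactly one pool is under pressure, and never
    # shrink the donor pool below one worker.
    if prefill_hot and not decode_hot and decode_workers > 1:
        return prefill_workers + 1, decode_workers - 1
    if decode_hot and not prefill_hot and prefill_workers > 1:
        return prefill_workers - 1, decode_workers + 1
    return prefill_workers, decode_workers


new_split = rebalance(2, 6, prefill_queue_depth=20, decode_kv_utilization=0.5)
```

With a long prefill queue and an underused decode pool, `new_split` shifts one worker from decode to prefill. Moving a single worker per call keeps the adjustment gradual; when both pools are hot the sketch does nothing, which is where adding capacity rather than reshuffling would be needed.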
Dynamo is designed as the ideal next-generation inference server, building upon the foundations of the Triton Inference Server. While Triton focuses on single-node inference deployments, we are committed to integrating its robust single-node capabilities into Dynamo within the next several months. We will maintain ongoing support for Triton while ensuring a seamless migration path for existing users to Dynamo once feature parity is achieved.