Releases: ai-dynamo/dynamo

Dynamo Release v0.3.2

18 Jul 05:21
50f3636

Dynamo is a high-performance, low-latency inference framework designed to serve generative AI models—across any framework, architecture, or deployment scale. It's an open source project under the Apache 2.0 license. Dynamo is available for installation via pip wheels and containers from NVIDIA NGC.
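
For a quick start, the wheels can be installed directly from PyPI. The sketch below is a minimal example; the [vllm] extra is an assumption, so check the installation guide for the exact extras published with this release.

    # Install the Dynamo wheel from PyPI.
    # The [vllm] extra is an assumption; see the install guide for the
    # extras available in this release.
    pip install "ai-dynamo[vllm]"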

As a vendor-neutral serving framework, Dynamo supports multiple large language model (LLM) inference engines to varying degrees:

  • NVIDIA TensorRT-LLM
  • vLLM
  • SGLang

Major Features and Improvements

Engine Support and Routing

  • An example standalone router is now provided for use outside of Dynamo (#1409).
  • The new SLA-based planner dynamically manages resource allocation based on service-level objectives (#1420).
  • Data-parallel vLLM worker setups are now supported (#1513).
  • SGLang support was extended for DeepEP deployments (#1120).
  • Clean shutdown is now available for vllm_v1 and SGLang engines (#1562, #1764).
  • Experimental WideEP with EPLB support is now available for TensorRT-LLM, in both aggregated and disaggregated serving (#1652, #1690).
  • The router now tracks approximate KV cache residency and predicts active KV blocks for improved routing efficiency (#1636, #1638, #1731).

Observability and Metrics

  • Native DCGM and Prometheus integration enables hardware metrics collection and export (#1488, #1701); see the metrics sketch after this list.
  • New Grafana dashboards offer composite software and hardware system visibility (#1788).
  • Batch /completions endpoint and speculative decoding metrics are now supported for vLLM (#1626, #1549).
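
As a quick way to verify the Prometheus integration, the metrics endpoint of a running deployment can be scraped directly. This is a hedged sketch: the localhost:8000 address is an assumption and depends on how your frontend is configured.

    # Spot-check Prometheus metrics on a running deployment.
    # localhost:8000 is an assumption; substitute your frontend address.
    curl -s http://localhost:8000/metrics | grep -i dynamo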

Deployment, Kubernetes, and CLI

  • The Kubernetes operator now supports custom entrypoints, command overrides, and simplified graph deployments (#1396, #1708, #1877, #1893).
  • Example manifests for multimodal and minimal deployments were added (#1836, #1872).
  • Graph Helm chart logic, resource requests, and health probes were improved (#1877, #1888).
  • Two new Helm charts, dynamo-platform and dynamo-crds, are introduced in this release, enabling modular and robust Kubernetes deployments across a variety of topologies and operational requirements; see the install sketch after this list.
  • The dynamo-run command line interface now supports a --version flag, along with improved error handling and validation (#1596, #1674, #1623).
  • Docker and Kubernetes deployment workflows were streamlined. Helm charts and container images were improved (#1742, #1796, #1840, #1841).
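
The following is a minimal sketch of the new chart-based install flow, together with the new CLI version flag. The chart paths and the namespace are hypothetical placeholders; consult the deployment guide for the published chart sources.

    # Install CRDs first, then the platform chart.
    # Chart paths and the namespace are hypothetical placeholders.
    helm install dynamo-crds ./dynamo-crds \
      --namespace dynamo-system --create-namespace
    helm install dynamo-platform ./dynamo-platform \
      --namespace dynamo-system

    # The CLI now reports its version (#1596).
    dynamo-run --version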

Developer Experience

  • Embedding request handling was improved with frontend tokenization (#1494).
  • OpenAI API request validation is now available (#1674).
  • Batch embedding and parallel tokenization improve efficiency for batch inference and embedding (#1657).
  • The /responses endpoint and additional API features were added (#1694); see the request sketch after this list.
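
Because the frontend exposes an OpenAI-compatible API, the new endpoints can be exercised with plain HTTP requests. The sketch below targets /v1/completions; the address and model name are assumptions for illustration only.

    # Hypothetical request to the OpenAI-compatible frontend.
    # Host, port, and model name are placeholders for your deployment.
    curl http://localhost:8000/v1/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "Qwen/Qwen3-0.6B", "prompt": "Hello,", "max_tokens": 16}'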

Bug Fixes

  • Issues related to GPU resource specifications in deployments, container builds, and runtime were fixed (#1826, #1792, #1546).
  • Helm chart logic, resource requests, and health probes were corrected (#1877, #1893).
  • Error handling and model loading were improved for multimodal and distributed deployments (#1545).
  • Metrics publishing and logging were fixed for vLLM, SGLang, and OpenAI endpoints (#1864, #1649, #1639).
  • Process cleanup issues were resolved in tests (#1801).

Documentation

  • Documentation updates include new guides for Ray setup, architecture diagrams, and deployment modes (#1947, #1697).
  • Benchmarking, troubleshooting, and advanced usage scenario documentation was enhanced.
  • Deprecation notes were added for outdated connectors (#1964, #1959).

Build, CI, and Test

  • Dependency upgrades include protobuf, nats, and etcd (#1876, #1744).
  • CI coverage now includes GPU-based and multi-engine tests.
  • Container builds now use distroless images for improved security and efficiency (#1570, #1569).
  • Fault tolerance tests were added (#1444).

Known Issues

  • KVBM is supported only with Python 3.12.

Release Assets

Python wheels, Rust crates, containers, and Helm charts are published for this release.

Contributors

Thank you to all contributors for this release. For a full list, refer to the changelog.

Dynamo Release v0.3.1

01 Jul 17:59
e117295

Dynamo is an open source project under the Apache 2.0 license. The primary distribution is done through pip wheels with minimal binary size. The ai-dynamo GitHub organization hosts two repositories: Dynamo and NIXL. Dynamo is designed as the next-generation inference server, building upon the foundation of NVIDIA® Triton Inference Server™. While Triton focuses on single-node inference deployments, we're integrating its robust capabilities into Dynamo over the next several months. We'll maintain support for Triton while providing a clear migration path for existing users once Dynamo achieves feature parity.

As a vendor-neutral serving framework, Dynamo supports multiple large language model (LLM) inference engines to varying degrees:

  • NVIDIA TensorRT-LLM
  • vLLM
  • SGLang

Dynamo v0.3.1 features:

  • Functional DeepSeek R1 disaggregated serving with wide EP using SGLang
  • Functional EPD disaggregation with a video model (LLaVA Video 7B)
  • Proof-of-concept inference gateway support
  • Prebuilt Dynamo + vLLM container
    • We plan to release these pre-built containers in the coming days
  • Amazon Linux support

Future plans
Dynamo Roadmap

Known Issues

  • KVBM is supported only with Python 3.12


Dynamo Release v0.3.0

05 Jun 20:51
15ca948

Dynamo is an open source project under the Apache 2.0 license. The primary distribution is done through pip wheels with minimal binary size. The ai-dynamo GitHub organization hosts two repositories: Dynamo and NIXL. Dynamo is designed as the next-generation inference server, building upon the foundation of NVIDIA® Triton Inference Server™. While Triton focuses on single-node inference deployments, we're integrating its robust capabilities into Dynamo over the next several months. We'll maintain support for Triton while providing a clear migration path for existing users once Dynamo achieves feature parity.

As a vendor-neutral serving framework, Dynamo supports multiple large language model (LLM) inference engines to varying degrees:

  • NVIDIA TensorRT-LLM
  • vLLM
  • SGLang

Dynamo v0.3.0 features:

  • Dynamo run with KV routing and multiple-model support (guide; see the sketch after this list)
  • vLLM v1 engine support (example)
  • SGLang with DP attention (example)
  • SLA-based planner (guide)
  • Optimized embedding transfer for multimodal models (example)
  • Dynamo deploy update command (guide)
  • Model caching using Fluid (guide)
  • FluxCD guide to managing custom resources
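
As referenced in the first item above, here is a hedged sketch of serving a model with Dynamo run behind an HTTP frontend. The in=/out= arguments and the model name are assumptions based on the linked guide, not verified syntax; consult the guide for exact usage.

    # Hypothetical dynamo run invocation: HTTP frontend in, vLLM engine out.
    # Arguments and model name are assumptions; see the guide for details.
    dynamo run in=http out=vllm Qwen/Qwen3-0.6B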

Future plans
Dynamo Roadmap

Known Issues

  • KVBM is supported only with Python 3.12


Dynamo Release v0.2.1

22 May 23:45
b950ec5

Dynamo is an open source project under the Apache 2.0 license. The primary distribution is done through pip wheels with minimal binary size. The ai-dynamo GitHub organization hosts two repositories: Dynamo and NIXL. Dynamo is designed as the next-generation inference server, building upon the foundation of the Triton Inference Server. While Triton focuses on single-node inference deployments, we are committed to integrating its robust single-node capabilities into Dynamo within the next several months. We will maintain ongoing support for Triton while ensuring a seamless migration path to Dynamo for existing users once feature parity is achieved. As a vendor-agnostic serving framework, Dynamo supports multiple LLM inference engines, including TRT-LLM, vLLM, and SGLang, with varying degrees of maturity and support.

Dynamo v0.2.1 features:

  • KV Block Manager (intro)
  • Improved vLLM performance by avoiding re-initialization of sampling params
  • SGLang support (README.md)
  • Multi-Modal E/P/D Disaggregation (README.md)
  • LeaderWorkerSet support for Kubernetes
  • Qwen3, Gemma3, and Llama 4 in Dynamo Run

Future plans

Dynamo Roadmap

Known Issues

  • Benchmark guides are still being validated on public cloud instances (GCP / AWS)

What's Changed

🐛 Bug Fixes

  • fix: Extract tokenizer from GGUF for Qwen3 and Gemma3 arch by @grahamking in #1011


Dynamo Release v0.2.0

01 May 00:33
ca728f6

Dynamo is an open source project under the Apache 2.0 license. The primary distribution is done through pip wheels with minimal binary size. The ai-dynamo GitHub organization hosts two repositories: Dynamo and NIXL. Dynamo is designed as the next-generation inference server, building upon the foundation of the Triton Inference Server. While Triton focuses on single-node inference deployments, we are committed to integrating its robust single-node capabilities into Dynamo within the next several months. We will maintain ongoing support for Triton while ensuring a seamless migration path to Dynamo for existing users once feature parity is achieved. As a vendor-agnostic serving framework, Dynamo supports multiple LLM inference engines, including TRT-LLM, vLLM, and SGLang, with varying degrees of maturity and support.

Dynamo v0.2.0 features:

  • GB200 support with ARM builds (Note: currently requires a container build)
  • Planner - new experimental support for spinning workers up and down based on load
  • Improved K8s deployment workflow
    • Installation wizard to enable easy configuration of Dynamo on your Kubernetes cluster
    • CLI to manage your operator-based deployments
    • Consolidated Custom Resources for Dynamo deployments
    • Documentation improvements (including Minikube guide to installing Dynamo Platform)

Future plans

Dynamo Roadmap

Known Issues

  • Benchmark guides are still being validated on public cloud instances (GCP / AWS)
  • Benchmarks on internal clusters show a 15% degradation from results displayed in summary graphs for multi-node 70B and are being investigated.
  • TensorRT-LLM examples are not working in this release, but fixes are underway on main.


Dynamo Release v0.1.1

16 Apr 20:44
926370b

Dynamo is an open source project under the Apache 2.0 license. The primary distribution is done through pip wheels with minimal binary size. The ai-dynamo GitHub organization hosts two repositories: Dynamo and NIXL. Dynamo is designed as the next-generation inference server, building upon the foundation of the Triton Inference Server. While Triton focuses on single-node inference deployments, we are committed to integrating its robust single-node capabilities into Dynamo within the next several months. We will maintain ongoing support for Triton while ensuring a seamless migration path to Dynamo for existing users once feature parity is achieved. As a vendor-agnostic serving framework, Dynamo supports multiple LLM inference engines, including TRT-LLM, vLLM, and SGLang, with varying degrees of maturity and support.

Dynamo v0.1.1 features:

  • Benchmarking guides for Single and Multi-Node Disaggregation on H100 (vLLM)
  • TensorRT-LLM support for KV Aware Routing
  • TensorRT-LLM support for Disaggregation
  • ManyLinux and Ubuntu 22.04 Support for wheels and crates
  • Unified logging for Python and Rust

Future plans

  • Instructions for reproducing benchmark guides on GCP and AWS
  • KV Cache Manager as a standalone repository under the ai-dynamo organization. This release will provide functionality for storing and evicting KV cache across multiple memory tiers, including GPU, system memory, local SSD, and object storage.
  • Searchable user guides and documentation
  • Multi-node instances for large models
  • Initial Planner version supporting dynamic scaling of P / D workers. We will include an early version of the Dynamo Planner, another core component. This initial release will feature heuristic-based dynamic allocation of GPU workers between prefill and decode tasks, as well as model and fleet configuration adjustments based on user traffic patterns. Our vision is to evolve the Planner into a reinforcement learning platform, which will allow users to define objectives and then tune and optimize performance policies automatically based on system feedback.
  • vLLM 1.0 support with NIXL and KV Cache Events

Known Issues

  • Benchmark guides are still being validated on public cloud instances (GCP / AWS)
  • Benchmarks on internal clusters show a 15% degradation from results displayed in summary graphs for multi-node 70B and are being investigated.


Dynamo Release v0.1.0

18 Mar 03:37

Dynamo v0.1.0 will be released following Jensen Huang's GTC keynote, and the product will be hosted on github.com/ai-dynamo. It's an open source project under the Apache 2.0 license, and public continuous integration will be available from the start to enable industry-wide collaboration. The primary distribution will be through pip wheels with minimal binary size. The ai-dynamo GitHub org will host two repos: dynamo and NIXL.

Initial Dynamo release features:

  • Disaggregated serving with X prefill and Y decode nodes
  • KV aware routing
  • KV cache manager to offload KV cache to system memory
  • NIXL support for RDMA (InfiniBand, Ethernet) and TCP
  • Support for K8s deployment

As a vendor-agnostic serving framework, Dynamo supports multiple LLM inference engines at launch, including TRT-LLM, vLLM, and SGLang, with varying degrees of maturity and support. Dynamo supports the vLLM engine with all of the capabilities listed above, with a plan to reach feature parity across the remaining engines as soon as possible.

Future plans
The next release of Dynamo plans to open-source the KV cache manager as a standalone repository under the ai-dynamo organization. This release will provide functionality for storing and evicting KV cache across multiple memory tiers, including GPU, system memory, local SSD, and object storage.

In that release, we will include an early version of the Dynamo Planner, another core component. This initial release will feature heuristic-based dynamic allocation of GPU workers between prefill and decode tasks, as well as model and fleet configuration adjustments based on user traffic patterns. Our vision is to evolve the Planner into a reinforcement learning platform, which will allow users to define objectives and then tune and optimize performance policies automatically based on system feedback.

Dynamo is designed as the ideal next-generation inference server, building upon the foundations of the Triton Inference Server. While Triton focuses on single-node inference deployments, we are committed to integrating its robust single-node capabilities into Dynamo within the next several months. We will maintain ongoing support for Triton while ensuring a seamless migration path to Dynamo for existing users once feature parity is achieved.