
v0.20.0rc3

Pre-release
Released by @Shixiaowei02 on 20 May 09:42 · 417 commits to main since this release · 039f7e3

Highlights

  • Model Support
    • Support Mistral Small 3.1 24B VLM in TRT workflow (#4183)
    • Support Gemma3-1b-it in PyTorch workflow (#3999)
  • Features
    • Adopt new logprob definition in PyTorch flow (#4057)
    • Support multiple LoRA adapters and TP (#3885)
    • Add Piecewise CUDA Graph support (#3804)
    • Add KV cache-aware router for disaggregated serving (#3831)
    • Enable per-request stats with PyTorch backend (#4156)
    • Support DeepSeek-R1 W4A8 on Hopper (#4123)
    • Enable chunked context for FlashInfer (#4132)
    • Support KV cache reuse for MLA (#3571)
  • API
    • Allow overriding CLI arguments with YAML file in trtllm-serve (#4164)
    • Remove deprecated GptSession/V1 from TRT workflow (#4092)
  • Bug Fixes
    • Fix attention DP bug on Qwen3 MoE model (#4141)
    • Fix illegal memory access when running LLaMA 4 with CUDA Graph enabled (#4101)
  • Benchmark
    • Remove deprecated Python runtime benchmark (#4171)
    • Add benchmark support for scaffolding (#4286)
  • Performance
  • Infrastructure
    • The base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:25.04-py3 (#4049)
    • The dependent TensorRT version is updated to 10.10.0 (#4049)
    • The dependent CUDA version is updated to 12.9.0 (#4049)
    • The dependent public PyTorch version is updated to 2.7.0.
    • The pre-built TensorRT-LLM wheel on PyPI is now linked against PyTorch 2.7.0, which uses the CXX11 ABI (#4235)
  • Documentation
  • Known Issues
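For the YAML override item under API (#4164), a config file passed to trtllm-serve can take precedence over command-line arguments. The sketch below is illustrative only: the flag name `--extra_llm_api_options` and the field names are assumptions from memory, not verified against this release; consult the trtllm-serve documentation for the exact interface.

```yaml
# config.yaml: values here override the corresponding trtllm-serve CLI arguments
max_batch_size: 64
kv_cache_config:
  enable_block_reuse: true
```

This would be launched as `trtllm-serve <model> --extra_llm_api_options config.yaml`.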
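The KV cache-aware router for disaggregated serving (#3831) can be illustrated with a toy sketch. This is not TensorRT-LLM's implementation; the class and block size below are hypothetical. The idea it demonstrates: prefer the worker whose cached KV blocks cover the longest prefix of the incoming prompt, so reusable cache is actually reused, and break ties by load.

```python
BLOCK = 4  # tokens per KV cache block (toy value for illustration)

def _blocks(tokens):
    """Split a token list into hashable fixed-size block keys."""
    return [tuple(tokens[i:i + BLOCK])
            for i in range(0, len(tokens) - BLOCK + 1, BLOCK)]

class KVAwareRouter:
    """Toy cache-aware router: longest cached prefix wins, then least load."""

    def __init__(self, workers):
        self.load = {w: 0 for w in workers}        # outstanding requests per worker
        self.cache = {w: set() for w in workers}   # block keys cached per worker

    def route(self, prompt_tokens):
        blocks = _blocks(prompt_tokens)

        def score(worker):
            # Count the contiguous prefix of blocks already cached on this worker;
            # a longer cached prefix means less prefill work on re-routing.
            hits = 0
            for b in blocks:
                if b in self.cache[worker]:
                    hits += 1
                else:
                    break
            return (hits, -self.load[worker])

        best = max(self.load, key=score)
        self.load[best] += 1
        self.cache[best].update(blocks)  # worker now holds these blocks
        return best
```

Repeated prompts with a shared prefix stick to the same worker (cache hits), while unrelated prompts spill to less-loaded workers.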

What's Changed

New Contributors

Full Changelog: v0.20.0rc2...v0.20.0rc3