v0.20.0rc3
Pre-release
Highlights
- Model Support
- Features
  - Adopt new logprob definition in PyTorch flow (#4057)
  - Support multiple LoRA adapters and TP (#3885)
  - Add Piecewise CUDA Graph support (#3804)
  - Add KV cache-aware router for disaggregated serving (#3831)
  - Enable per-request stats with PyTorch backend (#4156)
  - Support DeepSeek-R1 W4A8 on Hopper (#4123)
  - Enable chunked context for FlashInfer (#4132)
  - Support KV cache reuse for MLA (#3571) (see the usage sketch after the What's Changed list)
- API
- Bug Fixes
- Benchmark
- Performance
- Infrastructure
  - The base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:25.04-py3 (#4049)
  - The dependent TensorRT version is updated to 10.10.0 (#4049)
  - The dependent CUDA version is updated to 12.9.0 (#4049)
  - The dependent public PyTorch version is updated to 2.7.0
  - The pre-built TensorRT-LLM wheel on PyPI is now linked against PyTorch 2.7.0, which uses the CXX11 ABI (#4235)
- Documentation
- Known Issues
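The Infrastructure notes above pin the toolchain to TensorRT 10.10.0, CUDA 12.9.0, and PyTorch 2.7.0 (#4049, #4235). Below is a minimal sanity-check sketch, assuming the updated nvcr.io/nvidia/pytorch:25.04-py3 base image or the PyPI wheel built against PyTorch 2.7.0 is already installed; exact patch versions printed on a given system may differ.

```python
# Hedged sketch: report the installed dependency versions and compare them
# informally against the ones called out in the Infrastructure highlights.
import tensorrt
import torch
import tensorrt_llm

print("TensorRT :", tensorrt.__version__)       # release notes say 10.10.0
print("PyTorch  :", torch.__version__)          # release notes say 2.7.0
print("CUDA     :", torch.version.cuda)         # release notes say 12.9.0
print("TRT-LLM  :", tensorrt_llm.__version__)   # this pre-release: 0.20.0rc3
```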
What's Changed
- feat: adopt new logprob definition in PyTorch flow by @tongyuantongyu in #4057
- infra: Add NIXL into the Dockerfile by @Shixiaowei02 in #3981
- feat: support multi lora adapters and TP by @shaharmor98 in #3885
- feat: Fallback to NCCL for various patterns when input size is large. by @hyukn in #4080
- Cherry-pick trtllm-gen from feat/llama4 to main by @chenfeiz0326 in #4086
- [fix] [AutoDeploy] flashinfer usage on H100 by @lucaslie in #4162
- fix: Fix incorrect conversion of Gen TPS/user by @FrankD412 in #4112
- [fix] Fix llama4 + eagle3 by @mikeiovine in #3998
- Support RingAttention in the BertAttention plugin and the DiT model by @ChunhuanLin in #3661
- fix: alltoall padding for chunked MoE by @dongxuy04 in #4157
- [feat] Allow overriding cli args with yaml file in trtllm-serve by @pcastonguay in #4164
- [TRTLLM-5147][Qwen3] fix: fix bug of attention dp on qwen3_moe model by @byshiue in #4141
- chore: Clean up the legacy DeepseekAllreudceFusionOp. by @hyukn in #4081
- test: add qwen3 and disaggregated serving accuracy tests to qa test list by @StanleySun639 in #4083
- [TRTLLM-3105][feat] Add Piecewise CUDA Graph Support by @yizhang-nv in #3804
- fix: change pp broadcast pattern for LPs by @hchings in #4130
- [#4085][fix] Fix apply_per_channel_scale for extremely large input sequence length. by @StudyingShao in #4089
- [nvbug/5262268][fix] Fix trtllm-bench for llama 4 by @mikeiovine in #4104
- chore: Fix pipeline break caused by previous PR (#4081) rebase + pipeline reuse by @hyukn in #4169
- [https://nvbugspro.nvidia.com/bug/5260676]test: skip fp8 quantization case for pre-ada by @crazydemo in #4095
- test: move mistral / mixtral test cases in QA test list into the new accuracy test suite by @crazydemo in #3440
- test: Add fp8kv to DS-v3-lite integration tests. by @bobboli in #3950
- [fix] Fix relaxed acceptance to support enabling it in context phase by @lfr-0531 in #4126
- test: skip tests on b200 by @xinhe-nv in #3913
- infra: Fix pipeline step error in post merge by @ZhanruiSunCh in #3948
- fix: library path of nixl by @Shixiaowei02 in #4184
- test: amend default pytorch extra-llm-api-config.yml in perf test by @ruodil in #4176
- [fix] Fix add_dummy_requests for spec decoding cases by @lfr-0531 in #4084
- test: [CI] Add failed cases into waives.txt by @xinhe-nv in #4165
- feat: support task collection for to collect information (#3328) by @WeiHaocheng in #3824
- Cherry-pick: Use multi-threading to load MoE expert weights by @chenfeiz0326 in #4137
- test: amend regex match for perf throughput by @ruodil in #4186
- chore: reduce size of the docker images by @MartinMarciniszyn in #3990
- [fix] trtllm-gen mla kernel warnings by @zhhuang-nv in #4119
- chore: Deprecate evaltool by @Tracin in #4173
- [fix][nvbug/5244009] Fix llama 4 test lists/scout accuracy issue by @mikeiovine in #4069
- perf: [TRTLLM-4717][perf] Set CUDA graph max batch size and padding in throughput benchmark. by @FrankD412 in #3875
- Refactor: Restructure C++ tests for better modularisation of non-shared code by @DomBrown in #4027
- Updating the multimodal models README to add steps for running phi-4-multimodal instruct by @mayani-nv in #3932
- fix: draft target README and assertion for logits-based acceptance by @mayani-nv in #4167
- Add initial list of CODEOWNERS by @kevinch-nv in #4105
- chore: PR to fix the formatting errors by @mayani-nv in #4200
- test: Remove CNN Dailymail tasks in favor of GSM8K by @syuoni in #4187
- [CI] waive two multi-gpu test cases by @QiJune in #4206
- [CI] update pytorch only file list by @QiJune in #4210
- chore:update modelopt to 0.29 by @nv-guomingz in #4150
- [Infra] Waive L0 test by @yiqingy0 in #4212
- remove cache_transceiver_prealloc_size by @chuangz0 in #4153
- [TRTQA-2802][fix]: add --host for mgmn serve examples script by @xinhe-nv in #4175
- tests: https://nvbugs/5219534 remove failed tests from test list by @xinhe-nv in #4113
- test: add llama_3.2_1B model and fix for test lora script issue by @ruodil in #4139
- chore: Update CODEOWNERS by @Funatiq in #4221
- [https://nvbugspro.nvidia.com/bug/5270564][test] skip per-hopper for llama4 by @crazydemo in #4211
- [TRTLLM-4911] feat(scaffolding): make sampling_params only setable by controller by @dc3671 in #4151
- Feat: support exporting softmax statistics and update the kernel-selection heuristic by @PerkzZheng in #4155
- infra: [TRTLLM-325] Prepare for NGC release - multiplatform build by @MartinMarciniszyn in #4191
- [feat] Support HyperCLOVAX-SEED-Text language part by @yechank-nvidia in #3902
- feat: Support the Structural Tag in guided decoding by @Ubospica in #4066
- feat: add kv cache aware router by @zhengd-nv in #3831
- refactor: Allow models to override apply_qk_norm. by @yuxianq in #4078
- [https://nvbugs/5214229] [fix] Unwaive lm_head quantization case by @syuoni in #4222
- doc: update switcher.json config by @niukuo in #4220
- Revert "Add initial list of CODEOWNERS (#4105)" by @Funatiq in #4234
- [TRTLLM-5188] fix: [AutoDeploy] update output shape of prepare_fused_mha_metadata_fake by @Fridah-nv in #4199
- fix: Reset planned states to avoid memory leak in TrtllmAttentionWrapper by @yuxianq in #4227
- Feat: Variable-Beam-Width-Search (VBWS) part4 by @wili-65535 in #3979
- [TRTLLM-5081] [test] Align parametrize_with_ids to the pytest behavior by @syuoni in #4090
- fix: reshape token_ids for lp in torch backend by @hchings in #4239
- feat: Add heuristic for GroupRMSNorm kernel selection. by @SimengLiu-nv in #4047
- [TRTLLM-5050][feat] Enable per-request stats with PyT backend by @pcastonguay in #4156
- test: [CI] Add failed cases into waives.txt by @xinhe-nv in #4203
- test: [CI] Add failed cases into waives.txt by @xinhe-nv in #4205
- test: fix for perf test script issue by @ruodil in #4230
- doc: update qwen3 document by @byshiue in #4246
- feat: Prefetch safetensors files before loading them by @nvpohanh in #4140
- Fix Pipeline Parallelism in Llama4 by @v-shobhit in #4106
- [https://nvbugspro.nvidia.com/bug/5238626] illegal memory address when running llama 4 with cuda graph enabled by @PerkzZheng in #4101
- [Infra][TRTLLM-4374] Upgrade TRT 10.10.0 GA, CUDA 12.9 GA and DLFW 25.04 by @yiqingy0 in #4049
- [https://nvbugs/5220763] [test] Unwaive Mixtral FP8 TP2 test by @syuoni in #4252
- [nvbugs/5268808][fix] Fix the potential out-of-range-access issue of allreduce workspace. by @hyukn in #4159
- Waive stress test. by @dominicshanshan in #4262
- [TRTLLM-5233][feat]: Add chunking to PyT heuristic for trtllm-bench. by @FrankD412 in #4133
- [Infra] Waive L0 test by @yiqingy0 in #4268
- [Infra] Waive L0 test by @yiqingy0 in #4269
- feat: Support Mistral Small 3.1 24B VLM in TRT workflow by @brb-nv in #4183
- Waive disagg kv cache load balancer test by @Tabrizian in #4276
- fix: Merge PP overlap and non-overlap executor loop by @amukkara in #3878
- test: Validate FP8 and LoRA for Gemma3 by @brb-nv in #3670
- chore: bump version to 0.20.0rc3 by @ZhanruiSunCh in #4261
- [TRTLLM-5188] fix: [AutoDeploy] unwaive AD build test by @Fridah-nv in #4273
- [chore] update CI allowlist 2025-05-13 by @tburt-nv in #4278
- [fix] Enable pp tests by @yizhang-nv in #3978
- feat: Support Gemma3-1b-it in Pytorch workflow by @brb-nv in #3999
- CI: add fp8/fp4 ci on Qwen3-30B-A3B by @byshiue in #4266
- test: Add UT for moe trtllmgen by @zongfeijing in #4258
- [TRTLLM-3330][feat] Support DeepSeek-R1 W4A8 on Hopper by @Barry-Delaney in #4123
- [Infra] Waive L0 test by @yiqingy0 in #4295
- feat: Low Precision Allreduce for PCIe based GPU by @kanghui0204 in #3851
- tests: PyTorch multimodal using keyword match by @amukkara in #4215
- [bug/5247505] fix: CP accuracy on Blackwell by @DylanChen-NV in #4188
- test: [CI] remove closed bugs by @xinhe-nv in #4207
- Add test case for kv memory estimation by @HuiGao-NV in #4158
- chore: Remove deprecated Python runtime benchmark by @kaiyux in #4171
- fix: Eagle decoding in TRT flow by @Funatiq in #4229
- [Infra] - Update the upstream PyTorch dependency to 2.7.0 by @chzblych in #4235
- Added tests for Llama3.1-70B-BF16 on SM120 by @farazkh80 in #4198
- feat: [AutoDeploy] DSV3 mla attn ref op by @sugunav14 in #4272
- [TRTLLM-5171] chore: Remove GptSession/V1 from TRT workflow by @Funatiq in #4092
- [fix] Remove stale cublas heuristics by @hlu1 in #4326
- [doc] Add tensorrtllm_backend serving documentation in the Deepseek-V3 README by @SimengLiu-nv in #4338
- Revert "feat: Low Precision Allreduce for PCIe based GPU" by @QiJune in #4340
- infra: open source fmha v2 kernels by @qsang-nv in #4185
- [feat] Enable chunked context for flashinfer by @mikeiovine in #4132
- [TRTLLM-2795] feat: Add yarn support for other models in trt-flow by @uchihatmtkinu in #3840
- infra: Down the gcc toolset version from 13 to 11 by @ZhanruiSunCh in #4114
- fix:https://nvbugs/5234033 enable starcoder trt-flow with transforme… by @nv-guomingz in #3909
- [test] Reorganize TestDeepSeekR1::test_nvfp4_8gpus by @hlu1 in #4346
- [test] add qa test mentioned in docs by @crazydemo in #4248
- feat:[AutoDeploy] Update MoE pattern matcher to drop expert selection logic by @Fridah-nv in #3283
- [https://nvbugs/5277113][fix]genai-perf API change stress test by @dominicshanshan in #4300
- Breaking change: perf: Enable scheduling overlap by default by @kaiyux in #4174
- feat: support kv cache reuse for MLA by @zhhuang-nv in #3571
- test: FIX test_ptp_quickstart_advanced_deepseek_v3_2nodes_8gpus by @xinhe-nv in #4283
- Add allreduce and rmsnorm fusion for qwen3 by @zongfeijing in #4304
- chore: reduce code duplication by @ixlmar in #4297
- fix: better method to help torch find nvtx3 by @tongyuantongyu in #4110
- [fix] test_no_kv_cache_reuse for overlap_scheduler by @zhhuang-nv in #4350
- test: add qa test list for rtx5090 and rtx_pro_6000 by @StanleySun639 in #4254
- Revert "[test] add qa test mentioned in docs" by @chzblych in #4355
- refactor: use x is None instead of x == None. by @yuxianq in #4244
- test(perf): Add Phi-4-mini-instruct to perf tests by @venkywonka in #4267
- enh: Enable option in trtllm-bench build subcommand to avoid loading weights by @venkywonka in #4142
- feat: [nvbugs/5261055][nvbugs/5170160] non-invasive pipeline parallelism by @yuxianq in #4034
- fix: update checks that broke medusa tests when use_py_session=True by @hchings in #4339
- Move Triton backend to TRT-LLM main by @Tabrizian in #3549
- feat: enhance trtllm serve multimodal by @yechank-nvidia in #3757
- [AutoDeploy] fix: disable overlap scheduler until supported by @lucaslie in #4365
- [TRTLLM-5054][fix] Removing repeated loading of input processor by @rakib-hasan in #4161
- [AutoDeploy]feat: Add an AutoDeploy compile backend that only calls torch.compile by @suyoggupta in #4240
- [CI] update multi-gpu test triggering file list by @QiJune in #4378
- doc: Add docstring for Attention and MLA module. by @yuxianq in #4354
- Fix bias shape in weightOnlyGroupwiseQuantMatmulPlugin for TRT workflow by @StudyingShao in #4348
- [CI] waive test_chunked_prefill test cases by @QiJune in #4380
- update README version by @ZhanruiSunCh in #4381
- feat: support benchmark on scaffolding (#3328) by @WeiHaocheng in #4286
- test: add kv cache aware test cases to qa test list by @StanleySun639 in #4257
- [TRTLLM 4571] Support dynamic per-tensor FP8 by @Tracin in #4250
- [fix] Fixed incorrect mixed precision MoE conversion by @Barry-Delaney in #4351
- test: [CI] remove closed bugs by @xinhe-nv in #4345
- fix: support TensorRT 10.11+ in FindTensorRT.cmake by @tongyuantongyu in #4353
- Change the method to calculate kv memory size in tests by @HuiGao-NV in #4332
- chore: improve log-level setting UX by @ixlmar in #4352
- chore: Mass Integration 0.19 by @dcampora in #4255
- Fix test_fused_moe_w4afp8 by @StudyingShao in #4393
- [TRTLLM-4886][infra]Try another timeout opt to exit test thread directly instead of gracefully by @EmmaQiaoCh in #4341
- feat: TRT-LLM Gen integration for BMM and MoE refactoring by @nekorobov in #4280
- [CI] waive accuracy/test_cli_flow.py::TestTinyLlama1_1BChat::test_pp4 by @liji-nv in #4397
- doc: DS r1 min latency blog by @Kefeng-Duan in #4386
- feat: [AutoDeploy] update rope matcher with minor variants (Deepseek) by @Fridah-nv in #3638
- refactor: Copy sequence lengths once in decoder setup by @Funatiq in #4102
- [AutoDeploy] configurable cache resize by @lucaslie in #4372
- fix: Fix chat template kwargs bug. by @Tracin in #4387
- fix: improve PyExecutor resource allocations by @ixlmar in #4299
- API Breaking Change + Readability: "decoder"->"sampler" by @netanel-haber in #4121
- [AutoDeploy] fix: proper process group clean up by @lucaslie in #4373
- [AutoDeploy] eager pattern matcher new pattern by @lucaslie in #4370
- [Deepseek] Add accuracy test references for fp8 kvcache by @hlu1 in #4374
- perf: Eliminate the need for attention DP padding when possible by @jinyangyuan-nvidia in #3439
- test: Waive tests for nvbugs/5286795. by @yuxianq in #4409
- Extend the Llama-Nemotron-Nano-8B perf-integration-tests (cpp) by @venkywonka in #4195
- infra: [TRTLLM-5072] Add SBSA release images by @ZhanruiSunCh in #4231
- [Infra] - Terminate the Slurm job if node does not come online in 2 hours by @yuanjingx87 in #4334
- Removing the outdated argument by @rakib-hasan in #4408
- fix: Remove real size allocation by @kaiyux in #4396
- add changes for fp8, nemotron-nas, API by @shaharmor98 in #4180
- [Infra][Docs] - Some clean-up for the CI pipeline and docs by @chzblych in #4419
- [https://nvbugspro.nvidia.com/bug/5243740][fix] deduce default max_tokens for trtllm-serve by @LinPoly in #4265
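As a usage illustration for the KV cache reuse highlight (#3571) and the broader LLM API work in this release, here is a minimal, assumption-laden sketch of enabling KV cache block reuse through the high-level LLM API. The model id is a placeholder, and option names or backend defaults may differ across releases; this is not the canonical API reference.

```python
# Hedged sketch: enable KV cache block reuse via the LLM API.
# Assumptions: placeholder model id; KvCacheConfig field names as understood
# for this release; MLA-specific reuse (#3571) applies to DeepSeek-style models.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

# Allow requests with shared prefixes to reuse previously filled cache blocks.
kv_cache_config = KvCacheConfig(
    enable_block_reuse=True,
    free_gpu_memory_fraction=0.85,
)

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model id
    kv_cache_config=kv_cache_config,
)

prompts = [
    "Summarize the TensorRT-LLM v0.20.0rc3 release in one sentence.",
    "Summarize the TensorRT-LLM v0.20.0rc3 release in one paragraph.",
]
sampling_params = SamplingParams(max_tokens=64)

for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
```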
New Contributors
- @chenfeiz0326 made their first contribution in #4086
- @ChunhuanLin made their first contribution in #3661
- @StudyingShao made their first contribution in #4089
- @Ubospica made their first contribution in #4066
- @v-shobhit made their first contribution in #4106
- @qsang-nv made their first contribution in #4185
- @uchihatmtkinu made their first contribution in #3840
- @ixlmar made their first contribution in #4297
- @nekorobov made their first contribution in #4280
Full Changelog: v0.20.0rc2...v0.20.0rc3