v0.21.0rc1
Pre-release
Highlights
- Model Support
- Add HyperCLOVAX-SEED-Vision support for PyTorch flow (#4799)
- Features
- Support generation logits in the TRTLLM Sampler (#4819); a usage sketch follows the Highlights list
- Support for large-scale EP (#4818); a configuration sketch follows the Highlights list
- Support XQA-based MLA on SM120 (#4858)
- Add PositionEmbeddingType=0 to xqa support (#4934)
- Add cache reuse support (selective cache transfer) in mla cache formatter (#4749)
- Update DeepSeek FP8 TRT-LLM Gen cubins (#4643)
- Add heuristics for checkpoint files prefetching (#4765)
- Enable NVFP4 output for TRTLLM attention kernels (#4737)
- Refactor Fused MoE (#4790)
- Add integration of etcd (#3738)
- Memoize weight shuffle index to speed up weight preproc in moe_backend=TRTLLM (#4826)
- Enable disaggregated serving for Qwen3 (#4929)
- API
- Set _AutoDeployLlmArgs as the primary config object (#4891)
- Bug Fixes
- Fix warmup phase batch size out of range (#4986)
- Fix buffer count (#5007)
- Fix nvbug 5324252: broken test_resource_manager.py (#4925)
- Fix nvbug 5280806: two-model spec decode flow (#4807)
- Fix nvbug 5324248: broken test_pytorch_model_engine.py (#4973)
- Fix cuda graph padding for spec decoding (#4853)
- Correct the order of llm request state (#4781)
- Handle OOMs during KV cache estimation (#4690)
- Only pass fast_build=true to non-pytorch backend (#4920)
- Fix the no-fusion allreduce hang (#4594)
- Deprecate AutoDeploy CI post-merge tests and keep them for local testing (#4892)
- Fix nvbug 5302895: test_trtllm_bench_llmapi_launch failure (#4835)
- Fix llama 4 long context issue (#4809)
- Fix nvbug 5300080: the bug of setting attention_chunk_size, and enable chunked attention in the generation phase by default (#4693)
- Fix nvbug 5294316: queued request stats (#4714)
- Fix max_num_sequences calculation with overlap scheduling (#4532)
- Fix trtllm-bench hang issue due to LLM API IPC (#4798)
- Fix a pd+mtp accuracy issue (#4536)
- Benchmark
- Performance
- Infrastructure
- Documentation
- Known Issues
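For the generation-logits support in the TRTLLM Sampler called out under Features (#4819), the sketch below shows one way to request logits through the LLM API. It is a minimal sketch, not taken from the PR itself: the model name is illustrative, and the gather_generation_logits / return_generation_logits knobs are assumed to behave as in earlier LLM API releases, so verify them against your installed version.

```python
# Minimal sketch (assumptions noted above): request per-token generation logits
# through the LLM API so the sampler returns them alongside the generated text.
from tensorrt_llm import LLM, SamplingParams

# Illustrative checkpoint; any model supported by the PyTorch flow should work.
llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    gather_generation_logits=True,  # assumed to be needed so logits are kept
)

params = SamplingParams(
    max_tokens=16,
    return_generation_logits=True,  # ask the sampler to hand logits back
)

for request_output in llm.generate(["The capital of France is"], params):
    completion = request_output.outputs[0]
    print(completion.text)
    # Expected shape: [num_generated_tokens, vocab_size]; None if not gathered.
    print(getattr(completion, "generation_logits", None))
```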
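Similarly, for the large-scale EP support (#4818), here is a hedged configuration sketch. The parallelism argument names (tensor_parallel_size, moe_expert_parallel_size, enable_attention_dp) are assumptions carried over from the existing LLM API rather than anything introduced by the PR, and the online EP load balancer has its own MoeLoadBalancerConfig that is not shown here.

```python
# Hedged sketch: run a large MoE checkpoint with expert parallelism via the
# LLM API. Parallelism argument names are assumptions (see the note above).
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1",  # illustrative large-MoE checkpoint
    tensor_parallel_size=8,
    moe_expert_parallel_size=8,       # shard experts across ranks (EP)
    enable_attention_dp=True,         # data-parallel attention alongside EP
)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=8))
print(outputs[0].outputs[0].text)
```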
What's Changed
- upgrade cutlass to 4.0 by @yunruis in #4794
- [feat] Enable NVFP4 output for TRTLLM attention kernels by @Tom-Zheng in #4737
- [https://nvbugs/5271281][fix] fix a pd+mtp accuracy issue by @lfr-0531 in #4536
- fix [nvbug5256044]: bench hang due to llmapi ipc by @Superjomn in #4798
- [nvbugs/5303555] ci: unwaive test_fp8_block_scales_cuda_graph_padding by @Funatiq in #4735
- fix: remove the accuracy assert on run_majority_vote_aime24.py #5340 by @WeiHaocheng in #4784
- feat: add heuristics for checkpoint files prefetching. by @yuxianq in #4765
- tests: [TRTQA-2905] improve timeout report for qa test cases by @crazydemo in #4753
- shorten reqs in con:1 cases and add streaming cases, and add l2 perf … by @ruodil in #4849
- Add pre-merge Triton backend tests by @Tabrizian in #4842
- [Architecture] Refactor FusedMoE by @hlu1 in #4790
- fix: max_num_sequences calculation with overlap scheduling by @Funatiq in #4532
- refactor: Separate DecoderState from GptDecoderBatched by @Funatiq in #4700
- [enhancement] Add beam width to low latency. by @FrankD412 in #4812
- fix: Register MoeLoadBalancerConfig to serialization.py by @syuoni in #4864
- feat: Add integration of etcd by @Shunkangz in #3738
- [nvbug 5294316] fix: Fix queued request stats by @pcastonguay in #4714
- chore: remove request_error ipc in LLM.submit by @Superjomn in #4763
- [Doc] Fix readme for disaggregated serving by @arekay in #4846
- chore: Waive examples/test_mistral.py::test_llm_mistral_v1_1gpu. by @SimengLiu-nv in #4873
- [Arch] Freeze model_config by @hlu1 in #4814
- [TRTLLM-5053] Refactoring and Unifying the Multimodal input preparation by @rakib-hasan in #4506
- feat: update DeepSeek FP8 TRT-LLM Gen cubins by @nekorobov in #4643
- [https://nvbugspro.nvidia.com/bug/5300080] Fix the bug of setting attention_chunk_size and enable chunked-attention in the generation-phase by default by @PerkzZheng in #4693
- [fix] Fix llama 4 long context by @mikeiovine in #4809
- Replace memset with data initialization within kernels by @ChristinaZ in #4851
- Refactor the first token response in PD by @Shunkangz in #4692
- Fix: NVBug 5302895 by @Shixiaowei02 in #4835
- feat: cache reuse support (selective cache transfer) in mla cache formatter by @zhengd-nv in #4749
- feat: Enhance AutoTuner inference path and code readability by @hyukn in #4466
- Chore: refine comments of prepare inputs method of model engine by @QiJune in #4837
- fix: build_config in TorchLlmArgs and avoid invalid args by @Superjomn in #4600
- chore: Mass integration of release/0.20. by @omera-nv in #4871
- [TRTLLM-4923][feat] Paged mamba cache by @tomeras91 in #4822
- chore: bump version to 0.21.0rc1 by @ZhanruiSunCh in #4896
- Fix: draft target README and set exclude_input_in_output to False by @eagle705 in #4882
- fix: correct the order of llm request state by @zhengd-nv in #4781
- fix: trtllm-bench iter_stats and cuda_graph_batch_sizes errors by @qiaoxj07 in #4827
- chore: introduce KvCacheCreator by @ixlmar in #4581
- tests: Update gb200 test case by @yizhang-nv in #4754
- fix: Fix broken vanilla moe since FusedMoE refactor. by @yuxianq in #4897
- fix: LLM invalid arg in a test by @Superjomn in #4922
- [AutoDeploy] deprecate CI post-merge tests and keep them for local testing by @lucaslie in #4892
- [infra] Unwaive unittests/_torch by @mikeiovine in #4919
- [TRTLLM-4647][fix] Fix the no fusion allreduce hanging by @timlee0212 in #4594
- tests: fix 5273697 by @xinhe-nv in #4685
- Waive L0 tests by @yiqingy0 in #4927
- Only pass fast_build=true to non-pytorch backend by @netanel-haber in #4920
- tests: [TRTQA-2906] add benchmark serving tests by @xinhe-nv in #4901
- fix: handle OOMs during KV cache estimation by @ixlmar in #4690
- CI: waive test_llm_get_queued_stats by @QiJune in #4945
- [AutoDeploy] _AutoDeployLlmArgs as primary config object by @lucaslie in #4891
- Revert "[infra] Unwaive unittests/_torch" by @QiJune in #4950
- Revert "fix: build_config in TorchLlmArgs and avoid invalid args" by @QiJune in #4949
- [TRTLLM-5630] restore free_gpu_memory_fraction=0.9 in tests by @ixlmar in #4859
- Add disaggregated unittest by @Shunkangz in #4899
- Waive L0 tests by @yiqingy0 in #4953
- fix a bug of global cuda graph dummy request by @QiJune in #4894
- Fix: fix autodeploy by @QiJune in #4957
- feat : add PositionEmbeddingType=0 to xqa support by @dongjiyingdjy in #4934
- update fmha_v2 by @qsang-nv in #4895
- blog: Scaling Expert Parallelism in TensorRT-LLM (Part 1: Design and Implementation of Large-scale EP) by @kaiyux in #4958
- infra: update jnlp version in container image by @niukuo in #4944
- doc: expose Large-scale EP design and implementation tech blog in the main… by @juney-nvidia in #4960
- Revert "fix a bug of global cuda graph dummy request" by @QiJune in #4970
- doc: refinement based on Julien's feedbacks by @juney-nvidia in #4967
- test: [CI] Add failed cases into waives.txt by @xinhe-nv in #4966
- CI: waive test_llm_multi_node_with_postproc by @QiJune in #4977
- chore: memoize weight shuffle index to speed up weight preproc in moe_backend=TRTLLM by @rosenrodt in #4826
- fix: fix cuda graph padding for spec decoding by @lfr-0531 in #4853
- [feat] Support XQA-based MLA on SM120 by @jinyangyuan-nvidia in #4858
- fix: https://nvbugs/5324248 by @nv-guomingz in #4973
- [TRTLLM-5692][tests] Add speculative decoding test cases on torch flow by @crazydemo in #4940
- chore: Change the type annotations of input_ids and position_ids to int32. by @bobboli in #4632
- chore: set flashinfer to 0.2.5. by @nv-guomingz in #5004
- Resubmit #4894 by @QiJune in #4969
- feat: large-scale EP(part 6: Online EP load balancer integration for GB200 nvfp4) by @dongxuy04 in #4818
- chore: update modelopt to 0.31 by @nv-guomingz in #5003
- [Infra] - Update JNLP container config by @chzblych in #5008
- [nvbug/5280806][fix] Fix 2 model spec decode flow by @mikeiovine in #4807
- chore: Mass integration of release/0.20 by @omera-nv in #4898
- fix: https://nvbugs/5324252 by @nv-guomingz in #4925
- Edits for tech blog 4 by @jdemouth-nvidia in #5006
- feat: add HyperCLOVAX-SEED-Vision support in refactored way by @yechank-nvidia in #4799
- [TRTLLM-4987][feat] Support generation logits in TRTLLMSampler by @amitz-nv in #4819
- feat: Add Mixture of Experts FP8xMXFP4 support by @djns99 in #4750
- ci: unwaive llmapi launch test by @Superjomn in #4991
- Fix buffer count by @chuangz0 in #5007
- Kv cache transfer support duplicate heads by @chuangz0 in #4929
- Waive L0 test by @yiqingy0 in #5024
- chore: Refactor apply_rope. by @bobboli in #4918
- Add customized renormalized moe routing kernel for moe cutlass backend by @ChristinaZ in #4955
- [fix] Fix illegal mem access and possible accuracy loss. Cherry-pick … by @liji-nv in #5017
- [TRTLLM-5589] feat: Integrate TRT-LLM Gen FP8 Batched GEMM with Pytorch workflow kernel autotuner by @DomBrown in #4872
New Contributors
- @Tom-Zheng made their first contribution in #4737
- @arekay made their first contribution in #4846
- @omera-nv made their first contribution in #4871
- @eagle705 made their first contribution in #4882
- @timlee0212 made their first contribution in #4594
- @jdemouth-nvidia made their first contribution in #5006
Full Changelog: v0.21.0rc0...v0.21.0rc1