v0.21.0rc1
Pre-release
Highlights
- Model Support
- Add HyperCLOVAX-SEED-Vision support for PyTorch flow (#4799)
- Features
- Support generation logits in the TRTLLM Sampler (#4819); a usage sketch follows the Highlights list
- Support for large-scale EP (#4818); a configuration sketch follows the Highlights list
- Support XQA-based MLA on SM120 (#4858)
- Add PositionEmbeddingType=0 to xqa support (#4934)
- Add cache reuse support (selective cache transfer) in mla cache formatter (#4749)
- Update DeepSeek FP8 TRT-LLM Gen cubins (#4643)
- Add heuristics for checkpoint files prefetching (#4765)
- Enable NVFP4 output for TRTLLM attention kernels (#4737)
- Refactor Fused MoE (#4790)
- Add integration of etcd (#3738)
- Memoize weight shuffle index to speed up weight preproc in moe_backend=TRTLLM (#4826)
- Enable disaggregated serving for Qwen3 (#4929)
- API
- Set _AutoDeployLlmArgs as the primary config object (#4891)
- Bug Fixes
- Fix warmup phase batch size out of range (#4986)
- Fix buffer count (#5007)
- Fix nvbug 5324252: broken test_resource_manager.py (#4925)
- Fix nvbug 5280806: two-model spec decode flow (#4807)
- Fix nvbug 5324248: broken test_pytorch_model_engine.py (#4973)
- Fix cuda graph padding for spec decoding (#4853)
- Correct the order of llm request state (#4781)
- Handle OOMs during KV cache estimation (#4690)
- Only pass fast_build=true to non-pytorch backend (#4920)
- Fix the no-fusion allreduce hang (#4594)
- Deprecate AutoDeploy CI post-merge tests and keep them for local testing (#4892)
- Fix nvbug 5302895: test_trtllm_bench_llmapi_launch failure (#4835)
- Fix llama 4 long context issue (#4809)
- Fix nvbug 5300080: the bug of setting attention_chunk_size, and enable chunked attention in the generation phase by default (#4693)
- Fix nvbug 5294316: queued request stats (#4714)
- Fix max_num_sequences calculation with overlap scheduling (#4532)
- Fix trtllm-bench hang issue due to LLM API IPC (#4798)
- Fix a pd+mtp accuracy issue (#4536)
- Benchmark
- Performance
- Infrastructure
- Documentation
- Known Issues
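For the generation-logits support in the TRTLLM Sampler called out under Features (#4819), the sketch below shows one way to request logits through the LLM API. It is a minimal sketch, not taken from the PR itself: the model name is illustrative, and the gather_generation_logits / return_generation_logits knobs are assumed to behave as in earlier LLM API releases, so verify them against your installed version.

```python
# Minimal sketch (assumptions noted above): request per-token generation logits
# through the LLM API so the sampler returns them alongside the generated text.
from tensorrt_llm import LLM, SamplingParams

# Illustrative checkpoint; any model supported by the PyTorch flow should work.
llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    gather_generation_logits=True,  # assumed to be needed so logits are kept
)

params = SamplingParams(
    max_tokens=16,
    return_generation_logits=True,  # ask the sampler to hand logits back
)

for request_output in llm.generate(["The capital of France is"], params):
    completion = request_output.outputs[0]
    print(completion.text)
    # Expected shape: [num_generated_tokens, vocab_size]; None if not gathered.
    print(getattr(completion, "generation_logits", None))
```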
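Similarly, for the large-scale EP support (#4818), here is a hedged configuration sketch. The parallelism argument names (tensor_parallel_size, moe_expert_parallel_size, enable_attention_dp) are assumptions carried over from the existing LLM API rather than anything introduced by the PR, and the online EP load balancer has its own MoeLoadBalancerConfig that is not shown here.

```python
# Hedged sketch: run a large MoE checkpoint with expert parallelism via the
# LLM API. Parallelism argument names are assumptions (see the note above).
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1",  # illustrative large-MoE checkpoint
    tensor_parallel_size=8,
    moe_expert_parallel_size=8,       # shard experts across ranks (EP)
    enable_attention_dp=True,         # data-parallel attention alongside EP
)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=8))
print(outputs[0].outputs[0].text)
```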
What's Changed
- upgrade cutlass to 4.0 by @yunruis in #4794
- [feat] Enable NVFP4 output for TRTLLM attention kernels by @Tom-Zheng in #4737
- [https://nvbugs/5271281][fix] fix a pd+mtp accuracy issue by @lfr-0531 in #4536
- fix [nvbug5256044]: bench hang due to llmapi ipc by @Superjomn in #4798
- [nvbugs/5303555] ci: unwaive test_fp8_block_scales_cuda_graph_padding by @Funatiq in #4735
- fix: remove the accuracy assert on run_majority_vote_aime24.py #5340 by @WeiHaocheng in #4784
- feat: add heuristics for checkpoint files prefetching. by @yuxianq in #4765
- tests: [TRTQA-2905] improve timeout report for qa test cases by @crazydemo in #4753
- shorten reqs in con:1 cases and add streaming cases, and add l2 perf … by @ruodil in #4849
- Add pre-merge Triton backend tests by @Tabrizian in #4842
- [Architecture] Refactor FusedMoE by @hlu1 in #4790
- fix: max_num_sequences calculation with overlap scheduling by @Funatiq in #4532
- refactor: Separate DecoderState from GptDecoderBatched by @Funatiq in #4700
- [enhancement] Add beam width to low latency. by @FrankD412 in #4812
- fix: Register MoeLoadBalancerConfig to serialization.py by @syuoni in #4864
- feat: Add integration of etcd by @Shunkangz in #3738
- [nvbug 5294316] fix: Fix queued request stats by @pcastonguay in #4714
- chore: remove request_error ipc in LLM.submit by @Superjomn in #4763
- [Doc] Fix readme for disaggregated serving by @arekay in #4846
- chore: Waive examples/test_mistral.py::test_llm_mistral_v1_1gpu. by @SimengLiu-nv in #4873
- [Arch] Freeze model_config by @hlu1 in #4814
- [TRTLLM-5053] Refactoring and Unifying the Multimodal input preparation by @rakib-hasan in #4506
- feat: update DeepSeek FP8 TRT-LLM Gen cubins by @nekorobov in #4643
- [https://nvbugspro.nvidia.com/bug/5300080] Fix the bug of setting attention_chunk_size and enable chunked-attention in the generation-phase by default by @PerkzZheng in #4693
- [fix] Fix llama 4 long context by @mikeiovine in #4809
- Replace memset with data initialization within kernels by @ChristinaZ in #4851
- Refactor the first token response in PD by @Shunkangz in #4692
- Fix: NVBug 5302895 by @Shixiaowei02 in #4835
- feat: cache reuse support (selective cache transfer) in mla cache formatter by @zhengd-nv in #4749
- feat: Enhance AutoTuner inference path and code readability by @hyukn in #4466
- Chore: refine comments of prepare inputs method of model engine by @QiJune in #4837
- fix: build_config in TorchLlmArgs and avoid invalid args by @Superjomn in #4600
- chore: Mass integration of release/0.20. by @omera-nv in #4871
- [TRTLLM-4923][feat] Paged mamba cache by @tomeras91 in #4822
- chore: bump version to 0.21.0rc1 by @ZhanruiSunCh in #4896
- Fix: draft target README and set exclude_input_in_output to False by @eagle705 in #4882
- fix: correct the order of llm request state by @zhengd-nv in #4781
- fix: trtllm-bench iter_stats and cuda_graph_batch_sizes errors by @qiaoxj07 in #4827
- chore: introduce KvCacheCreator by @ixlmar in #4581
- tests: Update gb200 test case by @yizhang-nv in #4754
- fix: Fix broken vanilla moe since FusedMoE refactor. by @yuxianq in #4897
- fix: LLM invalid arg in a test by @Superjomn in #4922
- [AutoDeploy] deprecate CI post-merge tests and keep them for local testing by @lucaslie in #4892
- [infra] Unwaive unittests/_torch by @mikeiovine in #4919
- [TRTLLM-4647][fix] Fix the no fusion allreduce hanging by @timlee0212 in #4594
- tests: fix 5273697 by @xinhe-nv in #4685
- Waive L0 tests by @yiqingy0 in #4927
- Only pass fast_build=true to non-pytorch backend by @netanel-haber in #4920
- tests: [TRTQA-2906] add benchmark serving tests by @xinhe-nv in #4901
- fix: handle OOMs during KV cache estimation by @ixlmar in #4690
- CI: waive test_llm_get_queued_stats by @QiJune in #4945
- [AutoDeploy] _AutoDeployLlmArgs as primary config object by @lucaslie in #4891
- Revert "[infra] Unwaive unittests/_torch" by @QiJune in #4950
- Revert "fix: build_config in TorchLlmArgs and avoid invalid args" by @QiJune in #4949
- [TRTLLM-5630] restore free_gpu_memory_fraction=0.9 in tests by @ixlmar in #4859
- Add disaggregated unittest by @Shunkangz in #4899
- Waive L0 tests by @yiqingy0 in #4953
- fix a bug of global cuda graph dummy request by @QiJune in #4894
- Fix: fix autodeploy by @QiJune in #4957
- feat : add PositionEmbeddingType=0 to xqa support by @dongjiyingdjy in #4934
- update fmha_v2 by @qsang-nv in #4895
- blog: Scaling Expert Parallelism in TensorRT-LLM (Part 1: Design and Implementation of Large-scale EP) by @kaiyux in #4958
- infra: update jnlp version in container image by @niukuo in #4944
- doc: expose Large-scale EP design and implementation tech blog in the main… by @juney-nvidia in #4960
- Revert "fix a bug of global cuda graph dummy request" by @QiJune in #4970
- doc: refinement based on Julien's feedbacks by @juney-nvidia in #4967
- test: [CI] Add failed cases into waives.txt by @xinhe-nv in #4966
- CI: waive test_llm_multi_node_with_postproc by @QiJune in #4977
- chore: memoize weight shuffle index to speed up weight preproc in moe_backend=TRTLLM by @rosenrodt in #4826
- fix: fix cuda graph padding for spec decoding by @lfr-0531 in #4853
- [feat] Support XQA-based MLA on SM120 by @jinyangyuan-nvidia in #4858
- fix: https://nvbugs/5324248 by @nv-guomingz in #4973
- [TRTLLM-5692][tests] Add speculative decoding test cases on torch flow by @crazydemo in #4940
- chore: Change the type annotations of input_ids and position_ids to int32. by @bobboli in #4632
- chore: set flashinfer to 0.2.5. by @nv-guomingz in #5004
- Resubmit #4894 by @QiJune in #4969
- feat: large-scale EP(part 6: Online EP load balancer integration for GB200 nvfp4) by @dongxuy04 in #4818
- chore: update modelopt to 0.31 by @nv-guomingz in #5003
- [Infra] - Update JNLP container config by @chzblych in #5008
- [nvbug/5280806][fix] Fix 2 model spec decode flow by @mikeiovine in #4807
- chore: Mass integration of release/0.20 by @omera-nv in #4898
- fix: https://nvbugs/5324252 by @nv-guomingz in #4925
- Edits for tech blog 4 by @jdemouth-nvidia in #5006
- feat: add HyperCLOVAX-SEED-Vision support in refactored way by @yechank-nvidia in #4799
- [TRTLLM-4987][feat] Support generation logits in TRTLLMSampler by @amitz-nv in #4819
- feat: Add Mixture of Experts FP8xMXFP4 support by @djns99 in #4750
- ci: unwaive llmapi launch test by @Superjomn in #4991
- Fix buffer count by @chuangz0 in #5007
- Kv cache transfer support duplicate heads by @chuangz0 in #4929
- Waive L0 test by @yiqingy0 in #5024
- chore: Refactor apply_rope. by @bobboli in #4918
- Add customized renormalized moe routing kernel for moe cutlass backend by @ChristinaZ in #4955
- [fix] Fix illegal mem access and possible accuracy loss. Cherry-pick … by @liji-nv in #5017
- [TRTLLM-5589] feat: Integrate TRT-LLM Gen FP8 Batched GEMM with Pytorch workflow kernel autotuner by @DomBrown in #4872
New Contributors
- @Tom-Zheng made their first contribution in #4737
- @arekay made their first contribution in #4846
- @omera-nv made their first contribution in #4871
- @eagle705 made their first contribution in #4882
- @timlee0212 made their first contribution in #4594
- @jdemouth-nvidia made their first contribution in #5006
Full Changelog: v0.21.0rc0...v0.21.0rc1