v0.21.0rc0
Pre-release
Highlights
- Model Support
- Features
- Support for large-scale EP (#4384, #4495, #4615)
- Added chunked attention kernels (#4291, #4394)
- ScaffoldingLLM now supports MCP (#4410)
- Integrated NIXL into the communication layer of the disaggregated service (#3934, #4125)
- Integrated Hopper chunked attention kernels (#4330)
- Enabled TRT backend for Python runtime in disaggregated service (#4243)
- Added FP8 block-scale GEMM support on SM89 (#4481)
- Added Qwen3 FP4 MoE TRTLLM backend for low latency (#4530)
- Introduced sliding-window attention kernels for the generation phase on Blackwell (#4564)
- Added vanilla MoE (#4682)
- Fused QKNorm + RoPE integration (#4611)
- Added Fabric Memory support for KV cache transfer (#4717)
- API
- Bug Fixes
- Resolved Torch compile issue for DeepSeek V3 (#3952)
- Fixed trtllm-llmapi-launch for single-node, single-GPU setups (#4428)
- Removed duplicate tokenization in generation server (#4492)
- Fixed cancel request handling for attentionDP (#4648)
- Fixed disaggregated service hang when MNNVL two-shot AllReduce is enabled (#4678)
- Fixed queued request statistics (#4806)
- Fixed EP load balancer with MTP layer and route offset by EP rank (#4767)
- Resolved accuracy and illegal memory access issues with MTP + attention DP (#4379)
- Benchmark
- Added all_reduce.py benchmark script for testing (#4537)
- Performance
- Infrastructure
- Documentation
- Known Issues
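Several of the attention highlights above (#4291, #4394, #4564) add chunked and sliding-window attention kernels, where each query position attends only to a bounded window of recent keys instead of the whole prefix. As a loose, framework-free sketch of that math (plain NumPy, illustrative only; the function names are mine, and the actual kernels are fused Hopper/Blackwell CUDA implementations):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sliding_window_attention(q, k, v, window):
    """Causal attention where position i attends only to the last
    `window` keys (itself included). q, k, v: (n, d) arrays."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    idx = np.arange(n)
    # mask future positions and positions older than the window
    mask = (idx[None, :] > idx[:, None]) | (idx[None, :] <= idx[:, None] - window)
    scores[mask] = -np.inf
    return softmax(scores, axis=-1) @ v
```

With `window` equal to the sequence length this reduces to ordinary causal attention; shrinking `window` bounds the per-query work and the KV footprint, which is what makes such kernels attractive for long-context generation.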
What's Changed
- Refine doc by @juney-nvidia in #4420
- Refine doc by @juney-nvidia in #4421
- refine doc by @juney-nvidia in #4422
- Remove vila test by @Tabrizian in #4376
- [TRTLLM-4618][feat] Add Nemotron Super 49B FP8 test on RTX6000 Pro (SM120) by @farazkh80 in #4363
- tests: add qa test mentioned in docs by @crazydemo in #4357
- [Infra] - Always push the release images in the post-merge job by @chzblych in #4426
- tests: Add test cases for rcca cases by @crazydemo in #4347
- chore: cleanup perf_evaluator code by @Superjomn in #3833
- feat: Add pp support for hybrid attn/mamba model by @yuxianq in #4358
- fix: wrong argument name `enable_overlap_scheduler` by @kaiyux in #4433
- Update "Roadmap" link under README.md to the issues with Roadmap label by @AdamzNV in #4425
- fix potential issues in allreduce fusion kernel and ut by @yilin-void in #4226
- [TRTLLM-4638] feat(scaffolding): update Reward Controller to PRM specific controller with step split by @dc3671 in #4337
- feat: NIXL interface integration by @Shixiaowei02 in #3934
- Downgrade the logger level for fallback tactic warning. by @hyukn in #4440
- Test: Improve model re-use in C++ DGX tests for CI stability by @DomBrown in #4263
- fix: temp disable the problem test by @Shixiaowei02 in #4445
- Add llama4 disagg accuracy tests by @Tabrizian in #4336
- [https://nvbugs/5123103][fix] Fix torch compile for DeepSeekV3 by @liji-nv in #3952
- [Docs] - Reapply #4220 by @chzblych in #4434
- [TRTLLM-4618][feat] Fix cutlass MoE GEMM fallback failure on FP8 + add e2e test for Mixtral 8x7B FP8 on RTX6000 Pro (SM120) by @farazkh80 in #4335
- [Feat] add chunked-attention kernels on Hopper (for llama4) by @PerkzZheng in #4291
- test(perf): Add some `Llama-3_3-Nemotron-Super-49B-v1` integration-perf-tests (TRT flow, trtllm-bench) by @venkywonka in #4128
- fix: [nvbugs/5287097] Align PP layer distribution between pytorch and TRT flow. by @yuxianq in #4399
- feat: Low Precision Allreduce for PCIe based GPU by @kanghui0204 in #4344
- test: [CI] Add failed cases into waives.txt by @xinhe-nv in #4429
- [TRTLLM-4932] Add CLI accuracy tests for Llama-3.3-70B-Instruct and LLM API BF16 variant by @moraxu in #4362
- test: update test filter in perf test yml file to select cases by gpu name and add cases for RTX 6000 pro by @ruodil in #4282
- [AutoDeploy] HF factory improvements by @lucaslie in #4371
- chore: bump version to 0.21.0rc0 by @ZhanruiSunCh in #4465
- doc: [TRTLLM-325]Integrate the NGC image in Makefile automation and document by @MartinMarciniszyn in #4400
- infra: Add qwen3 235B tests into QA by @byshiue in #4483
- feat: large-scale EP(part 2: MoE Load Balancer - core utilities) by @dongxuy04 in #4384
- [TRTLLM-5085][fix] Nemotron H correctness test by @tomeras91 in #4444
- [Docs] - Add date and commit info by @chzblych in #4448
- fix[nvbug/5286515]: trtllm-llmapi-launch on single node single gpu by @Superjomn in #4428
- fix: replace the image links in the blog by @Shixiaowei02 in #4489
- fix: Fix TRTLLMSampler beam width bug. by @dcampora in #4473
- refactor: Unify request order in TRT and PyTorch workflow by @Funatiq in #4096
- [TRTLLM-5273]feat/Use full attention mask if Llama3 is used as encoder and fix EarlyStopDecoder unsqueeze bug by @nvrohanv in #4290
- Build Triton for arm by @Tabrizian in #4456
- test: [CI] remove closed bugs by @xinhe-nv in #4417
- test(perf): Add remaining `Phi-4-mini-instruct` perf tests by @venkywonka in #4443
- feat: conditional disaggregation in disagg server by @zhengd-nv in #3974
- perf: Fuse gemm setup function for SM90/SM100 MOE plugin path by @djns99 in #4146
- fix: skip weights defined in create_weights for pp. by @yuxianq in #4447
- Feat: add chunked-attention kernels on Blackwell by @PerkzZheng in #4394
- fix [nvbug/5220766]: llmapi-launch: add trtllm-bench test with engine building by @Superjomn in #4091
- [TRTLLM-5000][feat] Pytorch implementation of ngram drafter by @thorjohnsen in #3936
- test: NIXL single process test by @Shixiaowei02 in #4486
- Chore: waive torch compile test cases of deepseek v3 lite by @QiJune in #4508
- Feat: add deep_gemm swapab Kernel by @ruoqianguo in #4430
- unwaive some disagg test by @chuangz0 in #4476
- Clean: fmha codes by @PerkzZheng in #4496
- tests: add llama 3.3 70b 2 nodes tests by @xinhe-nv in #4391
- CI: waive test_fp8_block_scales_4gpus of deepseek v3 lite by @QiJune in #4520
- test: remove enable_overlap_scheduler in pytorch config and set enable_chunked_prefill to true for isl>2048 cases by @ruodil in #4285
- docs: update the introduction for scaffolding by @WeiHaocheng in #4360
- test: add failed case in waive list and fix some test script issue for perf test by @ruodil in #4527
- tests: add qwen fp4 tests into QA test list & update sanity test list by @xinhe-nv in #4478
- feat: large-scale EP(part 3: refactor - FusedMoe for redundant expert) by @dongxuy04 in #4495
- refactor: DisaggExecutorTest by @Funatiq in #4398
- chore: clean ucx and nixl mirror. by @nv-guomingz in #4531
- Add pytorch backend team by @kevinch-nv in #4405
- test(perf): Pt.2 Add `Llama-3_3-Nemotron-Super-49B-v1` integration-perf-tests (cpp) by @venkywonka in #4499
- Adding two-shot allreduce kernel and mnnvl multicasting buffer by @zongfeijing in #4216
- test: Split test_simple into mpi_utils and cache transceiver tests for DGX by @DomBrown in #4451
- fix: TRT-LLM Gen dtype declaration by @nekorobov in #4503
- chore: remove extra PYTHONPATH by @achartier in #4453
- Agent interface impl for NIXL by @chuangz0 in #4125
- chore: Partition LlmArgs into TorchLlmArgs and TrtLlmArgs by @Superjomn in #3823
- [TRTLLM-4932] Add CLI accuracy tests for Phi-4-mini-instruct by @moraxu in #4415
- chore: Add all_reduce.py benchmark script to test by @kaiyux in #4537
- feat: add dataset support for benchmark_core_model with LLMAPI by @achartier in #4457
- fix[nvbug-5228840]: Remove test cases of feature not supported anymore by @HuiGao-NV in #3972
- feat: add health_generate route to openai serving (Cherry-pick #3856) by @kaiyux in #4349
- Add tritonrelease container by @Tabrizian in #4455
- cache_transceiver_config by @chuangz0 in #4556
- test: waive hanging cases for perf test by @ruodil in #4562
- test: [CI] Add failed cases into waives.txt by @xinhe-nv in #4549
- Chore: clean up _merge_dummy_request method of PyExecutor by @QiJune in #4438
- fix sequence data race by @chuangz0 in #4565
- fix: Move cv2 import to load_video function by @Funatiq in #4541
- test(perf): Add `Llama-3_1-Nemotron-Ultra-253B-v1` perf tests (cpp) by @venkywonka in #4446
- [nvbug/5285881][fix] Fix chunked prefill + overlap scheduler by @mikeiovine in #4402
- [feat] Integrate Hopper chunked attention kernels by @mikeiovine in #4330
- chore: clean useless flag by @nv-guomingz in #4567
- Chore: clean up _gather_dp_requests_num method of PyExecutor by @QiJune in #4571
- fix[nvbug-5295425]: [TRTLLM-5385] fix race condition in MoeLoadBalancer by @dongxuy04 in #4573
- Scaffoldingllm supports MCP by @wu1du2 in #4410
- [feat][TRTLLM-5018] Dis serving python runtime trt backend by @pcastonguay in #4243
- chore: clean-up for header file. by @nv-guomingz in #4540
- [https://nvbugspro.nvidia.com/bug/5181262] [test] Unwaive Mistral Nemo test by @syuoni in #4515
- [feat] support fp8 blockscale gemm on sm89 by @CarstyYou in #4481
- fix: Fix moe_ep_groups/moe_cluster_groups in Mapping. by @yuxianq in #4555
- [https://nvbugs/5297775] fix: Correct memory guard for large MOE tests to account for TP space by @djns99 in #4553
- fix: [nvbugs/5066257] serialization improvements by @coldwaterq in #3869
- [Fix][Qwen3] fix bug of qwen3 fp4 workflow with EP by @byshiue in #4575
- [doc]: add mtp tech blog by @lfr-0531 in #4580
- chore: fix bug of llama lora test by @byshiue in #4566
- perf: Add fused q_norm/k_norm/RoPE for Qwen3. by @bobboli in #4482
- Waive L0 test by @yiqingy0 in #4609
- Update the GH main page to expose tech blogs by @juney-nvidia in #4610
- Qwen3 supports TRTLLM FP4 MoE backend by @rosenrodt in #4530
- [TRTLLM-5070][feat] Support FP8 KV Cache Reuse for MLA by @zhhuang-nv in #4535
- [nvbugs/5301492] ci: waive test_workers_kv_cache_aware_router by @Funatiq in #4617
- Update CODEOWNERS for PyTorch backend - runtime component by @juney-nvidia in #4620
- [nvbug/5028235][fix] pytest bindings tokens logits comparison. by @dominicshanshan in #4424
- refactor: CreateNewDecoderRequests by @Funatiq in #4452
- fix: rename some terms by @lowsfer in #4534
- Fix invalid testcase name by @chzblych in #4626
- fix: datatype check in the cache transmission by @chuangz0 in #4606
- [Fix][Deepseek] Fix bugs in TestDeepSeekR1 by @hlu1 in #4413
- [TRTLLM-5327] - Add scan stage by @yiqingy0 in #4602
- [#4633][doc] Fixed typo in scaffolding README.md by @amemov in #4634
- Update main README.md with the LLaMA4 perf news by @juney-nvidia in #4636
- Fix snake case format by @shaharmor98 in #4559
- fix: Update approved list to fix pipeline tests after rebasing by @yibinl-nvidia in #4640
- Feat: add sliding-window-attention generation-phase kernels on Blackwell by @PerkzZheng in #4564
- feat: Skip sampler for intermediate pp stages. by @yuxianq in #4514
- Waive L0 tests by @yiqingy0 in #4645
- Chore: refine shutdown signal of PyExecutor by @QiJune in #4614
- chore: sort llm request state enums in chronological order by @zhengd-nv in #4607
- [TRTLLM-4535][infra]: Add marker TIMEOUT for test level by @EmmaQiaoCh in #3905
- fix: Handle additional model outputs based on pipeline parallel rank by @Funatiq in #4498
- [TRTLLM-5327] - Fix guardwords scan step by @yiqingy0 in #4654
- fix: Remove duplicate tokenization in generation server by @Shunkangz in #4492
- [nvbugs/5274894] fix: Sort requests for functional correctness and performance (adapted from #4608) by @Funatiq in #4621
- Chore: introduce RequestQueueItem class instead of using tuple by @QiJune in #4649
- feat: large-scale EP(part 4: Static EP load balancer integration) by @syuoni in #4615
- Add files into scan ignoreList by @yiqingy0 in #4663
- [Infra] - Multi-GPU testing support with Slurm by @yuanjingx87 in #4454
- fix disagg config params by @chuangz0 in #4646
- [Test] - Waive RTX Pro 6000 Slurm testing by @chzblych in #4672
- fix fmha v2 tests by @qsang-nv in #4661
- test: rcca https://nvbugs/5223130 by @xinhe-nv in #4510
- [NVBUG 5301980] Fix fp4 gemm padding. by @Tracin in #4662
- [Test] - Correctly waive the Slurm test stage by @chzblych in #4677
- Chore: only pad one dummy request for attention dp scenario by @QiJune in #4664
- Waive L0 tests by @yiqingy0 in #4686
- feat: better build_wheel.py venv handling by @tongyuantongyu in #4525
- [Infra][TRTLLM-3929] Rerun failure tests by @yiqingy0 in #3264
- [AutoDeploy] Increased Model Coverage Mass Migration Week 1 by @lucaslie in #4468
- fix: fmha_v2 compilation by @PerkzZheng in #4659
- test: [CI] remove closed bugs by @xinhe-nv in #4638
- refactor: extract and reuse filter_weights. by @yuxianq in #4681
- fix: fix dsr1 min lat cga ar rate drop(0.2) by @yunruis in #4561
- Update the description for NGC docker images (#4671) by @MartinMarciniszyn in #4702
- feat: Add vanilla MOE. by @yuxianq in #4682
- Fix handle cancel request for attentionDP by @Shunkangz in #4648
- feat: Integration of Fused QKNorm+RoPE. by @bobboli in #4611
- [TRTLLM-1658][feat] Enable multiple response in trtllm-serve for TRT backend by @LinPoly in #4623
- doc: Document the docker release image on NGC by @MartinMarciniszyn in #4705
- Fix: hang on disagg when MNNVL two-shot AllReduce is enabled by @kaiyux in #4678
- Mass-integration 0.20 to main by @amirkl94 in #4577
- Add missing serialization classes by @Tabrizian in #4642
- Fix rerun step by @yiqingy0 in #4715
- feat: forward exceptions to Python and catch OOMs by @ixlmar in #4497
- chore [BREAKING CHANGE]: Flatten PyTorchConfig knobs into TorchLlmArgs by @Superjomn in #4603
- chore: remove extra paths to find binaries by @achartier in #4706
- test: [CI] Add failed cases into waives.txt by @xinhe-nv in #4688
- tests: [https://nvbugspro.nvidia.com/bug/5289908] run maverick bf16 on blackwell by @crazydemo in #4722
- chore: Clean up cpp runtime by @Funatiq in #4449
- chore: add -f to pkill calls by @achartier in #4711
- feat: support packed weights in vanilla moe by @yuxianq in #4719
- chore: [nvbug_5273941] unwaive test_llm_loading_from_ckpt_for_tp2 by @hchings in #4725
- feature: KV Cache GPUDirect Storage by @arthurrasmusson in #3209
- [fix] add back rtx6000pro tests by @yuanjingx87 in #4679
- chore: rename ExecutorBindingsWorker/Proxy by @Superjomn in #4716
- Waive L0 test by @yiqingy0 in #4748
- CI: move post-merge multi GPU test of PyTorch backend to H200 by @QiJune in #4733
- infra: [TRTLLM-5247][TRTLLM-5248][TRTLLM-5249] Refactor docker build image groovy and support NGC images by @ZhanruiSunCh in #4294
- test: remove perf test l40s/l20 oom test cases and unwaive tests by @ruodil in #4755
- fix: test trtllm-bench mgmn by @Superjomn in #4613
- [feat] add b200 support via slurm by @yuanjingx87 in #4709
- Chore: fuse _merge_requests method into _fetch_new_requests method by @QiJune in #4689
- [fix] Eagle-2 LLMAPI pybind argument fix. by @jhaotingc in #3967
- [feat] Support RULER + chunked prefill in lm-eval-harness by @mikeiovine in #4592
- refactor: unique_ptr instead of shared_ptr by @Funatiq in #4697
- Cherry pick feat/llama4 to main by @nv-yilinf in #4739
- [Architecture] Redesign Linear module by @hlu1 in #4721
- [perf] Reduce the workspace size of FP4 activation scales for MoE by @jinyangyuan-nvidia in #4303
- Added code owners for AutoDeploy by @juney-nvidia in #4769
- chore: fix llm_root when LLM_ROOT is not set by @achartier in #4741
- [JIRA-5226219][fix] Fix Bug in KV cache manager by @thorjohnsen in #4596
- test: skip test_llm_hf_gemma_quantization_1gpu_vswa on A100 by @xinhe-nv in #4779
- test: Waive test_llm_loading_from_ckpt_for_tp2 by @syuoni in #4797
- Fabric Memory for KV Cache Transfer by @chuangz0 in #4717
- fix: random fail of cache router test by @zhengd-nv in #4597
- feat: estimate GPU mem. usage w/ minimal KV cache by @ixlmar in #4574
- fix: iteration logging and typing in PyExecutor by @ixlmar in #4734
- [TRTLLM-5516] perf: replicate dummy request for cuda graph padding by @QiJune in #4729
- [feat] support sharegpt downloading in benchmark_serving by @LinPoly in #4578
- fix: [nvbugs/5310520] disable embed_tokens's TP when DP enabled for llama model. by @yuxianq in #4758
- DeepSeek R1 throughput optimization tech blog for Blackwell GPUs by @litaotju in #4791
- Expose new tech blog about DSR1 throughput optimization to the main R… by @juney-nvidia in #4803
- [fix] Fix Llama 3.3 70b EAGLE by @mikeiovine in #4772
- [Infra]Remove some old keyword by @EmmaQiaoCh in #4552
- opt: the performance for disagg streaming generation by @Superjomn in #4214
- fix: re-enable tp/pp for quickstart_advanced.py. by @yuxianq in #4766
- [nvbug 5305210] Resolve nvbug 5305210 by @DomBrown in #4759
- fix: large-scale EP - EP load balancer with MTP layer and route offset by EP rank by @syuoni in #4767
- [TRTLLM-4987][feat] Support context logits in TRTLLMSampler by @dcampora in #4538
- [fix] Fix SamplingParams check on n and best_of by @syuoni in #4655
- Check test names in waive list by @EmmaQiaoCh in #4292
- [AutoDeploy] Increased Model Coverage Mass Migration Week 2 by @lucaslie in #4817
- CI: Performance regression tests update by @amirkl94 in #3531
- [TRTLLM-4783][feat] Mamba2 kernel updates for Nemotron-H by @tomeras91 in #4494
- 'entered copyBlock' format string expects %s, pass string rather than int by @netanel-haber in #4820
- fix: fix accuracy and illegal memory access issues when using mtp + attention dp by @lfr-0531 in #4379
- feat: large-scale EP(part 5: Static EP load balancer with offline statistics) by @syuoni in #4695
- [fix] Fix llama4 min-latency mode by @nv-yilinf in #4810
- [Infra] - Minor clean-up and test Ubuntu mirrors by @chzblych in #4829
- fix: [https://nvbugspro.nvidia.com/bug/5273945] Unwaive tests for bug-5273945 by @lfr-0531 in #4832
- [fix] Fix Llama4 guardwords failures by @nv-yilinf in #4844
- [TRTLLM-5502][infra] Add github action to identify if PR is from community by @poweiw in #4824
New Contributors
- @AdamzNV made their first contribution in #4425
- @nvrohanv made their first contribution in #4290
- @thorjohnsen made their first contribution in #3936
- @ruoqianguo made their first contribution in #4430
- @wu1du2 made their first contribution in #4410
- @CarstyYou made their first contribution in #4481
- @coldwaterq made their first contribution in #3869
- @rosenrodt made their first contribution in #4530
- @amemov made their first contribution in #4634
- @arthurrasmusson made their first contribution in #3209
- @jhaotingc made their first contribution in #3967
- @nv-yilinf made their first contribution in #4739
- @poweiw made their first contribution in #4824
Full Changelog: v0.20.0rc3...v0.21.0rc0