v0.20.0rc2
Pre-release
Highlights
- Model Support
  - Added support for Qwen3 (#4010)
- Features
  - Integrated Llama4 input processor (#3383)
  - Added CGA reduction FMHA kernels on Blackwell (#3763)
  - Implemented `LogitsProcessor` in the PyTorch backend (#3145); see the usage sketch after this section
  - Unfused attention for native support (#3668)
  - Added `group_rms_norm` kernel to normalize multiple inputs in a single operator (#3438)
  - Supported multiple LoRA adapters and TP (#3885)
- API
- Bug Fixes
  - Fixed a bug where a CUDA stream created as a default parameter was initialized at import time (#3764)
- Benchmark
- Performance
- Infra
  - Open-sourced XQA kernels (#3762)
- Documentation
- Known Issues
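
As a quick illustration of the Qwen3 and PyTorch-backend `LogitsProcessor` highlights above, here is a minimal sketch built on the high-level `tensorrt_llm.LLM` API. The checkpoint name, the `logits_processor` parameter on `SamplingParams`, and the processor call signature are assumptions for illustration rather than the confirmed 0.20.0rc2 interface; consult the LLM API documentation for the exact contract.

```python
# Minimal sketch (assumed API surface, not a confirmed example):
# load a Qwen3 checkpoint with the PyTorch backend and attach a custom
# logits processor that biases a single token id at every decoding step.
from tensorrt_llm import LLM, SamplingParams


class BiasTokenProcessor:
    """Hypothetical logits processor: adds a constant bias to one token id."""

    def __init__(self, token_id: int, bias: float = 5.0):
        self.token_id = token_id
        self.bias = bias

    def __call__(self, req_id, logits, token_ids, stream_ptr=None, client_id=None):
        # Assumed call signature; the real LogitsProcessor interface may differ.
        logits[..., self.token_id] += self.bias
        return logits


if __name__ == "__main__":
    llm = LLM(model="Qwen/Qwen3-8B")  # hypothetical Hugging Face model id
    params = SamplingParams(
        max_tokens=32,
        logits_processor=BiasTokenProcessor(token_id=0),  # assumed parameter name
    )
    for output in llm.generate(["The capital of France is"], params):
        print(output.outputs[0].text)
```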
What's Changed
- feat: llama4 multimodal input processor by @milesial in #3383
- fix: [nvbug/5234873] Detect pmix and raise error when mpirun is not used. by @yuxianq in #3858
- fix: fix bug of deepseek group_size setting by @byshiue in #3860
- Infra: Remove empty junit xml by @EmmaQiaoCh in #3794
- fix: Update num_of_ctx_tokens in iteration stats by @HuiGao-NV in #3785
- cacheTransceiver buffer manager by @chuangz0 in #3798
- fix: add warmup flag into py_executor to prevent enable profiler during wa… by @byshiue in #3852
- fix: trtllm-bench build trt engine on slurm by @Superjomn in #3825
- infra: install Triton in the base image by @Tabrizian in #3759
- fix bug of create cuda stream as default parameter which will be init… by @byshiue in #3764
- Test: waive intermittent test hang by @chzblych in #3894
- [TRTLLM-4786] infra: add scaffolding paths to pytorch only files by @dc3671 in #3835
- test: [CI] Add failed cases into waives.txt by @xinhe-nv in #3887
- test: [CI] Add failed cases into waives.txt by @xinhe-nv in #3867
- Fix the link of doc by @litaotju in #3903
- [TRTLLM-4638 ][feat] add best of n support with reward model in scaffolding by @dc3671 in #3807
- Add docs about DeepSeek-R1 long context support. by @qiaoxj07 in #3910
- [https://nvbugs/5247300] fix(requirements): fix neither 'setup.py' nor 'pyproject.toml' found by @dc3671 in #3906
- chore: Make llama4 MoE use maybe_execute_in_parallel by @mikeiovine in #3779
- fix: Fixing minor typo in allreduce kernel selection by @hyukn in #3912
- test: add deepseek v3 & r1 cases by @VALLIS-NERIA in #3528
- [fix] Fix a few issues with EAGLE3 in PyTorch backend by @mikeiovine in #3686
- waive test_attention_no_cache by @hchings in #3921
- fix: Fix FMHA-based MLA in the generation phase and add MLA unit test by @jinyangyuan-nvidia in #3863
- chore: remove DummyKvCacheManager. by @yuxianq in #3896
- fix(test): remove random context seq lengths and set random seed by @qixiang-99 in #3919
- feat: fix errors on scaffolding README by @WeiHaocheng in #3899
- fix: [https://nvbugspro.nvidia.com/bug/5242406][fix] Fix fp8 kvcache support by @hlu1 in #3877
- feat: add CGA reduction fmha kernels on Blackwell. by @PerkzZheng in #3763
- [CI] increase H100 CI nodes for PyTorch only pipelines by @QiJune in #3927
- [TRTLLM-4883][fix]: Update output speed calculation. by @FrankD412 in #3923
- chore: add num_scheduled_requests into print_log by @byshiue in #3914
- fix: revert #3858 by @yuxianq in #3928
- chore: change log level of some text from info to debug by @byshiue in #3930
- [fix] optimize cudaMemGetInfo for TllmGenFmhaRunner by @zhhuang-nv in #3907
- chore: Mass integration of release/0.19 into main by @DomBrown in #3841
- feat: parallel q_b_proj and concat by @hello-11 in #3917
- refactor: (part1) Add constraints doc for fusedMoe module. by @HuiGao-NV in #3882
- fix: get head_dim from model’s config. by @yuxianq in #3916
- TRTLLM-4624 feat: Add nvfp4 gemm and moe support for SM120 by @VALLIS-NERIA in #3770
- [feat] support ModelOpt NemotronH FP8 quantized checkpoints in TRTLLM pytorch flow by @tomeras91 in #3891
- fix: change the seq_lens sync copy to an async one by @lfr-0531 in #3786
- [https://nvbugs/5178445][fix] Skip blackwell tests for sm120 by @pamelap-nvidia in #3815
- chore: skip pipeline parallelism test of pytorch flow by @QiJune in #3947
- [TRTLLM-4623][fix] sync internal cutlass kernel changes by @pamelap-nvidia in #3968
- chore: update multi-gpu trigger file list by @QiJune in #3971
- test: [CI] remove closed bugs by @xinhe-nv in #3890
- chore: Remove duplicated get_sm_version. by @yuxianq in #3935
- chore: bump version to 0.20.0rc2 by @ZhanruiSunCh in #3949
- perf: Optimise MOE prologue to use fused setup function by @djns99 in #3790
- chore: remove release branch codeowners from main by @tburt-nv in #3954
- fix: [https://nvbugspro.nvidia.com/bug/5243482] If FlashMLA is used, the existence of FMHA based MLA kernels should not be checked. by @bobboli in #3862
- unwaive disagg tests by @chuangz0 in #3925
- infra: open source XQA kernels by @ming-wei in #3762
- feat: Mistral-Large-2 support in the Pytorch workflow by @hypdeb in #3845
- chore: update internal_cutlass_kernels. by @nv-guomingz in #3973
- [fix] Pad requests to maximum draft length in spec decode by @mikeiovine in #3957
- infra: add conan by @tburt-nv in #3744
- waive test_tinyllama_guided_decoding by @hchings in #3997
- [TRTLLM-4460] test: Use Llama 3.2 1B for Llama C++ tests by @DomBrown in #3206
- refactor: Clean up allreduce module for Deepseek V3 model by @hyukn in #3829
- [feat]: Allow for a settable end-of-sequence/padding token in max throughput benchmark. by @FrankD412 in #3776
- feat: Add multimodal embedding field in LlmRequest by @katec846 in #3855
- Llama4 processor fixes by @milesial in #3994
- fix: Add attention workspace memory check by @hlu1 in #3970
- feat: add relaxed acceptance for DS by @yweng0828 in #3865
- fix:https://nvbugs/5246733 by @nv-guomingz in #3989
- model: support Qwen3 by @byshiue in #4010
- test: [CI] Add failed cases into waives.txt by @xinhe-nv in #3943
- feat: Support Top-K logprobs and prompt_logprobs in LLMAPI by @hchings in #3388 (a logprobs usage sketch follows this list)
- [AutoDeploy] Make all ranks agree on kv-cache size by @suyoggupta in #4007
- feat: LogitsProcessor in PyTorch backend by @hchings in #3145
- fix: Fallback to NCCL for various patterns when input size is large. by @hyukn in #4009
- feat: [AutoDeploy] unfusing attention by @lucaslie in #3668
- feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator. by @SimengLiu-nv in #3438
- model/infra: add ci and doc for qwen3 by @byshiue in #4022
- [Deepseek][fix] Fix Deepseek MTP with moe_backend=TRTLLM by @hlu1 in #4001
- fix: Move all casters to customCasters. by @dcampora in #3945
- [fix] [nvbug/5252057] Fix kv cache reuse on PyTorch multimodal by @yechank-nvidia in #4025
- fix: Correctly sizes seqslotmanager considering pp. by @dcampora in #3984
- [infra] Improve llama4 parallelism test coverage by @mikeiovine in #3821
- feat: add Pytorch support of Vision Encoder for multimodal models by @qixiang-99 in #3791
- [fix] keep using system python for dev install by @tburt-nv in #4014
- refactor: Move ModelSpec to core library by @Funatiq in #3980
- infra: Remove the WAR for test items incompletely by @EmmaQiaoCh in #3313
- refactor: Introduce MpiTag enumeration and update MPI function signatures by @Funatiq in #3893
- chore: refactor llmapi e2e tests by @Superjomn in #3803
- Chore: 2025-04-29 CI allowlist update by @tburt-nv in #3969
- feat: support to trace executor loop. by @yuxianq in #3983
- fix: [nvbug/5241627] Fix AllReduce kernel hang issue when both tp and pp are enabled. by @hyukn in #3988
- [Infra] Waive L0 tests by @yiqingy0 in #4051
- fix: apply rope twice in Qwen3. by @yuxianq in #4040
- fix: instantiate decoder early in pytorch by @dcampora in #4029
- feat: run mmlu and summarize without engine_dir. by @yuxianq in #4056
- [Test]: Waive unsupported tests by @chzblych in #4059
- fix: request termination in pipeline parallelism by @Funatiq in #3892
- [Test]: Clean up stale waives by @chzblych in #4062
- test: Add disaggregated serving accuracy tests by @Tabrizian in #4036
- [nvbug/5248986][fix] Skip debugCheckSemaphores in stream capture mode by @mikeiovine in #4032
- test: Test OOB access issue in penaltyKernel for endId=-1 by @brb-nv in #4035
- feat: add deepseek-r1 reasoning parser to trtllm-serve by @pansicheng in #3354
- Fix: fix bug of qwen3 moe by @byshiue in #4058
- doc: update qwen3 document by @byshiue in #4073
- [AutoDeploy][perf] Further optimize flashinfer backend in AutoDeploy by @suyoggupta in #4024
- [fix] Loosen the thresholds of test_attention_mla by @jinyangyuan-nvidia in #4074
- feat: support add internal cutlass kernels as subproject by @tongyuantongyu in #3658
- fix[nvbug5245262]: skip add new slot if request has slot 0 by @HuiGao-NV in #3991
- fix: [nvbug/5251968] Fix NVLink version decoding. by @yuxianq in #3996
- [https://nvbugs/5257681] fix: draft/target probs shape by @Funatiq in #4055
- infra: [TRTLLM-4475][TRTLLM-4565] Add pipeline hierarchy and basic info in the Jenkins job page by @ZhanruiSunCh in #3859
- fix: trtllm-serve hang in stress test and ds v3 stress parameter update by @dominicshanshan in #3836
- [TRTLLM-3429] feat: Overlap scheduling in C++ runtime by @Funatiq in #3625
- fix: Properly get decoding mode according to same logic as cpp. by @dcampora in #4026
- chore: cleanup llmapi for 1.0 by @hchings in #4039
- TorchLLM: Pass local dir to processor creation by @milesial in #4018
- test(perf): Add Llama-3.1-Nemotron-Nano-8B-v1 to QA Perf Tests by @venkywonka in #3822
- bench: TRTLLM-4936 Port benchmark_serving.py by @kaiyux in #4011
- fix cache transfer buffer by @chuangz0 in #3942
- [TRTLLM-3925, https://nvbugs/5245262] [fix] Normalize LLM.generate API by @syuoni in #3985
- [Qwen3] chore: fix bug of fused_moe on tp > 1 by @byshiue in #4093
- [TRTLLM-5057][fix] Adding option to specify a set of token ids for multimodal tokens by @rakib-hasan in #4107
- chore: Cleanup deprecated APIs from LLM-API (part 1/2) by @Superjomn in #3732
- [Infra] - Update code ownership rules by @chzblych in #4109
- tests: skip writing prepare_dataset output to logs, and add llama_v3.1_8b_fp8, llama_v3.3_70b_fp8, llama_v3.1_405b_fp4 models by @ruodil in #3864
- [https://nvbugspro.nvidia.com/bug/5246419][fix] Align default setting & remove unnecessary check for chat and completion by @LinPoly in #3888
- infra: [TRTLLM-4051] Support only run some backend type test by @ZhanruiSunCh in #3578
- chore:update .gitignore for doc building task. by @nv-guomingz in #3993
- enh: Update docker Makefile to use only the visible GPUs of machine by @venkywonka in #4097
- feat: Reduce branch overhead in groupRMSNorm kernels by @SimengLiu-nv in #4067
- [Deepseek] Refactor Deepseek Decoder layer by @hlu1 in #4016
- [feat/] enable attention DP in Llama4 maverick model - part 1 by @zihaok in #4065
- test: add INTEGRATION_TEST env var to speed up integration test by @crazydemo in #3618
- [Infra] - Update code ownership rules for public APIs by @chzblych in #4122
- chore: remove data stage in serve example on slurm by @Superjomn in #4138
- test: Waive test_llm cases by @syuoni in #4136
- test: Waive disagg accuracy test by @syuoni in #4124
- infra: WAR for Argument list too long of globalVars[CACHED_CHANGED_FILE_LIST] by @ZhanruiSunCh in #4131
- feat: Add Slurm support and enable RTX Pro 6000 testing pipeline in CI by @yuanjingx87 in #4019
- [Infra] Waive L0 flaky test by @yiqingy0 in #4148
- doc: TRTLLM-4797 Update perf-analysis.md by @kaiyux in #4100
- Fix TP8 for NVFP4 kv duplication. by @Tracin in #4143
- test: [CI] remove closed bugs by @xinhe-nv in #4046
- [TRTQA-2861][test]: add nemotron and llama4 cases into qa test by @crazydemo in #4053
- chore: enhance the cmake experience by ignoring the additional semicolon by @nv-guomingz in #3992
- [TRTLLM-4480][doc] Documentation for new accuracy test suite and trtllm-eval by @syuoni in #3946
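
Several of the changes above extend the LLM API itself; for example, #3388 adds Top-K logprobs and prompt logprobs. The sketch below assumes `SamplingParams` accepts integer `logprobs` and `prompt_logprobs` fields and that per-token logprob data is exposed on the completion outputs; the exact field names, model id, and result layout are assumptions, so check the LLM API reference before relying on them.

```python
# Minimal sketch (assumed field names): request top-2 logprobs per generated
# token plus prompt-token logprobs through the high-level LLM API.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # hypothetical model id
params = SamplingParams(max_tokens=16, logprobs=2, prompt_logprobs=1)

for output in llm.generate(["Deep learning is"], params):
    completion = output.outputs[0]
    print(completion.text)
    # Assumed result layout: per-step logprob information for the top-k tokens.
    print(completion.logprobs)
```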
New Contributors
- @milesial made their first contribution in #3383
- @hello-11 made their first contribution in #3917
- @tomeras91 made their first contribution in #3891
- @djns99 made their first contribution in #3790
- @venkywonka made their first contribution in #3822
- @zihaok made their first contribution in #4065
- @yuanjingx87 made their first contribution in #4019
Full Changelog: v0.20.0rc1...v0.20.0rc2