v0.20.0rc2
Pre-release
Highlights
- Model Support
  - Added support for Qwen3 (#4010)
- Features
  - Integrated Llama4 input processor (#3383)
  - Added CGA reduction FMHA kernels on Blackwell (#3763)
  - Implemented `LogitsProcessor` in the PyTorch backend (#3145); see the usage sketch after this section
  - Unfused attention for native support (#3668)
  - Added `group_rms_norm` kernel to normalize multiple inputs in a single operator (#3438)
  - Supported multiple LoRA adapters and TP (#3885)
- API
- Bug Fixes
  - Fixed a bug where a CUDA stream created as a default parameter was initialized at import time (#3764)
- Benchmark
- Performance
- Infra
  - Open-sourced XQA kernels (#3762)
- Documentation
- Known Issues
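
As a quick illustration of the Qwen3 and PyTorch-backend `LogitsProcessor` highlights above, here is a minimal sketch built on the high-level `tensorrt_llm.LLM` API. The checkpoint name, the `logits_processor` parameter on `SamplingParams`, and the processor call signature are assumptions for illustration rather than the confirmed 0.20.0rc2 interface; consult the LLM API documentation for the exact contract.

```python
# Minimal sketch (assumed API surface, not a confirmed example):
# load a Qwen3 checkpoint with the PyTorch backend and attach a custom
# logits processor that biases a single token id at every decoding step.
from tensorrt_llm import LLM, SamplingParams


class BiasTokenProcessor:
    """Hypothetical logits processor: adds a constant bias to one token id."""

    def __init__(self, token_id: int, bias: float = 5.0):
        self.token_id = token_id
        self.bias = bias

    def __call__(self, req_id, logits, token_ids, stream_ptr=None, client_id=None):
        # Assumed call signature; the real LogitsProcessor interface may differ.
        logits[..., self.token_id] += self.bias
        return logits


if __name__ == "__main__":
    llm = LLM(model="Qwen/Qwen3-8B")  # hypothetical Hugging Face model id
    params = SamplingParams(
        max_tokens=32,
        logits_processor=BiasTokenProcessor(token_id=0),  # assumed parameter name
    )
    for output in llm.generate(["The capital of France is"], params):
        print(output.outputs[0].text)
```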
What's Changed
- feat: llama4 multimodal input processor by @milesial in #3383
- fix: [nvbug/5234873] Detect pmix and raise error when mpirun is not used. by @yuxianq in #3858
- fix: fix bug of deepseek group_size setting by @byshiue in #3860
- Infra: Remove empty junit xml by @EmmaQiaoCh in #3794
- fix: Update num_of_ctx_tokens in iteration stats by @HuiGao-NV in #3785
- cacheTransceiver buffer manager by @chuangz0 in #3798
- fix: add warmup flag into py_executor to prevent enable profiler during wa… by @byshiue in #3852
- fix: trtllm-bench build trt engine on slurm by @Superjomn in #3825
- infra: install Triton in the base image by @Tabrizian in #3759
- fix bug of create cuda stream as default parameter which will be init… by @byshiue in #3764
- Test: waive intermittent test hang by @chzblych in #3894
- [TRTLLM-4786] infra: add scaffolding paths to pytorch only files by @dc3671 in #3835
- test: [CI] Add failed cases into waives.txt by @xinhe-nv in #3887
- test: [CI] Add failed cases into waives.txt by @xinhe-nv in #3867
- Fix the link of doc by @litaotju in #3903
- [TRTLLM-4638 ][feat] add best of n support with reward model in scaffolding by @dc3671 in #3807
- Add docs about DeepSeek-R1 long context support. by @qiaoxj07 in #3910
- [https://nvbugs/5247300] fix(requirements): fix neither 'setup.py' nor 'pyproject.toml' found by @dc3671 in #3906
- chore: Make llama4 MoE use maybe_execute_in_parallel by @mikeiovine in #3779
- fix: Fixing minor typo in allreduce kernel selection by @hyukn in #3912
- test: add deepseek v3 & r1 cases by @VALLIS-NERIA in #3528
- [fix] Fix a few issues with EAGLE3 in PyTorch backend by @mikeiovine in #3686
- waive test_attention_no_cache by @hchings in #3921
- fix: Fix FMHA-based MLA in the generation phase and add MLA unit test by @jinyangyuan-nvidia in #3863
- chore: remove DummyKvCacheManager. by @yuxianq in #3896
- fix(test): remove random context seq lengths and set random seed by @qixiang-99 in #3919
- feat: fix errors on scaffolding README by @WeiHaocheng in #3899
- fix: [https://nvbugspro.nvidia.com/bug/5242406][fix] Fix fp8 kvcache support by @hlu1 in #3877
- feat: add CGA reduction fmha kernels on Blackwell. by @PerkzZheng in #3763
- [CI] increase H100 CI nodes for PyTorch only pipelines by @QiJune in #3927
- [TRTLLM-4883][fix]: Update output speed calculation. by @FrankD412 in #3923
- chore: add num_scheduled_requests into print_log by @byshiue in #3914
- fix: revert #3858 by @yuxianq in #3928
- chore: change log level of some text from info to debug by @byshiue in #3930
- [fix] optimize cudaMemGetInfo for TllmGenFmhaRunner by @zhhuang-nv in #3907
- chore: Mass integration of release/0.19 into main by @DomBrown in #3841
- feat: parallel q_b_proj and concat by @hello-11 in #3917
- refactor: (part1) Add constraints doc for fusedMoe module. by @HuiGao-NV in #3882
- fix: get head_dim from model’s config. by @yuxianq in #3916
- TRTLLM-4624 feat: Add nvfp4 gemm and moe support for SM120 by @VALLIS-NERIA in #3770
- [feat] support ModelOpt NemotronH FP8 quantized checkpoints in TRTLLM pytorch flow by @tomeras91 in #3891
- fix: change the seq_lens sync copy to an async one by @lfr-0531 in #3786
- [https://nvbugs/5178445][fix] Skip blackwell tests for sm120 by @pamelap-nvidia in #3815
- chore: skip pipeline parallelism test of pytorch flow by @QiJune in #3947
- [TRTLLM-4623][fix] sync internal cutlass kernel changes by @pamelap-nvidia in #3968
- chore: update multi-gpu trigger file list by @QiJune in #3971
- test: [CI] remove closed bugs by @xinhe-nv in #3890
- chore: Remove duplicated get_sm_version. by @yuxianq in #3935
- chore: bump version to 0.20.0rc2 by @ZhanruiSunCh in #3949
- perf: Optimise MOE prologue to use fused setup function by @djns99 in #3790
- chore: remove release branch codeowners from main by @tburt-nv in #3954
- fix: [https://nvbugspro.nvidia.com/bug/5243482] If FlashMLA is used, the existence of FMHA based MLA kernels should not be checked. by @bobboli in #3862
- unwaive disagg tests by @chuangz0 in #3925
- infra: open source XQA kernels by @ming-wei in #3762
- feat: Mistral-Large-2 support in the Pytorch workflow by @hypdeb in #3845
- chore: update internal_cutlass_kernels. by @nv-guomingz in #3973
- [fix] Pad requests to maximum draft length in spec decode by @mikeiovine in #3957
- infra: add conan by @tburt-nv in #3744
- waive test_tinyllama_guided_decoding by @hchings in #3997
- [TRTLLM-4460] test: Use Llama 3.2 1B for Llama C++ tests by @DomBrown in #3206
- refactor: Clean up allreduce module for Deepseek V3 model by @hyukn in #3829
- [feat]: Allow for a settable end-of-sequence/padding token in max throughput benchmark. by @FrankD412 in #3776
- feat: Add multimodal embedding field in LlmRequest by @katec846 in #3855
- Llama4 processor fixes by @milesial in #3994
- fix: Add attention workspace memory check by @hlu1 in #3970
- feat: add relaxed acceptance for DS by @yweng0828 in #3865
- fix:https://nvbugs/5246733 by @nv-guomingz in #3989
- model: support Qwen3 by @byshiue in #4010
- test: [CI] Add failed cases into waives.txt by @xinhe-nv in #3943
- feat: Support Top-K logprobs and prompt_logprobs in LLMAPI by @hchings in #3388 (a logprobs usage sketch follows this list)
- [AutoDeploy] Make all ranks agree on kv-cache size by @suyoggupta in #4007
- feat: LogitsProcessor in PyTorch backend by @hchings in #3145
- fix: Fallback to NCCL for various patterns when input size is large. by @hyukn in #4009
- feat: [AutoDeploy] unfusing attention by @lucaslie in #3668
- feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator. by @SimengLiu-nv in #3438
- model/infra: add ci and doc for qwen3 by @byshiue in #4022
- [Deepseek][fix] Fix Deepseek MTP with moe_backend=TRTLLM by @hlu1 in #4001
- fix: Move all casters to customCasters. by @dcampora in #3945
- [fix] [nvbug/5252057] Fix kv cache reuse on PyTorch multimodal by @yechank-nvidia in #4025
- fix: Correctly sizes seqslotmanager considering pp. by @dcampora in #3984
- [infra] Improve llama4 parallelism test coverage by @mikeiovine in #3821
- feat: add Pytorch support of Vision Encoder for multimodal models by @qixiang-99 in #3791
- [fix] keep using system python for dev install by @tburt-nv in #4014
- refactor: Move ModelSpec to core library by @Funatiq in #3980
- infra: Remove the WAR for test items incompletely by @EmmaQiaoCh in #3313
- refactor: Introduce MpiTag enumeration and update MPI function signatures by @Funatiq in #3893
- chore: refactor llmapi e2e tests by @Superjomn in #3803
- Chore: 2025-04-29 CI allowlist update by @tburt-nv in #3969
- feat: support to trace executor loop. by @yuxianq in #3983
- fix: [nvbug/5241627] Fix AllReduce kernel hang issue when both tp and pp are enabled. by @hyukn in #3988
- [Infra] Waive L0 tests by @yiqingy0 in #4051
- fix: apply rope twice in Qwen3. by @yuxianq in #4040
- fix: instantiate decoder early in pytorch by @dcampora in #4029
- feat: run mmlu and summarize without engine_dir. by @yuxianq in #4056
- [Test]: Waive unsupported tests by @chzblych in #4059
- fix: request termination in pipeline parallelism by @Funatiq in #3892
- [Test]: Clean up stale waives by @chzblych in #4062
- test: Add disaggregated serving accuracy tests by @Tabrizian in #4036
- [nvbug/5248986][fix] Skip debugCheckSemaphores in stream capture mode by @mikeiovine in #4032
- test: Test OOB access issue in penaltyKernel for endId=-1 by @brb-nv in #4035
- feat: add deepseek-r1 reasoning parser to trtllm-serve by @pansicheng in #3354
- Fix: fix bug of qwen3 moe by @byshiue in #4058
- doc: update qwen3 document by @byshiue in #4073
- [AutoDeploy][perf] Further optimize flashinfer backend in AutoDeploy by @suyoggupta in #4024
- [fix] Loosen the thresholds of test_attention_mla by @jinyangyuan-nvidia in #4074
- feat: support add internal cutlass kernels as subproject by @tongyuantongyu in #3658
- fix[nvbug5245262]: skip add new slot if request has slot 0 by @HuiGao-NV in #3991
- fix: [nvbug/5251968] Fix NVLink version decoding. by @yuxianq in #3996
- [https://nvbugs/5257681] fix: draft/target probs shape by @Funatiq in #4055
- infra: [TRTLLM-4475][TRTLLM-4565] Add pipeline hierarchy and basic info in the Jenkins job page by @ZhanruiSunCh in #3859
- fix: trtllm-serve hang in stress test and ds v3 stress parameter update by @dominicshanshan in #3836
- [TRTLLM-3429] feat: Overlap scheduling in C++ runtime by @Funatiq in #3625
- fix: Properly get decoding mode according to same logic as cpp. by @dcampora in #4026
- chore: cleanup llmapi for 1.0 by @hchings in #4039
- TorchLLM: Pass local dir to processor creation by @milesial in #4018
- test(perf): Add Llama-3.1-Nemotron-Nano-8B-v1 to QA Perf Tests by @venkywonka in #3822
- bench: TRTLLM-4936 Port benchmark_serving.py by @kaiyux in #4011
- fix cache transfer buffer by @chuangz0 in #3942
- [TRTLLM-3925, https://nvbugs/5245262] [fix] Normalize LLM.generate API by @syuoni in #3985
- [Qwen3] chore: fix bug of fused_moe on tp > 1 by @byshiue in #4093
- [TRTLLM-5057][fix] Adding option to specify a set of token ids for multimodal tokens by @rakib-hasan in #4107
- chore: Cleanup deprecated APIs from LLM-API (part 1/2) by @Superjomn in #3732
- [Infra] - Update code ownership rules by @chzblych in #4109
- tests: skip writing prepare_dataset output to logs, and add llama_v3.1_8b_fp8, llama_v3.3_70b_fp8, llama_v3.1_405b_fp4 models by @ruodil in #3864
- [https://nvbugspro.nvidia.com/bug/5246419][fix] Align default setting & remove unnecessary check for chat and completion by @LinPoly in #3888
- infra: [TRTLLM-4051] Support only run some backend type test by @ZhanruiSunCh in #3578
- chore:update .gitignore for doc building task. by @nv-guomingz in #3993
- enh: Update docker Makefile to use only the visible GPUs of machine by @venkywonka in #4097
- feat: Reduce branch overhead in groupRMSNorm kernels by @SimengLiu-nv in #4067
- [Deepseek] Refactor Deepseek Decoder layer by @hlu1 in #4016
- [feat/] enable attention DP in Llama4 maverick model - part 1 by @zihaok in #4065
- test: add INTEGRATION_TEST env var to speed up integration test by @crazydemo in #3618
- [Infra] - Update code ownership rules for public APIs by @chzblych in #4122
- chore: remove data stage in serve example on slurm by @Superjomn in #4138
- test: Waive test_llm cases by @syuoni in #4136
- test: Waive disagg accuracy test by @syuoni in #4124
- infra: WAR for Argument list too long of globalVars[CACHED_CHANGED_FILE_LIST] by @ZhanruiSunCh in #4131
- feat: Add Slurm support and enable RTX Pro 6000 testing pipeline in CI by @yuanjingx87 in #4019
- [Infra] Waive L0 flaky test by @yiqingy0 in #4148
- doc: TRTLLM-4797 Update perf-analysis.md by @kaiyux in #4100
- Fix TP8 for NVFP4 kv duplication. by @Tracin in #4143
- test: [CI] remove closed bugs by @xinhe-nv in #4046
- [TRTQA-2861][test]: add nemotron and llama4 cases into qa test by @crazydemo in #4053
- chore: enhance the cmake experience by ignoring the additional semicolon by @nv-guomingz in #3992
- [TRTLLM-4480][doc] Documentation for new accuracy test suite and trtllm-eval by @syuoni in #3946
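
Several of the changes above extend the LLM API itself; for example, #3388 adds Top-K logprobs and prompt logprobs. The sketch below assumes `SamplingParams` accepts integer `logprobs` and `prompt_logprobs` fields and that per-token logprob data is exposed on the completion outputs; the exact field names, model id, and result layout are assumptions, so check the LLM API reference before relying on them.

```python
# Minimal sketch (assumed field names): request top-2 logprobs per generated
# token plus prompt-token logprobs through the high-level LLM API.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # hypothetical model id
params = SamplingParams(max_tokens=16, logprobs=2, prompt_logprobs=1)

for output in llm.generate(["Deep learning is"], params):
    completion = output.outputs[0]
    print(completion.text)
    # Assumed result layout: per-step logprob information for the top-k tokens.
    print(completion.logprobs)
```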
New Contributors
- @milesial made their first contribution in #3383
- @hello-11 made their first contribution in #3917
- @tomeras91 made their first contribution in #3891
- @djns99 made their first contribution in #3790
- @venkywonka made their first contribution in #3822
- @zihaok made their first contribution in #4065
- @yuanjingx87 made their first contribution in #4019
Full Changelog: v0.20.0rc1...v0.20.0rc2