v0.20.0rc2

Pre-release

@Shixiaowei02 released this 13 May 09:27 · 594 commits to main since this release · 74df12b

Highlights

  • Model Support
    • Added support for Qwen3 (#4010)
  • Features
    • Integrated Llama4 input processor (#3383)
    • Added CGA reduction FMHA kernels on Blackwell (#3763)
    • Implemented LogitsProcessor in the PyTorch backend (#3145); see the sketch after this list
    • Unfused attention in AutoDeploy for native model support (#3668)
    • Added group_rms_norm kernel to normalize multiple inputs in a single operator (#3438)
    • Supported multiple LoRA adapters and TP (#3885)
  • API
    • Introduced multimodal embedding field in LlmRequest (#3855)
    • Enabled overriding CLI arguments with YAML file in trtllm-serve (#4164)
  • Bug Fixes
    • Fixed a bug where a CUDA stream created as a default parameter was initialized at import time (#3764)
  • Benchmark
  • Performance
  • Infra
    • Open-sourced XQA kernels (#3762)
  • Documentation
  • Known Issues
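
As a quick illustration of the new LogitsProcessor hook (#3145), the sketch below masks a single vocabulary id before sampling. It is a minimal example, not the definitive interface: it assumes `SamplingParams` accepts a `logits_processor` callable and that the base class and call signature match the existing LLM API logits-processor examples; the `BanTokenProcessor` name and the `banned_token_id` parameter are purely illustrative. Check the PR and the LLM API examples shipped with this release for the exact signature in the PyTorch backend.

```python
# Minimal sketch (assumed interface): ban one token id by setting its logit to -inf.
# The base class import and __call__ signature follow the existing LLM API
# logits-processor examples; the PyTorch backend added in #3145 may differ slightly.
from typing import List, Optional

import torch

from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.sampling_params import LogitsProcessor


class BanTokenProcessor(LogitsProcessor):  # hypothetical example class
    """Sets the logit of one vocabulary id to -inf so it is never sampled."""

    def __init__(self, banned_token_id: int):
        self.banned_token_id = banned_token_id

    def __call__(self, req_id: int, logits: torch.Tensor,
                 token_ids: List[List[int]], stream_ptr: Optional[int],
                 client_id: Optional[int]) -> None:
        # Modify the logits in place; the runtime samples from the edited tensor.
        logits[..., self.banned_token_id] = float("-inf")


if __name__ == "__main__":
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # any supported HF checkpoint
    params = SamplingParams(
        max_tokens=32,
        logits_processor=BanTokenProcessor(banned_token_id=0),
    )
    for output in llm.generate(["The capital of France is"], params):
        print(output.outputs[0].text)
```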

What's Changed

  • feat: llama4 multimodal input processor by @milesial in #3383
  • fix: [nvbug/5234873] Detect pmix and raise error when mpirun is not used. by @yuxianq in #3858
  • fix: fix bug of deepseek group_size setting by @byshiue in #3860
  • Infra: Remove empty junit xml by @EmmaQiaoCh in #3794
  • fix: Update num_of_ctx_tokens in iteration stats by @HuiGao-NV in #3785
  • cacheTransceiver buffer manager by @chuangz0 in #3798
  • fix: add warmup flag into py_executor to prevent enable profiler during wa… by @byshiue in #3852
  • fix: trtllm-bench build trt engine on slurm by @Superjomn in #3825
  • infra: install Triton in the base image by @Tabrizian in #3759
  • fix bug of create cuda stream as default parameter which will be init… by @byshiue in #3764
  • Test: waive intermittent test hang by @chzblych in #3894
  • [TRTLLM-4786] infra: add scaffolding paths to pytorch only files by @dc3671 in #3835
  • test: [CI] Add failed cases into waives.txt by @xinhe-nv in #3887
  • test: [CI] Add failed cases into waives.txt by @xinhe-nv in #3867
  • Fix the link of doc by @litaotju in #3903
  • [TRTLLM-4638 ][feat] add best of n support with reward model in scaffolding by @dc3671 in #3807
  • Add docs about DeepSeek-R1 long context support. by @qiaoxj07 in #3910
  • [https://nvbugs/5247300] fix(requirements): fix neither 'setup.py' nor 'pyproject.toml' found by @dc3671 in #3906
  • chore: Make llama4 MoE use maybe_execute_in_parallel by @mikeiovine in #3779
  • fix: Fixing minor typo in allreduce kernel selection by @hyukn in #3912
  • test: add deepseek v3 & r1 cases by @VALLIS-NERIA in #3528
  • [fix] Fix a few issues with EAGLE3 in PyTorch backend by @mikeiovine in #3686
  • waive test_attention_no_cache by @hchings in #3921
  • fix: Fix FMHA-based MLA in the generation phase and add MLA unit test by @jinyangyuan-nvidia in #3863
  • chore: remove DummyKvCacheManager. by @yuxianq in #3896
  • fix(test): remove random context seq lengths and set random seed by @qixiang-99 in #3919
  • feat: fix errors on scaffolding README by @WeiHaocheng in #3899
  • fix: [https://nvbugspro.nvidia.com/bug/5242406][fix] Fix fp8 kvcache support by @hlu1 in #3877
  • feat: add CGA reduction fmha kernels on Blackwell. by @PerkzZheng in #3763
  • [CI] increase H100 CI nodes for PyTorch only pipelines by @QiJune in #3927
  • [TRTLLM-4883][fix]: Update output speed calculation. by @FrankD412 in #3923
  • chore: add num_scheduled_requests into print_log by @byshiue in #3914
  • fix: revert #3858 by @yuxianq in #3928
  • chore: change log level of some text from info to debug by @byshiue in #3930
  • [fix] optimize cudaMemGetInfo for TllmGenFmhaRunner by @zhhuang-nv in #3907
  • chore: Mass integration of release/0.19 into main by @DomBrown in #3841
  • feat: parallel q_b_proj and concat by @hello-11 in #3917
  • refactor: (part1) Add contraints doc for fusedMoe module. by @HuiGao-NV in #3882
  • fix: get head_dim from model’s config. by @yuxianq in #3916
  • TRTLLM-4624 feat: Add nvfp4 gemm and moe support for SM120 by @VALLIS-NERIA in #3770
  • [feat] support ModelOpt NemotronH FP8 quantized checkpoints in TRTLLM pytorch flow by @tomeras91 in #3891
  • fix: change the seq_lens sync copy to an async one by @lfr-0531 in #3786
  • [https://nvbugs/5178445][fix] Skip blackwell tests for sm120 by @pamelap-nvidia in #3815
  • chore: skip pipeline parallelism test of pytorch flow by @QiJune in #3947
  • [TRTLLM-4623][fix] sync internal cutlass kernel changes by @pamelap-nvidia in #3968
  • chore: update multi-gpu trigger file list by @QiJune in #3971
  • test: [CI] remove closed bugs by @xinhe-nv in #3890
  • chore: Remove duplicated get_sm_version. by @yuxianq in #3935
  • chore: bump version to 0.20.0rc2 by @ZhanruiSunCh in #3949
  • perf: Optimise MOE prologue to use fused setup function by @djns99 in #3790
  • chore: remove release branch codeowners from main by @tburt-nv in #3954
  • fix: [https://nvbugspro.nvidia.com/bug/5243482] If FlashMLA is used, the existence of FMHA based MLA kernels should not be checked. by @bobboli in #3862
  • unwaive disagg tests by @chuangz0 in #3925
  • infra: open source XQA kernels by @ming-wei in #3762
  • feat: Mistral-Large-2 support in the Pytorch workflow by @hypdeb in #3845
  • chore: update internal_cutlass_kernels. by @nv-guomingz in #3973
  • [fix] Pad requests to maximum draft length in spec decode by @mikeiovine in #3957
  • infra: add conan by @tburt-nv in #3744
  • waive test_tinyllama_guided_decoding by @hchings in #3997
  • [TRTLLM-4460] test: Use Llama 3.2 1B for Llama C++ tests by @DomBrown in #3206
  • refactor: Clean up allreduce module for Deepseek V3 model by @hyukn in #3829
  • [feat]: Allow for a settable end-of-sequence/padding token in max throughput benchmark. by @FrankD412 in #3776
  • feat: Add multimodal embedding field in LlmRequest by @katec846 in #3855
  • Llama4 processor fixes by @milesial in #3994
  • fix: Add attention workspace memory check by @hlu1 in #3970
  • feat: add relaxed acceptance for DS by @yweng0828 in #3865
  • fix: https://nvbugs/5246733 by @nv-guomingz in #3989
  • model: support Qwen3 by @byshiue in #4010
  • test: [CI] Add failed cases into waives.txt by @xinhe-nv in #3943
  • feat: Support Top-K logprobs and prompt_logprobs in LLMAPI by @hchings in #3388
  • [AutoDeploy] Make all ranks agree on kv-cache size by @suyoggupta in #4007
  • feat: LogitsProcessor in PyTorch backend by @hchings in #3145
  • fix: Fallback to NCCL for various patterns when input size is large. by @hyukn in #4009
  • feat: [AutoDeploy] unfusing attention by @lucaslie in #3668
  • feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator. by @SimengLiu-nv in #3438
  • model/infra: add ci and doc for qwen3 by @byshiue in #4022
  • [Deepseek][fix] Fix Deepseek MTP with moe_backend=TRTLLM by @hlu1 in #4001
  • fix: Move all casters to customCasters. by @dcampora in #3945
  • [fix] [nvbug/5252057] Fix kv cache reuse on PyTorch multimodal by @yechank-nvidia in #4025
  • fix: Correctly sizes seqslotmanager considering pp. by @dcampora in #3984
  • [infra] Improve llama4 parallelism test coverage by @mikeiovine in #3821
  • feat: add Pytorch support of Vision Encoder for multimodal models by @qixiang-99 in #3791
  • [fix] keep using system python for dev install by @tburt-nv in #4014
  • refactor: Move ModelSpec to core library by @Funatiq in #3980
  • infra: Remove the WAR for test items incompletely by @EmmaQiaoCh in #3313
  • refactor: Introduce MpiTag enumeration and update MPI function signatures by @Funatiq in #3893
  • chore: refactor llmapi e2e tests by @Superjomn in #3803
  • Chore: 2025-04-29 CI allowlist update by @tburt-nv in #3969
  • feat: support to trace executor loop. by @yuxianq in #3983
  • fix: [nvbug/5241627] Fix AllReduce kernel hang issue when both tp and pp are enabled. by @hyukn in #3988
  • [Infra] Waive L0 tests by @yiqingy0 in #4051
  • fix: apply rope twice in Qwen3. by @yuxianq in #4040
  • fix: instantiate decoder early in pytorch by @dcampora in #4029
  • feat: run mmlu and summarize without engine_dir. by @yuxianq in #4056
  • [Test]: Waive unsupported tests by @chzblych in #4059
  • fix: request termination in pipeline parallelism by @Funatiq in #3892
  • [Test]: Clean up stale waives by @chzblych in #4062
  • test: Add disaggregated serving accuracy tests by @Tabrizian in #4036
  • [nvbug/5248986][fix] Skip debugCheckSemaphores in stream capture mode by @mikeiovine in #4032
  • test: Test OOB access issue in penaltyKernel for endId=-1 by @brb-nv in #4035
  • feat: add deepseek-r1 reasoning parser to trtllm-serve by @pansicheng in #3354
  • Fix: fix bug of qwen3 moe by @byshiue in #4058
  • doc: update qwen3 document by @byshiue in #4073
  • [AutoDeploy][perf] Further optimize flashinfer backend in AutoDeploy by @suyoggupta in #4024
  • [fix] Loosen the thresholds of test_attention_mla by @jinyangyuan-nvidia in #4074
  • feat: support add internal cutlass kernels as subproject by @tongyuantongyu in #3658
  • fix[nvbug5245262]: skip add new slot if request has slot 0 by @HuiGao-NV in #3991
  • fix: [nvbug/5251968] Fix NVLink version decoding. by @yuxianq in #3996
  • [https://nvbugs/5257681] fix: draft/target probs shape by @Funatiq in #4055
  • infra: [TRTLLM-4475][TRTLLM-4565] Add pipeline hierarchy and basic info in the Jenkins job page by @ZhanruiSunCh in #3859
  • fix: trtllm-serve hang in stress test and ds v3 stress parameter update by @dominicshanshan in #3836
  • [TRTLLM-3429] feat: Overlap scheduling in C++ runtime by @Funatiq in #3625
  • fix: Properly get decoding mode according to same logic as cpp. by @dcampora in #4026
  • chore: cleanup llmapi for 1.0 by @hchings in #4039
  • TorchLLM: Pass local dir to processor creation by @milesial in #4018
  • test(perf): Add Llama-3.1-Nemotron-Nano-8B-v1 to QA Perf Tests by @venkywonka in #3822
  • bench: TRTLLM-4936 Port benchmark_serving.py by @kaiyux in #4011
  • fix cache transfer buffer by @chuangz0 in #3942
  • [TRTLLM-3925, https://nvbugs/5245262] [fix] Normalize LLM.generate API by @syuoni in #3985
  • [Qwen3] chore: fix bug of fused_moe on tp > 1 by @byshiue in #4093
  • [TRTLLM-5057][fix] Adding option to specify a set of token ids for multimodal tokens by @rakib-hasan in #4107
  • chore: Cleanup deprecated APIs from LLM-API (part 1/2) by @Superjomn in #3732
  • [Infra] - Update code ownership rules by @chzblych in #4109
  • tests: skip writing prepare_dataset output to logs, and add llama_v3.1_8b_fp8, llama_v3.3_70b_fp8, llama_v3.1_405b_fp4 models by @ruodil in #3864
  • [https://nvbugspro.nvidia.com/bug/5246419][fix] Align default setting & remove unnecessary check for chat and completion by @LinPoly in #3888
  • infra: [TRTLLM-4051] Support only run some backend type test by @ZhanruiSunCh in #3578
  • chore: update .gitignore for doc building task. by @nv-guomingz in #3993
  • enh: Update docker Makefile to use only the visible GPUs of machine by @venkywonka in #4097
  • feat: Reduce branch overhead in groupRMSNorm kernels by @SimengLiu-nv in #4067
  • [Deepseek] Refactor Deepseek Decoder layer by @hlu1 in #4016
  • [feat/] enable attention DP in Llama4 maverick model - part 1 by @zihaok in #4065
  • test: add INTEGRATION_TEST env var to speed up integration test by @crazydemo in #3618
  • [Infra] - Update code ownership rules for public APIs by @chzblych in #4122
  • chore: remove data stage in serve example on slurm by @Superjomn in #4138
  • test: Waive test_llm cases by @syuoni in #4136
  • test: Waive disagg accuracy test by @syuoni in #4124
  • infra: WAR for Argument list too long of globalVars[CACHED_CHANGED_FILE_LIST] by @ZhanruiSunCh in #4131
  • feat: Add Slurm support and enable RTX Pro 6000 testing pipeline in CI by @yuanjingx87 in #4019
  • [Infra] Waive L0 flaky test by @yiqingy0 in #4148
  • doc: TRTLLM-4797 Update perf-analysis.md by @kaiyux in #4100
  • Fix TP8 for NVFP4 kv duplication. by @Tracin in #4143
  • test: [CI] remove closed bugs by @xinhe-nv in #4046
  • [TRTQA-2861][test]: add nemotron and llama4 cases into qa test by @crazydemo in #4053
  • chore: enhance the cmake experience by ignoring the additional semicolon by @nv-guomingz in #3992
  • [TRTLLM-4480][doc] Documentation for new accuracy test suite and trtllm-eval by @syuoni in #3946

New Contributors

Full Changelog: v0.20.0rc1...v0.20.0rc2