
Releases: NVIDIA/TensorRT-LLM

v0.21.0rc1 (Pre-release)

11 Jun 05:27 · 9c012d5

Highlights

  • Model Support
    • Add HyperCLOVAX-SEED-Vision support for PyTorch flow (#4799)
  • Features
    • Support generation logits in TRTLLM Sampler (#4819); a usage sketch follows this list
    • Support for large-scale EP (#4818)
    • Support XQA-based MLA on SM120 (#4858)
    • Add PositionEmbeddingType=0 support to XQA (#4934)
    • Add cache reuse support (selective cache transfer) in the MLA cache formatter (#4749)
    • Update DeepSeek FP8 TRT-LLM Gen cubins (#4643)
    • Add heuristics for checkpoint files prefetching (#4765)
    • Enable NVFP4 output for TRTLLM attention kernels (#4737)
    • Refactor Fused MoE (#4790)
    • Add integration of etcd (#3738)
    • Memoize weight shuffle index to speed up weight preprocessing in moe_backend=TRTLLM (#4826)
    • Enable disaggregated serving for Qwen3 (#4929)
  • API
    • Set _AutoDeployLlmArgs as primary config object (#4891)
  • Bug Fixes
    • Fix warmup phase batch size out of range (#4986)
    • Fix buffer count (#5007)
    • Fix nvbug 5324252: broken test_resource_manager.py (#4925)
    • Fix nvbug 5280806: 2-model spec decode flow (#4807)
    • Fix nvbug 5324248: broken test_pytorch_model_engine.py (#4973)
    • Fix cuda graph padding for spec decoding (#4853)
    • Correct the order of llm request state (#4781)
    • Handle OOMs during KV cache estimation (#4690)
    • Only pass fast_build=true to non-pytorch backend (#4920)
    • Fix hang in the no-fusion allreduce path (#4594)
    • Deprecate AutoDeploy CI post-merge tests and keep them for local testing (#4892)
    • Fix nvbug 5302895: test_trtllm_bench_llmapi_launch failure (#4835)
    • Fix Llama 4 long-context issue (#4809)
    • Fix nvbug 5300080: the bug of setting attention_chunk_size, and enable chunked attention in the generation phase by default (#4693)
    • Fix nvbug 5294316: queued request stats (#4714)
    • Fix max_num_sequences calculation with overlap scheduling (#4532)
    • Fix trtllm-bench hang issue due to LLM API IPC (#4798)
    • Fix a pd+mtp accuracy issue (#4536)
  • Benchmark
    • Add beam width to the low-latency benchmark (#4812)
    • Fix trtllm-bench iter_stats and cuda_graph_batch_sizes errors (#4827)
  • Performance
  • Infrastructure
    • The TensorRT-LLM team now formally releases a Docker image on NGC.
    • Update jnlp version in container image (#4944)
    • Upgrade ModelOpt to 0.31.0 (#5003)
    • Upgrade FlashInfer to 0.2.5 (#5004)
  • Documentation
    • Document the Docker release image on NGC (#4705)
    • Fix readme for disaggregated serving (#4846)
    • Fix draft target README and set exclude_input_in_output to False (#4882)
    • Blog: Scaling Expert Parallelism in TensorRT-LLM (Part 1: Design and Implementation of Large-scale EP) (#4958)
  • Known Issues
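
As referenced in the Features list above, the TRTLLM sampler can now return generation logits. The snippet below is a minimal, hedged sketch of requesting them through the LLM API; the return_generation_logits field and the generation_logits output attribute are assumptions carried over from earlier LLM API releases, so verify them against the current API reference.

```python
# Hedged sketch: requesting generation logits through the LLM API.
# Assumes SamplingParams exposes a `return_generation_logits` flag and that the
# per-sequence output carries a `generation_logits` attribute, as in earlier
# LLM API releases; verify both against the current API reference.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # any supported HF id or local path

params = SamplingParams(max_tokens=8, return_generation_logits=True)
for output in llm.generate(["Hello, my name is"], params):
    completion = output.outputs[0]
    print(completion.text)
    # Expected to hold one row of logits per generated token when requested.
    print(getattr(completion, "generation_logits", None))
```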


v0.21.0rc0 (Pre-release)

04 Jun 02:48 · 9ae2ce6

Highlights

  • Model Support
  • Features
    • Support for large-scale EP (#4384, #4495, #4615)
    • Added chunked attention kernels (#4291, #4394); a conceptual mask sketch follows this list
    • ScaffoldingLLM now supports MCP (#4410)
    • Integrated NIXL into the communication layer of the disaggregated service (#3934, #4125)
    • Integrated Hopper chunked attention kernels (#4330)
    • Enabled TRT backend for Python runtime in disaggregated service (#4243)
    • Added FP8 block-scale GEMM support on SM89 (#4481)
    • Added Qwen3 FP4 MoE TRTLLM backend for low latency (#4530)
    • Introduced sliding-window attention kernels for the generation phase on Blackwell (#4564)
    • Added vanilla MoE (#4682)
    • Integrated fused QKNorm + RoPE (#4611)
    • Added Fabric Memory support for KV cache transfer (#4717)
  • API
  • Bug Fixes
    • Resolved Torch compile issue for DeepSeek V3 (#3952)
    • Fixed trtllm-llmapi-launch for single-node, single-GPU setups (#4428)
    • Removed duplicate tokenization in generation server (#4492)
    • Fixed cancel request handling for attentionDP (#4648)
    • Fixed disaggregated service hang when MNNVL two-shot AllReduce is enabled (#4678)
    • Fixed queued request statistics (#4806)
    • Fixed EP load balancer with MTP layer and route offset by EP rank (#4767)
    • Resolved accuracy and illegal memory access issues with MTP + attention DP (#4379)
  • Benchmark
    • Added all_reduce.py benchmark script for testing (#4537)
  • Performance
  • Infrastructure
    • Integrated NGC image into Makefile automation and documentation (#4400)
    • Built Triton for ARM architecture (#4456)
    • Added triton release container (#4455)
    • Refactored Docker build image (Groovy) and added NGC image support (#4294)
    • Upgraded Cutlass to version 4.0 (#4794)
  • Documentation
    • Updated descriptions for NGC Docker images (#4702, #4705)
  • Known Issues
    • Two important fixes are NOT included in this release but are already on the main branch:
      • Fix for setting attention_chunk_size and enabling chunked attention in the generation phase by default (#4693)
      • Fix for the LLM API benchmark failure caused by a serialization issue (#4835)
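
The chunked-attention kernels referenced above ship as fused Hopper and Blackwell kernels. For orientation only, the sketch below builds the boolean mask that chunked causal attention implies: each query position attends only to non-future positions inside its own fixed-size chunk. It is a conceptual reference, not the kernel implementation.

```python
# Conceptual sketch only: the boolean mask implied by chunked causal attention.
# Each query position may attend to key positions in the same fixed-size chunk,
# and never to future positions. The shipped feature is a set of fused kernels.
import torch

def chunked_causal_mask(seq_len: int, chunk_size: int) -> torch.Tensor:
    pos = torch.arange(seq_len)
    same_chunk = (pos[:, None] // chunk_size) == (pos[None, :] // chunk_size)
    causal = pos[:, None] >= pos[None, :]
    return same_chunk & causal  # True where attention is allowed

print(chunked_causal_mask(seq_len=8, chunk_size=4).int())
```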

What's Changed

  • Refine doc by @juney-nvidia in #4420
  • Refine doc by @juney-nvidia in #4421
  • refine doc by @juney-nvidia in #4422
  • Remove vila test by @Tabrizian in #4376
  • [TRTLLM-4618][feat] Add Nemotron Super 49B FP8 test on RTX6000 Pro (SM120) by @farazkh80 in #4363
  • tests: add qa test mentioned in docs by @crazydemo in #4357
  • [Infra] - Always push the release images in the post-merge job by @chzblych in #4426
  • tests: Add test cases for rcca cases by @crazydemo in #4347
  • chore: cleanup perf_evaluator code by @Superjomn in #3833
  • feat: Add pp support for hybrid attn/mamba model by @yuxianq in #4358
  • fix: wrong argument name enable_overlap_scheduler by @kaiyux in #4433
  • Update "Roadmap" link under README.md to the issues with Roadmap label by @AdamzNV in #4425
  • fix potential issues in allreduce fusion kernel and ut by @yilin-void in #4226
  • [TRTLLM-4638] feat(scaffolding): update Reward Controller to PRM specific controller with step split by @dc3671 in #4337
  • feat: NIXL interface integration by @Shixiaowei02 in #3934
  • Downgrade the logger level for fallback tactic warning. by @hyukn in #4440
  • Test: Improve model re-use in C++ DGX tests for CI stability by @DomBrown in #4263
  • fix: temp disable the problem test by @Shixiaowei02 in #4445
  • Add llama4 disagg accuracy tests by @Tabrizian in #4336
  • [https://nvbugs/5123103][fix] Fix torch compile for DeepSeekV3 by @liji-nv in #3952
  • [Docs] - Reapply #4220 by @chzblych in #4434
  • [TRTLLM-4618][feat] Fix cutlass MoE GEMM fallback failure on FP8 + add e2e test for Mixtral 8x7B FP8 on RTX6000 Pro (SM120) by @farazkh80 in #4335
  • [Feat] add chunked-attention kernels on Hopper (for llama4) by @PerkzZheng in #4291
  • test(perf): Add some Llama-3_3-Nemotron-Super-49B-v1 integration-perf-tests (TRT flow, trtllm-bench) by @venkywonka in #4128
  • fix: [nvbugs/5287097] Align PP layer distribution between pytorch and TRT flow. by @yuxianq in #4399
  • feat: Low Precision Allreduce for PCIe based GPU by @kanghui0204 in #4344
  • test: [CI] Add failed cases into waives.txt by @xinhe-nv in #4429
  • [TRTLLM-4932] Add CLI accuracy tests for Llama-3.3-70B-Instruct and LLM API BF16 variant by @moraxu in #4362
  • test: update test filter in perf test yml file to select cases by gpu name and add cases for RTX 6000 pro by @ruodil in #4282
  • [AutoDeploy] HF factory improvements by @lucaslie in #4371
  • chore: bump version to 0.21.0rc0 by @ZhanruiSunCh in #4465
  • doc: [TRTLLM-325]Integrate the NGC image in Makefile automation and document by @MartinMarciniszyn in #4400
  • infra: Add qwen3 235B tests into QA by @byshiue in #4483
  • feat: large-scale EP(part 2: MoE Load Balancer - core utilities) by @dongxuy04 in #4384
  • [TRTLLM-5085][fix] Nemotron H correctness test by @tomeras91 in #4444
  • [Docs] - Add date and commit info by @chzblych in #4448
  • fix[nvbug/5286515]: trtllm-llmapi-launch on single node single gpu by @Superjomn in #4428
  • fix: replace the image links in the blog by @Shixiaowei02 in #4489
  • fix: Fix TRTLLMSampler beam width bug. by @dcampora in #4473
  • refactor: Unify request order in TRT and PyTorch workflow by @Funatiq in #4096
  • [TRTLLM-5273]feat/Use full attention mask if Llama3 is used as encoder and fix EarlyStopDecoder unsqueeze bug by @nvrohanv in #4290
  • Build Triton for arm by @Tabrizian in #4456
  • test: [CI] remove closed bugs by @xinhe-nv in #4417
  • test(perf): Add remaining Phi-4-mini-instruct perf tests by @venkywonka in #4443
  • feat: conditional disaggregation in disagg server by @zhengd-nv in #3974
  • perf: Fuse gemm setup function for SM90/SM100 MOE plugin path by @djns99 in #4146
  • fix: skip weights defined in create_weights for pp. by @yuxianq in #4447
  • Feat: add chunked-attention kernels on Blackwell by @PerkzZheng in #4394
  • fix [nvbug/5220766]: llmapi-launch add add trtllm-bench test with engine building by @Superjomn in #4091
  • [TRTLLM-5000][feat] Pytorch implementation of ngram drafter by @thorjohnsen in #3936
  • test: NIXL single process test by @Shixiaowei02 in #4486
  • Chore: waive torch compile test cases of deepseek v3 lite by @QiJune in #4508
  • Feat: add deep_gemm swapab Kernel by @ruoqianguo in #4430
  • unwaive some disagg test by @chuangz0 in #4476
  • Clean: fmha codes by @PerkzZheng in #4496
  • tests: add llama 3.3 70b 2 nodes tests by @xinhe-nv in #4391
  • CI: waive test_fp8_block_scales_4gpus of deepseek v3 lite by @QiJune in #4520
  • test: remove enable_overlap_schedule in pytorch config and set enable_chunked prefill to be true for isl>2048 cases by @ruodil in #4285
  • docs: update the introduction for scaffolding by @WeiHaocheng in #4360
  • test: add failed case in waive list and fix some test script issue for perf test by @ruodil in #4527
  • tests: add qwene fp4 tests into QA test list & update sanity test list by @xinhe-nv in #4478
  • feat: large-scale EP(part 3: refactor - FusedMoe for redundant expert) by @dongxuy04 in #4495
  • refactor: DisaggExecutorTest by @Funatiq in #4398
  • chore: clean ucx and nixl mirror. by @nv-guomingz in h...

v0.20.0rc3 (Pre-release)

20 May 09:42 · 039f7e3

Highlights

  • Model Support
    • Support Mistral Small 3.1 24B VLM in TRT workflow (#4183)
    • Support Gemma3-1b-it in PyTorch workflow (#3999)
  • Features
    • Adopt new logprob definition in PyTorch flow (#4057)
    • Support multiple LoRA adapters and TP (#3885)
    • Add Piecewise CUDA Graph support (#3804)
    • Add KV cache-aware router for disaggregated serving (#3831)
    • Enable per-request stats with PyTorch backend (#4156)
    • Support DeepSeek-R1 W4A8 on Hopper (#4123)
    • Enable chunked context for FlashInfer (#4132)
    • Support KV cache reuse for MLA (#3571)
  • API
    • Allow overriding CLI arguments with a YAML file in trtllm-serve (#4164); a configuration sketch follows this list
    • Remove deprecated GptSession/V1 from TRT workflow (#4092)
  • Bug Fixes
    • Fix attention DP bug on Qwen3 MoE model (#4141)
    • Fix illegal memory access when running LLaMA 4 with CUDA Graph enabled (#4101)
  • Benchmark
    • Remove deprecated Python runtime benchmark (#4171)
    • Add benchmark support for scaffolding (#4286)
  • Performance
  • Infrastructure
    • The base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:25.04-py3 (#4049)
    • The dependent TensorRT version is updated to 10.10.0 (#4049)
    • The dependent CUDA version is updated to 12.9.0 (#4049)
    • The dependent public PyTorch version is updated to 2.7.0.
    • The pre-built TensorRT-LLM wheel on PyPI is now linked against PyTorch 2.7.0, which uses the CXX11 ABI (#4235)
  • Documentation
  • Known Issues
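
For the YAML override noted under API above, the sketch below writes an override file and shows how it might be passed to trtllm-serve. The --extra_llm_api_options flag and the keys used here are assumptions based on the documented pattern; check the 0.20 documentation for the exact flag name and schema.

```python
# Hedged sketch: producing a YAML override file for trtllm-serve.
# The flag name (--extra_llm_api_options) and the keys below are assumptions;
# consult the release documentation for the supported schema.
import yaml  # requires pyyaml

overrides = {
    "kv_cache_config": {"free_gpu_memory_fraction": 0.8},
    "max_batch_size": 16,
}

with open("serve_overrides.yaml", "w") as f:
    yaml.safe_dump(overrides, f)

# Then launch the server with the overrides applied, for example:
#   trtllm-serve <model> --extra_llm_api_options serve_overrides.yaml
```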


v0.19.0

09 May 12:55 · c6f7d42

TensorRT-LLM Release 0.19.0

Key Features and Enhancements

  • The C++ runtime is now open sourced.
  • PyTorch workflow
    • Added DeepSeek V3/R1 support. Refer to examples/deepseek_v3/README.md and to the blog docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md.
    • Added Llava-Next support.
    • Added BERT support.
    • Added a C++ based decoder, which added support for:
      • TopK / TopP.
      • Bad words.
      • Stop words.
      • Embedding bias.
    • Added Autotuner for custom-op-compatible tuning process.
      • Added a Python-based Autotuner core framework for kernel tuning.
      • Applied the Autotuner to fused MoE and NVFP4 linear operators for concept and performance evaluations.
    • Added guided decoding support (XGrammar integration).
    • Added pipeline parallelism support for the overlap scheduler in PyExecutor.
    • Added Qwen2VL model support.
    • Added mixed precision quantization support.
    • Added pipeline parallelism with attention DP support.
    • Added no-cache attention support.
    • Added PeftCacheManager support.
    • Added Qwen2.5‑VL support and refactored Qwen2‑VL.
    • Added trtllm‑gen FP4 GEMM support.
    • Added Qwen2 MoE support.
    • Applied AutoTuner to both Fused MoE and NVFP4 Linear operators.
    • Introduced a UserBuffers allocator.
    • Added Deepseek eager mode AllReduce fusion support.
    • Added Multi-Token Prediction (MTP) support. Refer to the “Multi-Token Prediction (MTP)” section of examples/deepseek_v3/README.md.
    • Added FlashMLA support for SM90.
    • Added support for enabling MTP with CUDA graph padding.
    • Added initial EAGLE-3 implementation.
    • Added support for FP8 MLA on NVIDIA Hopper and Blackwell GPUs.
  • AutoDeploy for PyTorch workflow.
    • The AutoDeploy for PyTorch workflow is an experimental feature in tensorrt_llm._torch.auto_deploy.
    • AutoDeploy provides an automated path from off-the-shelf models to optimized deployment in the TensorRT-LLM runtime.
    • Check out examples/auto_deploy/README.md for more details.
  • LLM API
    • [BREAKING CHANGE] Added dynamic logits processor support, and deprecated static logits processor.
    • Added batched logits processor support.
    • Added EAGLE support.
    • Added abort request support.
    • Added get_stats support.
    • Added multi-node support for Slurm-based clusters, refer to examples/llm-api/llm_mgmn_*.sh.
  • Added InternLM-XComposer2 support. Refer to “InternLM-XComposer2” section in examples/multimodal/README.md.
  • Added INT4-AWQ support for MoE models. Refer to the “AWQ Quantization” section in examples/mixtral/README.md.
  • Added Qwen2-Audio support. Refer to examples/qwen2audio/README.md.
  • Added Language-Adapter support. Refer to examples/language_adapter/README.md.
  • Added STDiT for OpenSoRA text-to-video support. Refer to examples/stdit/README.md.
  • Added vision encoders with tensor parallelism and context parallelism support. Refer to examples/vit/README.md.
  • Added EXAONE-Deep support. Refer to examples/exaone/README.md.
  • Added support for Phi-4-mini and Phi‑4‑MM.
  • Added Gemma3 text‑only model support. Refer to "Run Gemma 3" section at examples/gemma/README.md.
  • Added FP8 quantization support for Qwen2-VL.
  • Added batched inference support for the LLM API MMLU example examples/mmlu_llmapi.py.
  • Added FP4 quantization-layernorm fusion plugin support (Llama models only).
  • Added Mamba-Hybrid support.
  • Added NVILA video support, including 1-prompt/N-media and N-prompt/N-media batching modes.
  • Added a --quantize_lm_head option to examples/quantization/quantize.py to support lm_head quantization.
  • Added batched tensor FP4 quantization support.
  • Added a /metrics endpoint for trtllm-serve to log iteration statistics; a polling sketch follows this list.
  • Added LoRA support for Phi-2 model.
  • Added returning context logits support for trtllm-serve.
  • Added one-shot version for UserBuffer AllReduce-Normalization on FP16/BF16.
  • Added request BW metric measurement for disaggServerBenchmark.
  • Updated logits bitmask kernel to v3.
  • Enabled CUDA graphs when attention DP is used and the number of active requests differs across GPUs.
  • Added iteration log support for trtllm-bench.
  • fp8_blockscale_gemm is now open-sourced.
  • Added AWQ support for ModelOpt checkpoints.
  • Added Linear block scale layout support in FP4 quantization.
  • Added pre-quantized FP8 checkpoint support for Nemotron-mini-4b-instruct.
  • Added Variable-Beam-Width-Search (VBWS) support (part2).
  • Added LoRA support for Gemma.
  • Refactored scaffolding worker, added OpenAI API worker support.
  • Optionally split MoE inputs into chunks to reduce GPU memory usage.
  • Added UCX IP interface support.
  • [BREAKING CHANGE] Added output of first token to additional generation outputs.
  • Added FP8 support for SM120 architecture.
  • Registered ENABLE_MULTI_DEVICE and ENABLE_UCX as CMake options.
  • Made the scaffolding Controller more generic.
  • [BREAKING CHANGE] Added individual gatherContext support for each additional output.
  • Enabled PyExecutor inference flow to estimate max_num_tokens for kv_cache_manager.
  • Added TLLM_OVERRIDE_LAYER_NUM and TLLM_TRACE_MODEL_FORWARD environment variables for debugging.
  • Supported aborting disconnected requests.
  • Added an option to run disaggregated serving without context servers.
  • Fixed and improved allreduce and fusion kernels.
  • Enhanced the integrated robustness of scaffolding via __init__.py.
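
The /metrics endpoint mentioned above can be polled like any HTTP endpoint once trtllm-serve is running. The host and port below assume a default local launch and are placeholders; adjust them to your deployment.

```python
# Hedged sketch: polling the trtllm-serve /metrics endpoint for iteration statistics.
# Assumes a server is already running locally; the host and port are placeholders.
import requests

resp = requests.get("http://localhost:8000/metrics", timeout=5)
resp.raise_for_status()
print(resp.text)  # iteration statistics reported by the server
```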

API Changes

  • Exposed kv_cache_retention_config from the C++ executor API to the LLM API.
  • Moved BuildConfig arguments to LlmArgs.
  • Removed speculative decoding parameters from stateful decoders.
  • Exposed DecoderState via bindings and integrated it in decoder.
  • Refactored the LlmArgs with Pydantic and migrated remaining pybinding configurations to Python.
  • Refactored disaggregated serving scripts.
  • Added numNodes to ParallelConfig.
  • Redesigned the multi‑stream API for DeepSeek.

Fixed Issues

  • Fixed misused length argument of PluginField. This also fixes #2685.
  • Fixed a Llama-3.2 SmoothQuant convert checkpoint issue. (#2677)
  • Fixed a bug when loading an engine using LoRA through the LLM API. (#2782)
  • Fixed incorrect batch slot usage in addCumLogProbs kernel.
  • Fixed incorrect output for Llama-3.2-11B-Vision-Instruct. (#2796)
  • Removed the need to pass --extra-index-url https://pypi.nvidia.com when running pip install tensorrt-llm.

Infrastructure Changes

  • The dependent NVIDIA ModelOpt version is updated to 0.27.

Known Issues

  • The PyTorch workflow on SBSA is incompatible with bare metal environments like Ubuntu 24.04. Please use the PyTorch NGC Container for optimal support on SBSA platforms.

v0.20.0rc2 (Pre-release)

13 May 09:27 · 74df12b

Highlights

  • Model Support
    • Added support for Qwen3 (#4010)
  • Features
    • Integrated Llama4 input processor (#3383)
    • Added CGA reduction FHMA kernels on Blackwell (#3763)
    • Implemented LogitsProcessor in PyTorch backend (#3145)
    • Added unfused attention for native support (#3668)
    • Added group_rms_norm kernel to normalize multiple inputs in a single operator (#3438); a reference sketch follows this list
    • Supported multiple LoRA adapters and TP (#3885)
  • API
    • Introduced multimodal embedding field in LlmRequest (#3855)
    • Enabled overriding CLI arguments with YAML file in trtllm-serve (#4164)
  • Bug Fixes
    • Fixed a bug where a CUDA stream created as a default argument was initialized at import time (#3764)
  • Benchmark
  • Performance
  • Infra
    • Open-sourced XQA kernels (#3762)
  • Documentation
  • Known Issues
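
The group_rms_norm item above adds a fused kernel that normalizes several inputs in one operator. As a reference for the semantics only (the real operator is a single fused kernel whose Python binding may differ), a plain PyTorch equivalent might look like this:

```python
# Reference semantics only: RMSNorm applied to a group of inputs in one call.
# The shipped group_rms_norm is a single fused kernel; this loop is not it.
import torch

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    variance = x.float().pow(2).mean(dim=-1, keepdim=True)
    return (x.float() * torch.rsqrt(variance + eps)).to(x.dtype) * weight

def group_rms_norm_reference(inputs, weights, eps: float = 1e-6):
    # Normalize each input with its own weight; the fused kernel does this in one launch.
    return [rms_norm(x, w, eps) for x, w in zip(inputs, weights)]

xs = [torch.randn(2, 4096), torch.randn(2, 1024)]
ws = [torch.ones(4096), torch.ones(1024)]
print([out.shape for out in group_rms_norm_reference(xs, ws)])
```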


v0.20.0rc1 (Pre-release)

29 Apr 08:54 · d747223

Highlights

  • Features
    • PyTorch workflow
      • Part 1 of large-scale EP: Added MNNVL MoE A2A support (#3504)
      • Added smart router for the MoE module (#3641)
      • Added head size 72 support for QKV preprocessing kernel (#3743)

What's Changed

Full Changelog: v0.20.0rc0...v0.20.0rc1

v0.20.0rc0 (Pre-release)

23 Apr 15:42 · b16a127

Highlights

  • Model Support
    • Added Nemotron-H model support (#3430)
    • Added Dynasor-CoT in scaffolding examples (#3501)
  • Features
    • Added stream generation task scaffolding examples (#3527)
    • Added unfused RoPE support in MLA (#3610)
    • Multimodal models
      • Added support in trtllm-serve (#3590)
      • Added support in trtllm-bench; currently limited to image inputs only (#3490)
    • [Experimental] The TensorRT-LLM Triton backend now supports the LLM API (triton-inference-server/tensorrtllm_backend#742)
  • Performance
    • Optimized Large Embedding Tables in Multimodal Models (#3380)
  • Infra
    • The dependent datasets package version was upgraded to 3.1.0 (#3490)


v0.19.0rc0 (Pre-release)

18 Apr 23:19 · 258ae9c

Highlights

  • Model Support
    • Added Llama 4 support. (#3302)
    • Added support for Phi‑4‑MM (#3296)
    • Added Gemma3 text‑only model support. Refer to "Run Gemma 3" section at examples/gemma/README.md. (#3247)
    • Added Qwen2.5‑VL support for PyTorch workflow and refactored Qwen2‑VL (#3156)
  • Features
    • Added FP8 support for SM120 architecture (#3248)
    • Registered ENABLE_MULTI_DEVICE and ENABLE_UCX as CMake options (#3343)
    • Made the scaffolding Controller more generic (#3416)
    • Breaking change: Added individual gatherContext support for each additional output (#3374)
    • Added trtllm‑gen FP4 GEMM for the PyTorch workflow (#3423)
    • Added Qwen2 MoE support for PyTorch flow (#3369)
    • Enabled PyExecutor inference flow to estimate max_num_tokens for kv_cache_manager (#3092)
    • Added TLLM_OVERRIDE_LAYER_NUM and TLLM_TRACE_MODEL_FORWARD environment variables for debugging (#3417); a usage sketch follows this list
    • Applied the PyTorch workflow compatible AutoTuner to both Fused MoE and NVFP4 Linear operators (#3151)
    • Introduced a UserBuffers allocator for PyTorch flow (#3257)
    • Supported aborting disconnected requests (#3214)
    • Added support for FP8 MLA on NVIDIA Hopper and Blackwell GPUs (#3190)
    • Added an option to run disaggregated serving without context servers (#3243)
    • Enhanced RoPE support in AutoDeploy (#3115)
    • Fixed and improved allreduce and fusion kernels (#3064)
    • Added DeepSeek-V3 support in AutoDeploy (#3281)
    • Enhanced the integrated robustness of scaffolding via __init__.py (#3312)
  • API
    • Added numNodes to ParallelConfig (#3346)
    • Redesigned the multi‑stream API for DeepSeek (#3459)
  • Bug fixes
    • Fixed a wrong import of KvCacheConfig in examples/gpqa_llmapi.py (#3369)
    • Fixed the test name (#3534)
    • Fixed max_seq_len in executor_config (#3487)
    • Removed a duplicated line of code (#3523)
    • Disabled kv cache reuse for the prompt tuning test (#3474)
    • Fixed the issue of a first‑generation token being returned twice in streaming (#3427)
    • Added kv memory size per token calculation in the draft model (#3497)
    • Switched ZMQ from a file socket to a TCP socket in RemoteMpiCommSession (#3462)
    • Fixed PP for Llama (#3449)
    • Updated the default excluded_modules value for the fp8rowwise recipe (#3477)
    • Fixed disaggregation MTP with overlap (#3406)
    • Stopped memory estimation in start_attention (#3485)
    • Allowed the context_and_generation request type in disaggregated overlap (#3489)
    • Fixed the partial match issue (#3413)
    • Fixed Eagle decoding (#3456)
    • Fixed the py_decoding_iter update in the decoder (#3297)
    • Fixed the beam search diversity issue (#3375)
    • Updated ucxx to avoid occasional segfaults when profiling (#3420)
    • Fixed redrafter sampling (#3278)
    • Fixed mllama end‑to‑end PyTorch flow (#3397)
    • Reverted an extra CMake variable (#3351)
    • Fixed issues with the fused MoE path (#3435)
    • Fixed conflicting test names (#3316)
    • Fixed failing DeepSeek-V3 unit tests (#3385)
    • Fixed missing bias addition for FP4Linear (#3361)
    • Fixed the runtime error in test_deepseek_allreduce.py (#3226)
    • Fixed speculative decoding and multimodal input support (#3276)
    • Fixed PyTorch nvsmall via PyExecutor and improved TP support (#3238)
    • Fixed the p‑tuning test bug (#3326)
  • Performance
    • Cached sin and cos in the model instead of using a global LRU cache (#3378)
    • Deallocated tensors after use in MLA (#3286)
    • Enabled DeepGEMM by default (#3341)
    • Added a thread leak check and fixed thread/memory leak issues (#3270)
    • Used cudaMalloc to allocate kvCache (#3303)
    • Made ipc_periodically the default responses_handler (breaking change) (#3102)
    • Used NVRTC for DeepGEMM JIT compilation (#3239)
    • Optimized quantization kernels used in DeepSeek on Hopper (#3466)
  • Documentation
    • Added an example section for the multi‑node DeepSeek R1 benchmark on GB200 (#3519)
    • Documented disaggregation performance tuning (#3516)
    • Updated the perf‑benchmarking documentation for GPU configuration (#3458)
    • Updated the README and added a benchmarking blog for DeepSeek‑R1 (#3232)
    • Updated the documentation for using Draft‑Target‑Model (DTM) (#3366)
    • Updated the README for disaggregated serving (#3323)
    • Updated instructions to enable FP8 MLA for DeepSeek (#3488)
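
The TLLM_OVERRIDE_LAYER_NUM and TLLM_TRACE_MODEL_FORWARD variables listed under Features are debugging aids. The sketch below only shows setting them before the library is imported; the values are placeholders, and the accepted formats should be confirmed in the TensorRT-LLM source.

```python
# Hedged sketch: enabling the debugging environment variables named in this release.
# The values below are placeholders; confirm the accepted formats in the source tree.
import os

os.environ["TLLM_OVERRIDE_LAYER_NUM"] = "4"   # placeholder: run a reduced number of layers
os.environ["TLLM_TRACE_MODEL_FORWARD"] = "1"  # placeholder: trace the model forward pass

# Import after setting the variables so they take effect at initialization.
import tensorrt_llm  # noqa: E402

print(tensorrt_llm.__version__)
```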

Full change log: 5aeef6d...258ae9c.

TensorRT-LLM Release 0.18.2

16 Apr 06:47 · 5aec7af

Key Features and Enhancements

TensorRT-LLM Release 0.18.1

09 Apr 01:11 · 62f3c95

Key Features and Enhancements

  • The 0.18.x series of releases builds upon the 0.17.0 release, focusing exclusively on dependency updates without incorporating features from the previous 0.18.0.dev pre-releases. These features will be included in future stable releases.

Infrastructure Changes

  • The dependent transformers package version is updated to 4.48.3.