Releases · NVIDIA/TensorRT-LLM
v0.21.0rc1
Highlights
- Model Support
- Add HyperCLOVAX-SEED-Vision support for PyTorch flow (#4799)
- Features
- Support generation logits in TRTLLM Sampler (#4819)
- Support for large-scale EP (#4818)
- Support XQA-based MLA on SM120 (#4858)
- Add PositionEmbeddingType=0 to xqa support (#4934)
- Add cache reuse support (selective cache transfer) in mla cache formatter (#4749)
- Update DeepSeek FP8 TRT-LLM Gen cubins (#4643)
- Add heuristics for checkpoint files prefetching (#4765)
- Enable NVFP4 output for TRTLLM attention kernels (#4737)
- Refactor Fused MoE (#4790)
- Add integration of etcd (#3738)
- Memoize weight shuffle index to speed up weight preproc in moe_backend=TRTLLM (#4826)
- Enable Disaggregated serving for QWen-3 (#4929)
- API
- Set _AutoDeployLlmArgs as primary config object (#4891)
- Bug Fixes
- Fix warmup phase batch size out of range (#4986)
- Fix buffer count (#5007)
- Fix nvbug 5324252 test_resource_manager.py broken (#4925)
- Fix nvbug 5280806 2 model spec decode flow (#4807)
- Fix nvbug 5324248 test_pytorch_model_engine.py broken (#4973)
- Fix cuda graph padding for spec decoding (#4853)
- Correct the order of llm request state (#4781)
- Handle OOMs during KV cache estimation (#4690)
- Only pass fast_build=true to non-pytorch backend (#4920)
- Fix the no fusion all reduce hanging (#4594)
- Deprecate AutoDeploy CI post-merge tests and keep them for local testing (#4892)
- Fix nvbug 5302895 test_trtllm_bench_llmapi_launch fail (#4835)
- Fix llama 4 long context issue (#4809)
- Fix nvbug 5300080 the bug of setting attention_chunk_size and enable chunked-attention in the generation-phase by default (#4693)
- Fix nvbug 5294316 queued request stats (#4714)
- Fix max_num_sequences calculation with overlap scheduling (#4532)
- Fix trtllm-bench hang issue due to LLM API IPC (#4798)
- Fix a pd+mtp accuracy issue (#4536)
- Benchmark
- Performance
- Infrastructure
- Documentation
- Known Issues
What's Changed
- upgrade cutlass to 4.0 by @yunruis in #4794
- [feat] Enable NVFP4 output for TRTLLM attention kernels by @Tom-Zheng in #4737
- [https://nvbugs/5271281][fix] fix a pd+mtp accuracy issue by @lfr-0531 in #4536
- fix [nvbug5256044]: bench hang due to llmapi ipc by @Superjomn in #4798
- [nvbugs/5303555] ci: unwaive test_fp8_block_scales_cuda_graph_padding by @Funatiq in #4735
- fix: remove the accuracy assert on run_majority_vote_aime24.py #5340 by @WeiHaocheng in #4784
- feat: add heuristics for checkpoint files prefetching. by @yuxianq in #4765
- tests: [TRTQA-2905] improve timeout report for qa test cases by @crazydemo in #4753
- shorten reqs in con:1 cases and add streaming cases, and add l2 perf … by @ruodil in #4849
- Add pre-merge Triton backend tests by @Tabrizian in #4842
- [Architecture] Refactor FusedMoE by @hlu1 in #4790
- fix: max_num_sequences calculation with overlap scheduling by @Funatiq in #4532
- refactor: Separate DecoderState from GptDecoderBatched by @Funatiq in #4700
- [enhancement] Add beam width to low latency. by @FrankD412 in #4812
- fix: Register MoeLoadBalancerConfig to serialization.py by @syuoni in #4864
- feat: Add integration of etcd by @Shunkangz in #3738
- [nvbug 5294316] fix: Fix queued request stats by @pcastonguay in #4714
- chore: remove request_error ipc in LLM.submit by @Superjomn in #4763
- [Doc] Fix readme for disaggregated serving by @arekay in #4846
- chore: Waive examples/test_mistral.py::test_llm_mistral_v1_1gpu. by @SimengLiu-nv in #4873
- [Arch] Freeze model_config by @hlu1 in #4814
- [TRTLLM-5053] Refactoring and Unifying the Multimodal input preparation by @rakib-hasan in #4506
- feat: update DeepSeek FP8 TRT-LLM Gen cubins by @nekorobov in #4643
- [https://nvbugspro.nvidia.com/bug/5300080] Fix the bug of setting attention_chunk_size and enable chunked-attention in the generation-phase by default by @PerkzZheng in #4693
- [fix] Fix llama 4 long context by @mikeiovine in #4809
- Replace memset with data initialization within kernels by @ChristinaZ in #4851
- Refactor the first token response in PD by @Shunkangz in #4692
- Fix: NVBug 5302895 by @Shixiaowei02 in #4835
- feat: cache reuse support (selective cache transfer) in mla cache formatter by @zhengd-nv in #4749
- feat: Enhance AutoTuner inference path and code readability by @hyukn in #4466
- Chore: refine comments of prepare inputs method of model engine by @QiJune in #4837
- fix: build_config in TorchLlmArgs and avoid invalid args by @Superjomn in #4600
- chore: Mass integration of release/0.20. by @omera-nv in #4871
- [TRTLLM-4923][feat] Paged mamba cache by @tomeras91 in #4822
- chore: bump version to 0.21.0rc1 by @ZhanruiSunCh in #4896
- Fix: draft target README and set exclude_input_in_output to False by @eagle705 in #4882
- fix: correct the order of llm request state by @zhengd-nv in #4781
- fix: trtllm-bench iter_stats and cuda_graph_batch_sizes errors. by @qiaoxj07 in #4827
- chore: introduce KvCacheCreator by @ixlmar in #4581
- tests: Update gb200 test case by @yizhang-nv in #4754
- fix: Fix broken vanilla moe since FusedMoE refactor. by @yuxianq in #4897
- fix: LLM invalid arg in a test by @Superjomn in #4922
- [AutoDeploy] deprecate CI post-merge tests and keep them for local testing by @lucaslie in #4892
- [infra] Unwaive unittests/_torch by @mikeiovine in #4919
- [TRTLLM-4647][fix] Fix the no fusion allreduce hanging by @timlee0212 in #4594
- tests: fix 5273697 by @xinhe-nv in #4685
- Waive L0 tests by @yiqingy0 in #4927
- Only pass `fast_build=true` to non-pytorch backend by @netanel-haber in #4920
- tests: [TRTQA-2906] add benchmark serving tests by @xinhe-nv in #4901
- fix: handle OOMs during KV cache estimation by @ixlmar in #4690
- CI: waive test_llm_get_queued_stats by @QiJune in #4945
- [AutoDeploy] _AutoDeployLlmArgs as primary config object by @lucaslie in #4891
- Revert "[infra] Unwaive unittests/_torch" by @QiJune in #4950
- Revert "fix: build_config in TorchLlmArgs and avoid invalid args" by @QiJune in #4949
- [TRTLLM-5630] restore free_gpu_memory_fraction=0.9 in tests by @ixlmar in #4859
- Add disaggregated unittest by @Shunkangz in #4899
- Waive L0 tests by @yiqingy0 in #4953
- fix a bug of global cuda graph dummy request by @QiJune in #4894
- Fix: fix autodeploy by @QiJune in #4957
- feat : add PositionEmbeddingType=0 to xqa support by @dongjiyingdjy in #4934
- update fmha_v2 by @qsang-nv in #4895
- blog: Scaling Expert Parallelism in TensorRT-LLM (Part 1: Design and Implementation of Large-scale EP) by @kaiyux in https://github....
v0.21.0rc0
Highlights
- Model Support
- Features
- Support for large-scale EP (#4384, #4495, #4615)
- Added chunked attention kernels (#4291, #4394)
- ScaffoldingLLM now supports MCP (#4410)
- Integrated NIXL into the communication layer of the disaggregated service (#3934, #4125)
- Integrated Hopper chunked attention kernels (#4330)
- Enabled TRT backend for Python runtime in disaggregated service (#4243)
- Added FP8 block-scale GEMM support on SM89 (#4481)
- Qwen3 FP4 MoE TRTLLM backend for low-latency (#4530)
- Introduced sliding-window attention kernels for the generation phase on Blackwell (#4564)
- Added Vanilla MoE support (#4682)
- Fused QKNorm + RoPE integration (#4611)
- Fabric Memory support for KV Cache Transfer (#4717)
- API
- Bug Fixes
- Resolved Torch compile issue for DeepSeek V3 (#3952)
- Fixed trtllm-llmapi-launch for single-node, single-GPU setups (#4428)
- Removed duplicate tokenization in generation server (#4492)
- Fixed cancel request handling for attentionDP (#4648)
- Fixed disaggregated service hang when MNNVL two-shot AllReduce is enabled (#4678)
- Fixed queued request statistics (#4806)
- Fixed EP load balancer with MTP layer and route offset by EP rank (#4767)
- Resolved accuracy and illegal memory access issues with MTP + attention DP (#4379)
- Benchmark
- Added all_reduce.py benchmark script for testing (#4537)
- Performance
- Infrastructure
- Documentation
- Known Issues
What's Changed
- Refine doc by @juney-nvidia in #4420
- Refine doc by @juney-nvidia in #4421
- refine doc by @juney-nvidia in #4422
- Remove vila test by @Tabrizian in #4376
- [TRTLLM-4618][feat] Add Nemotron Super 49B FP8 test on RTX6000 Pro (SM120) by @farazkh80 in #4363
- tests: add qa test mentioned in docs by @crazydemo in #4357
- [Infra] - Always push the release images in the post-merge job by @chzblych in #4426
- tests: Add test cases for rcca cases by @crazydemo in #4347
- chore: cleanup perf_evaluator code by @Superjomn in #3833
- feat: Add pp support for hybrid attn/mamba model by @yuxianq in #4358
- fix: wrong argument name `enable_overlap_scheduler` by @kaiyux in #4433
- Update "Roadmap" link under README.md to the issues with Roadmap label by @AdamzNV in #4425
- fix potential issues in allreduce fusion kernel and ut by @yilin-void in #4226
- [TRTLLM-4638] feat(scaffolding): update Reward Controller to PRM specific controller with step split by @dc3671 in #4337
- feat: NIXL interface integration by @Shixiaowei02 in #3934
- Downgrade the logger level for fallback tactic warning. by @hyukn in #4440
- Test: Improve model re-use in C++ DGX tests for CI stability by @DomBrown in #4263
- fix: temp disable the problem test by @Shixiaowei02 in #4445
- Add llama4 disagg accuracy tests by @Tabrizian in #4336
- [https://nvbugs/5123103][fix] Fix torch compile for DeepSeekV3 by @liji-nv in #3952
- [Docs] - Reapply #4220 by @chzblych in #4434
- [TRTLLM-4618][feat] Fix cutlass MoE GEMM fallback failure on FP8 + add e2e test for Mixtral 8x7B FP8 on RTX6000 Pro (SM120) by @farazkh80 in #4335
- [Feat] add chunked-attention kernels on Hopper (for llama4) by @PerkzZheng in #4291
- test(perf): Add some `Llama-3_3-Nemotron-Super-49B-v1` integration-perf-tests (TRT flow, trtllm-bench) by @venkywonka in #4128
- fix: [nvbugs/5287097] Align PP layer distribution between pytorch and TRT flow. by @yuxianq in #4399
- feat: Low Precision Allreduce for PCIe based GPU by @kanghui0204 in #4344
- test: [CI] Add failed cases into waives.txt by @xinhe-nv in #4429
- [TRTLLM-4932] Add CLI accuracy tests for Llama-3.3-70B-Instruct and LLM API BF16 variant by @moraxu in #4362
- test: update test filter in perf test yml file to select cases by gpu name and add cases for RTX 6000 pro by @ruodil in #4282
- [AutoDeploy] HF factory improvements by @lucaslie in #4371
- chore: bump version to 0.21.0rc0 by @ZhanruiSunCh in #4465
- doc: [TRTLLM-325]Integrate the NGC image in Makefile automation and document by @MartinMarciniszyn in #4400
- infra: Add qwen3 235B tests into QA by @byshiue in #4483
- feat: large-scale EP(part 2: MoE Load Balancer - core utilities) by @dongxuy04 in #4384
- [TRTLLM-5085][fix] Nemotron H correctness test by @tomeras91 in #4444
- [Docs] - Add date and commit info by @chzblych in #4448
- fix[nvbug/5286515]: trtllm-llmapi-launch on single node single gpu by @Superjomn in #4428
- fix: replace the image links in the blog by @Shixiaowei02 in #4489
- fix: Fix TRTLLMSampler beam width bug. by @dcampora in #4473
- refactor: Unify request order in TRT and PyTorch workflow by @Funatiq in #4096
- [TRTLLM-5273]feat/Use full attention mask if Llama3 is used as encoder and fix EarlyStopDecoder unsqueeze bug by @nvrohanv in #4290
- Build Triton for arm by @Tabrizian in #4456
- test: [CI] remove closed bugs by @xinhe-nv in #4417
- test(perf): Add remaining `Phi-4-mini-instruct` perf tests by @venkywonka in #4443
- feat: conditional disaggregation in disagg server by @zhengd-nv in #3974
- perf: Fuse gemm setup function for SM90/SM100 MOE plugin path by @djns99 in #4146
- fix: skip weights defined in create_weights for pp. by @yuxianq in #4447
- Feat: add chunked-attention kernels on Blackwell by @PerkzZheng in #4394
- fix [nvbug/5220766]: llmapi-launch add trtllm-bench test with engine building by @Superjomn in #4091
- [TRTLLM-5000][feat] Pytorch implementation of ngram drafter by @thorjohnsen in #3936
- test: NIXL single process test by @Shixiaowei02 in #4486
- Chore: waive torch compile test cases of deepseek v3 lite by @QiJune in #4508
- Feat: add deep_gemm swapab Kernel by @ruoqianguo in #4430
- unwaive some disagg test by @chuangz0 in #4476
- Clean: fmha codes by @PerkzZheng in #4496
- tests: add llama 3.3 70b 2 nodes tests by @xinhe-nv in #4391
- CI: waive test_fp8_block_scales_4gpus of deepseek v3 lite by @QiJune in #4520
- test: remove enable_overlap_schedule in pytorch config and set enable_chunked prefill to be true for isl>2048 cases by @ruodil in #4285
- docs: update the introduction for scaffolding by @WeiHaocheng in #4360
- test: add failed case in waive list and fix some test script issue for perf test by @ruodil in #4527
- tests: add qwene fp4 tests into QA test list & update sanity test list by @xinhe-nv in #4478
- feat: large-scale EP(part 3: refactor - FusedMoe for redundant expert) by @dongxuy04 in #4495
- refactor: DisaggExecutorTest by @Funatiq in #4398
- chore: clean ucx and nixl mirror. by @nv-guomingz in h...
v0.20.0rc3
Highlights
- Model Support
- Features
- Adopt new `logprob` definition in PyTorch flow (#4057) (see the usage sketch following the Highlights list)
- Support multiple LoRA adapters and TP (#3885)
- Add Piecewise CUDA Graph support (#3804)
- Add KV cache-aware router for disaggregated serving (#3831)
- Enable per-request stats with PyTorch backend (#4156)
- Support DeepSeek-R1 W4A8 on Hopper (#4123)
- Enable chunked context for FlashInfer (#4132)
- Support KV cache reuse for MLA (#3571)
- API
- Bug Fixes
- Benchmark
- Performance
- Infrastructure
- The base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:25.04-py3` (#4049)
- The dependent TensorRT version is updated to 10.10.0 (#4049)
- The dependent CUDA version is updated to 12.9.0 (#4049)
- The dependent public PyTorch version is updated to 2.7.0.
- The pre-built TensorRT-LLM wheel on PyPI is linked against PyTorch 2.7.0 now, which uses the CXX11 ABI (#4235)
- Documentation
- Known Issues
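The new `logprob` definition and the per-request statistics listed above surface through the LLM API's sampling options. Below is a minimal sketch, assuming a `logprobs` field on `SamplingParams` and a `logprobs` attribute on the completion output; the model path is a placeholder and the exact field names may differ from the released API.

```python
from tensorrt_llm import LLM, SamplingParams

# Sketch only: request top-K log probabilities per generated token.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # placeholder checkpoint
params = SamplingParams(max_tokens=16, logprobs=2)     # top-2 candidates per token

for output in llm.generate(["The capital of France is"], params):
    completion = output.outputs[0]
    print(completion.text)
    print(completion.logprobs)  # per-token logprobs under the new definition
```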
What's Changed
- feat: adopt new logprob definition in PyTorch flow by @tongyuantongyu in #4057
- infra: Add NIXL into the Dockerfile by @Shixiaowei02 in #3981
- feat: support multi lora adapters and TP by @shaharmor98 in #3885
- feat: Fallback to NCCL for various patterns when input size is large. by @hyukn in #4080
- Cherry-pick trtllm-gen from feat/llama4 to main by @chenfeiz0326 in #4086
- [fix] [AutoDeploy] flashinfer usage on H100 by @lucaslie in #4162
- fix: Fix incorrect conversion of Gen TPS/user by @FrankD412 in #4112
- [fix] Fix llama4 + eagle3 by @mikeiovine in #3998
- Support RingAttention in the BertAttention plugin and the DiT model by @ChunhuanLin in #3661
- fix: alltoall padding for chunked MoE by @dongxuy04 in #4157
- [feat] Allow overriding cli args with yaml file in trtllm-serve by @pcastonguay in #4164
- [TRTLLM-5147][Qwen3] fix: fix bug of attention dp on qwen3_moe model by @byshiue in #4141
- chore: Clean up the legacy DeepseekAllreudceFusionOp. by @hyukn in #4081
- test: add qwen3 and disaggregated serving accuracy tests to qa test list by @StanleySun639 in #4083
- [TRTLLM-3105][feat] Add Piecewise CUDA Graph Support by @yizhang-nv in #3804
- fix: change pp broadcast pattern for LPs by @hchings in #4130
- [#4085][fix] Fix `apply_per_channel_scale` for extremely large input sequence length. by @StudyingShao in #4089
- [nvbug/5262268][fix] Fix trtllm-bench for llama 4 by @mikeiovine in #4104
- chore: Fix pipeline break caused by previous PR (#4081) rebase + pipeline reuse by @hyukn in #4169
- [https://nvbugspro.nvidia.com/bug/5260676]test: skip fp8 quantization case for pre-ada by @crazydemo in #4095
- test: move mistral / mixtral test cases in QA test list into the new accuracy test suite by @crazydemo in #3440
- test: Add fp8kv to DS-v3-lite integration tests. by @bobboli in #3950
- [fix] Fix relaxed acceptance to support enabling it in context phase by @lfr-0531 in #4126
- test: skip tests on b200 by @xinhe-nv in #3913
- infra: Fix pipeline step error in post merge by @ZhanruiSunCh in #3948
- fix: library path of nixl by @Shixiaowei02 in #4184
- test: amend default pytorch extra-llm-api-config.yml in perf test by @ruodil in #4176
- [fix] Fix add_dummy_requests for spec decoding cases by @lfr-0531 in #4084
- test: [CI] Add failed cases into waives.txt by @xinhe-nv in #4165
- feat: support task collection for to collect information (#3328) by @WeiHaocheng in #3824
- Cherry-pick: Use multi-threading to load MoE expert weights by @chenfeiz0326 in #4137
- test: amend regex match for perf throughput by @ruodil in #4186
- chore: reduce size of the docker images by @MartinMarciniszyn in #3990
- [fix] trtllm-gen mla kernel warnings by @zhhuang-nv in #4119
- chore: Deprecate evaltool by @Tracin in #4173
- [fix][nvbug/5244009] Fix llama 4 test lists/scout accuracy issue by @mikeiovine in #4069
- perf: [TRTLLM-4717][perf] Set CUDA graph max batch size and padding in throughput benchmark. by @FrankD412 in #3875
- Refactor: Restructure C++ tests for better modularisation of non-shared code by @DomBrown in #4027
- Updating the multimodal models README to add steps for running phi-4-multimodal instruct by @mayani-nv in #3932
- fix: draft target README and assertion for logits-based acceptance by @mayani-nv in #4167
- Add initial list of CODEOWNERS by @kevinch-nv in #4105
- chore: PR to fix the formatting errors by @mayani-nv in #4200
- test: Remove CNN Dailymail tasks in favor of GSM8K by @syuoni in #4187
- [CI] waive two multi-gpu test cases by @QiJune in #4206
- [CI] update pytorch only file list by @QiJune in #4210
- chore:update modelopt to 0.29 by @nv-guomingz in #4150
- [Infra] Waive L0 test by @yiqingy0 in #4212
- remove cache_transceiver_prealloc_size by @chuangz0 in #4153
- [TRTQA-2802][fix]: add --host for mgmn serve examples script by @xinhe-nv in #4175
- tests: https://nvbugs/5219534 remove failed tests from test list by @xinhe-nv in #4113
- test: add llama_3.2_1B model and fix for test lora script issue by @ruodil in #4139
- chore: Update CODEOWNERS by @Funatiq in #4221
- [https://nvbugspro.nvidia.com/bug/5270564][test] skip per-hopper for llama4 by @crazydemo in #4211
- [TRTLLM-4911] feat(scaffolding): make sampling_params only setable by controller by @dc3671 in #4151
- Feat: support exporting softmax statistics and update the kernel-selection heuristic by @PerkzZheng in #4155
- infra: [TRTLLM-325] Prepare for NGC release - multiplatform build by @MartinMarciniszyn in #4191
- [feat] Support HyperCLOVAX-SEED-Text language part by @yechank-nvidia in #3902
- feat: Support the Structural Tag in guided decoding by @Ubospica in #4066
- feat: add kv cache aware router by @zhengd-nv in #3831
- refactor: Allow models to override apply_qk_norm. by @yuxianq in #4078
- [https://nvbugs/5214229] [fix] Unwaive lm_head quantization case by @syuoni in #4222
- doc: update switcher.json config by @niukuo in #4220
- Revert "Add initial list of CODEOWNERS (#4105)" by @Funatiq in #4234
- [TRTLLM-5188] fix: [AutoDeploy] update output shape of prepare_fused_mha_metadata_fake by @Fridah-nv in #4199
- fix: Reset planned states to avoid memory leak in TrtllmAttentionWrapper by @yuxianq in #4227
- Feat: Variable-Beam-Width-Search (VBWS) part4 by @wili-65535 in #3979
- [TRTLLM-5081] [test] Align parametrize_with_ids to the pytest behavior by @syuoni in #4090
- fix: reshape token_ids for lp in torch backend by @hchings in #4239
- feat: Add heuristic for GroupRMSNorm kernel selection. by @SimengLiu-nv in https://github.com/NV...
v0.19.0
TensorRT-LLM Release 0.19.0
Key Features and Enhancements
- The C++ runtime is now open sourced.
- PyTorch workflow
- Added DeepSeek V3/R1 support. Refer to `examples/deepseek_v3/README.md` and also to the blog `docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md`.
- Added Llava-Next support.
- Added BERT support.
- Added a C++ based decoder, which added support for:
- TopK / TopP.
- Bad words.
- Stop words.
- Embedding bias.
- Added Autotuner for custom-op-compatible tuning process.
- Added a Python-based Autotuner core framework for kernel tuning.
- Applied the Autotuner to fused MoE and NVFP4 linear operators for concept and performance evaluations.
- Added guided decoding support (XGrammar integration).
- Added pipeline parallelism support for the overlap scheduler in `PyExecutor`.
- Added Qwen2VL model support.
- Added mixed precision quantization support.
- Added pipeline parallelism with attention DP support.
- Added no-cache attention support.
- Added `PeftCacheManager` support.
- Added Qwen2.5-VL support and refactored Qwen2-VL.
- Added trtllm‑gen FP4 GEMM support.
- Added Qwen2 MoE support.
- Applied `AutoTuner` to both Fused MoE and NVFP4 Linear operators.
- Introduced a `UserBuffers` allocator.
- Added Deepseek eager mode AllReduce fusion support.
- Added Multi-Token Prediction (MTP) support. Refer to the "Multi-Token Prediction (MTP)" section of `examples/deepseek_v3/README.md`.
- Added FlashMLA support for SM90.
- Added support for enabling MTP with CUDA graph padding.
- Added initial EAGLE-3 implementation.
- Added support for FP8 MLA on NVIDIA Hopper and Blackwell GPUs.
- AutoDeploy for PyTorch workflow.
- The AutoDeploy for PyTorch workflow is an experimental feature in `tensorrt_llm._torch.auto_deploy`.
- AutoDeploy provides an automated path from off-the-shelf models to optimized deployment in the TensorRT-LLM runtime.
- Check out `examples/auto_deploy/README.md` for more details.
- LLM API
- [BREAKING CHANGE] Added dynamic logits processor support, and deprecated static logits processor.
- Added batched logits processor support.
- Added EAGLE support.
- Added abort request support.
- Added `get_stats` support (see the sketch at the end of this feature list).
- Added multi-node support for Slurm-based clusters, refer to `examples/llm-api/llm_mgmn_*.sh`.
- Added InternLM-XComposer2 support. Refer to the “InternLM-XComposer2” section in `examples/multimodal/README.md`.
- Added INT4-AWQ support for MoE models. Refer to the “AWQ Quantization” section in `examples/mixtral/README.md`.
- Added Qwen2-Audio support. Refer to `examples/qwen2audio/README.md`.
- Added Language-Adapter support. Refer to `examples/language_adapter/README.md`.
- Added STDiT for OpenSoRA text-to-video support. Refer to `examples/stdit/README.md`.
- Added vision encoders with tensor parallelism and context parallelism support. Refer to `examples/vit/README.md`.
- Added EXAONE-Deep support. Refer to `examples/exaone/README.md`.
- Added support for Phi-4-mini and Phi-4-MM.
- Added Gemma3 text-only model support. Refer to the “Run Gemma 3” section at `examples/gemma/README.md`.
- Added FP8 quantization support for Qwen2-VL.
- Added batched inference support for the LLM API MMLU example `examples/mmlu_llmapi.py`.
- Added FP4 quantization-layernorm fusion plugin support (Llama models only).
- Added Mamba-Hybrid support.
- Added NVILA video support. The support includes 1 prompt - N media and N prompt - N media batching modes.
- Added a `--quantize_lm_head` option in `examples/quantization/quantize.py` to support `lm_head` quantization.
- Added batched tensor FP4 quantization support.
- Added a `/metrics` endpoint for `trtllm-serve` to log iteration statistics.
- Added LoRA support for the Phi-2 model.
- Added returning context logits support for `trtllm-serve`.
- Added one-shot version for UserBuffer AllReduce-Normalization on FP16/BF16.
- Added request BW metric measurement for `disaggServerBenchmark`.
- Updated logits bitmask kernel to v3.
- Enabled CUDA graphs when attention DP was used and active requests on different GPUs were uneven.
- Added iteration log support for `trtllm-bench`.
- `fp8_blockscale_gemm` is now open-sourced.
- Added AWQ support for ModelOpt checkpoints.
- Added Linear block scale layout support in FP4 quantization.
- Added pre-quantized FP8 checkpoint support for Nemotron-mini-4b-instruct.
- Added Variable-Beam-Width-Search (VBWS) support (part2).
- Added LoRA support for Gemma.
- Refactored scaffolding worker, added OpenAI API worker support.
- Optionally split MoE inputs into chunks to reduce GPU memory usage.
- Added UCX IP interface support.
- [BREAKING CHANGE] Added output of first token to additional generation outputs.
- Added FP8 support for SM120 architecture.
- Registered `ENABLE_MULTI_DEVICE` and `ENABLE_UCX` as CMake options.
- Made the scaffolding Controller more generic.
- Breaking change: Added individual gatherContext support for each additional output.
- Enabled `PyExecutor` inference flow to estimate `max_num_tokens` for `kv_cache_manager`.
- Added `TLLM_OVERRIDE_LAYER_NUM` and `TLLM_TRACE_MODEL_FORWARD` environment variables for debugging.
- Supported aborting disconnected requests.
- Added an option to run disaggregated serving without context servers.
- Fixed and improved allreduce and fusion kernels.
- Enhanced the integrated robustness of scaffolding via `__init__.py`.
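For the LLM API additions above (multi-node Slurm support, `get_stats`, request abort), a minimal end-to-end sketch looks roughly as follows. The checkpoint path is a placeholder, and the return format of `get_stats()` is not specified by these notes, so treat the call as illustrative only.

```python
from tensorrt_llm import LLM, SamplingParams

# Minimal LLM API sketch; the model path below is a placeholder.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(max_tokens=32, temperature=0.8)

outputs = llm.generate(["Explain KV cache reuse in one sentence."], params)
print(outputs[0].outputs[0].text)

# Iteration statistics, per the "Added get_stats support" bullet above.
print(llm.get_stats())
```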
API Changes
- Exposed `kv_cache_retention_config` from the C++ `executor` API to the LLM API.
- Moved `BuildConfig` arguments to `LlmArgs` (see the configuration sketch after this list).
- Removed speculative decoding parameters from stateful decoders.
- Exposed `DecoderState` via bindings and integrated it in the decoder.
- Refactored the `LlmArgs` with `Pydantic` and migrated remaining pybinding configurations to Python.
- Refactored disaggregated serving scripts.
- Added `numNodes` to `ParallelConfig`.
- Redesigned the multi-stream API for DeepSeek.
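As a rough illustration of the consolidation above, build- and runtime-level options are supplied directly when constructing `LLM`. This is a sketch only, assuming the `KvCacheConfig` class and `free_gpu_memory_fraction` field named elsewhere in these notes; the checkpoint path is a placeholder.

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# Sketch only: options that previously lived on BuildConfig / the executor
# config are passed as keyword arguments on LLM.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder checkpoint
    kv_cache_config=KvCacheConfig(free_gpu_memory_fraction=0.9),
)

outputs = llm.generate(["Hello, TensorRT-LLM!"])
print(outputs[0].outputs[0].text)
```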
Fixed Issues
- Fixed misused length argument of PluginField. This also fixes #2685.
- Fixed a Llama-3.2 SmoothQuant convert checkpoint issue. (#2677)
- Fixed a bug when loading an engine using LoRA through the LLM API. (#2782)
- Fixed incorrect batch slot usage in the `addCumLogProbs` kernel.
- Fixed incorrect output for Llama-3.2-11B-Vision-Instruct. (#2796)
- Removed the need for `--extra-index-url https://pypi.nvidia.com` when running `pip install tensorrt-llm`.
Infrastructure Changes
- The dependent NVIDIA ModelOpt version is updated to 0.27.
Known Issues
- The PyTorch workflow on SBSA is incompatible with bare metal environments like Ubuntu 24.04. Please use the PyTorch NGC Container for optimal support on SBSA platforms.
v0.20.0rc2
Highlights
- Model Support
- Added support for Qwen3 (#4010)
- Features
- Integrated Llama4 input processor (#3383)
- Added CGA reduction FHMA kernels on Blackwell (#3763)
- Implemented `LogitsProcessor` in PyTorch backend (#3145)
- Unfused attention for native support (#3668)
- Added `group_rms_norm` kernel to normalize multiple inputs in a single operator (#3438)
- Supported multiple LoRA adapters and TP (#3885)
- API
- Bug Fixes
- Fixed a bug where a CUDA stream created as a default parameter was initialized at import time (#3764)
- Benchmark
- Performance
- Infra
- Open-sourced XQA kernels (#3762)
- Documentation
- Known Issues
What's Changed
- feat: llama4 multimodal input processor by @milesial in #3383
- fix: [nvbug/5234873] Detect pmix and raise error when mpirun is not used. by @yuxianq in #3858
- fix: fix bug of deepseek group_size setting by @byshiue in #3860
- Infra: Remove empty junit xml by @EmmaQiaoCh in #3794
- fix: Update num_of_ctx_tokens in iteration stats by @HuiGao-NV in #3785
- cacheTransceiver buffer manager by @chuangz0 in #3798
- fix: add warmup flag into py_executor to prevent enable profiler during wa… by @byshiue in #3852
- fix: trtllm-bench build trt engine on slurm by @Superjomn in #3825
- infra: install Triton in the base image by @Tabrizian in #3759
- fix bug of create cuda stream as default parameter which will be init… by @byshiue in #3764
- Test: waive intermittent test hang by @chzblych in #3894
- [TRTLLM-4786] infra: add scaffolding paths to pytorch only files by @dc3671 in #3835
- test: [CI] Add failed cases into waives.txt by @xinhe-nv in #3887
- test: [CI] Add failed cases into waives.txt by @xinhe-nv in #3867
- Fix the link of doc by @litaotju in #3903
- [TRTLLM-4638 ][feat] add best of n support with reward model in scaffolding by @dc3671 in #3807
- Add docs about DeepSeek-R1 long context support. by @qiaoxj07 in #3910
- [https://nvbugs/5247300] fix(requirements): fix neither 'setup.py' nor 'pyproject.toml' found by @dc3671 in #3906
- chore: Make llama4 MoE use maybe_execute_in_parallel by @mikeiovine in #3779
- fix: Fixing minor typo in allreduce kernel selection by @hyukn in #3912
- test: add deepseek v3 & r1 cases by @VALLIS-NERIA in #3528
- [fix] Fix a few issues with EAGLE3 in PyTorch backend by @mikeiovine in #3686
- waive test_attention_no_cache by @hchings in #3921
- fix: Fix FMHA-based MLA in the generation phase and add MLA unit test by @jinyangyuan-nvidia in #3863
- chore: remove DummyKvCacheManager. by @yuxianq in #3896
- fix(test): remove random context seq lengths and set random seed by @qixiang-99 in #3919
- feat: fix errors on scaffolding README by @WeiHaocheng in #3899
- fix: [https://nvbugspro.nvidia.com/bug/5242406][fix] Fix fp8 kvcache support by @hlu1 in #3877
- feat: add CGA reduction fmha kernels on Blackwell. by @PerkzZheng in #3763
- [CI] increase H100 CI nodes for PyTorch only pipelines by @QiJune in #3927
- [TRTLLM-4883][fix]: Update output speed calculation. by @FrankD412 in #3923
- chore: add num_scheduled_requests into print_log by @byshiue in #3914
- fix: revert #3858 by @yuxianq in #3928
- chore: change log level of some text from info to debug by @byshiue in #3930
- [fix] optimize cudaMemGetInfo for TllmGenFmhaRunner by @zhhuang-nv in #3907
- chore: Mass integration of release/0.19 into main by @DomBrown in #3841
- feat: parallel q_b_proj and concat by @hello-11 in #3917
- refactor: (part1) Add contraints doc for fusedMoe module. by @HuiGao-NV in #3882
- fix: get head_dim from model’s config. by @yuxianq in #3916
- TRTLLM-4624 feat: Add nvfp4 gemm and moe support for SM120 by @VALLIS-NERIA in #3770
- [feat] support ModelOpt NemotronH FP8 quantized checkpoints in TRTLLM pytorch flow by @tomeras91 in #3891
- fix: change the seq_lens sync copy to an async one by @lfr-0531 in #3786
- [https://nvbugs/5178445][fix] Skip blackwell tests for sm120 by @pamelap-nvidia in #3815
- chore: skip pipeline parallelism test of pytorch flow by @QiJune in #3947
- [TRTLLM-4623][fix] sync internal cutlass kernel changes by @pamelap-nvidia in #3968
- chore: update multi-gpu trigger file list by @QiJune in #3971
- test: [CI] remove closed bugs by @xinhe-nv in #3890
- chore: Remove duplicated get_sm_version. by @yuxianq in #3935
- chore: bump version to 0.20.0rc2 by @ZhanruiSunCh in #3949
- perf: Optimise MOE prologue to use fused setup function by @djns99 in #3790
- chore: remove release branch codeowners from main by @tburt-nv in #3954
- fix: [https://nvbugspro.nvidia.com/bug/5243482] If FlashMLA is used, the existence of FMHA based MLA kernels should not be checked. by @bobboli in #3862
- unwaive disagg tests by @chuangz0 in #3925
- infra: open source XQA kernels by @ming-wei in #3762
- feat: Mistral-Large-2 support in the Pytorch workflow by @hypdeb in #3845
- chore: update internal_cutlass_kernels. by @nv-guomingz in #3973
- [fix] Pad requests to maximum draft length in spec decode by @mikeiovine in #3957
- infra: add conan by @tburt-nv in #3744
- waive test_tinyllama_guided_decoding by @hchings in #3997
- [TRTLLM-4460] test: Use Llama 3.2 1B for Llama C++ tests by @DomBrown in #3206
- refactor: Clean up allreduce module for Deepseek V3 model by @hyukn in #3829
- [feat]: Allow for a settable end-of-sequence/padding token in max throughput benchmark. by @FrankD412 in #3776
- feat: Add multimodal embedding field in LlmRequest by @katec846 in #3855
- Llama4 processor fixes by @milesial in #3994
- fix: Add attention workspace memory check by @hlu1 in #3970
- feat: add relaxed acceptance for DS by @yweng0828 in #3865
- fix:https://nvbugs/5246733 by @nv-guomingz in #3989
- model: support Qwen3 by @byshiue in #4010
- test: [CI] Add failed cases into waives.txt by @xinhe-nv in #3943
- feat: Support Top-K logprobs and prompt_logprobs in LLMAPI by @hchings in #3388
- [AutoDeploy] Make all ranks agree on kv-cache size by @suyoggupta in #4007
- feat: LogitsProcessor in PyTorch backend by @hchings in #3145
- fix: Fallback to NCCL for various patterns when input size is large. by @hyukn in #4009
- feat: [AutoDeploy] unfusing attention by @lucaslie in #3668
- feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator. by @SimengLiu-nv in #3438
- model/infra: add ci and doc for qwen3 by @byshiue in #4022
- [Deepseek][fix] Fix Deepseek MTP with moe_backend=TRTLLM by @hlu1 in #4001
*...
v0.20.0rc1
Highlights
- Features
What's Changed
- move pytorch tests of LLM API into separate test files by @QiJune in #3745
- Fix double link to fp8_blockscale_gemm_src by @WilliamTambellini in #3707
- feat: add QMMA-based MLA kernels by @PerkzZheng in #3752
- chore: add pull request template by @byshiue in #3760
- Add running E2E LoRA flow by @shaharmor98 in #3648
- [infra] Waive L0 tests by @yiqingy0 in #3784
- feat: Add smart router for moe module by @zongfeijing in #3641
- test: add rcca tests 4753548 by @xinhe-nv in #3716
- fix: nvbugs/5234029 fix Qwen2.5-VL image test by @yechank-nvidia in #3726
- test: [CI] Add failed cases into waives.txt by @xinhe-nv in #3696
- fix: Intercept the error of multi-ranks bound to a single device by @Shixiaowei02 in #3525
- fix: remove the unnecessary metadata changes in mtp by @lfr-0531 in #3787
- test: Add DeepSeek-V3-Lite GSM8K tests by @syuoni in #3771
- infra: [TRTLLM-4417]Support auto trigger special test stage for special file change by @ZhanruiSunCh in #3478
- [TRTLLM-4763][test] Accuracy test improvement (Part 3.6): Deprecate mmlu_llmapi.py by @syuoni in #3802
- add passing E2E LoRA flow by @shaharmor98 in #3788
- fix: Limit llama4 context length to 8k by @mikeiovine in #3778
- fix: Fix C++ decoder synchronization in PyTorch by @dcampora in #3106
- fix: 5197419 and removed unused runtime kernels by @hypdeb in #3631
- chore: reorganize some unit tests of PyTorch by @QiJune in #3780
- doc: fix path after examples migration by @kaiyux in #3814
- chore: fix some invalid paths of contrib models by @QiJune in #3818
- chore: Fix KV cache block reuse flag name in quickstart_advanced by @mikeiovine in #3781
- Fix create_weights in attention by @hlu1 in #3692
- test: [CI] Add failed cases into waives.txt by @xinhe-nv in #3777
- [https://nvbugspro.nvidia.com/bug/5238602][fix] Package lm_eval configuration files by @syuoni in #3809
- [https://nvbugspro.nvidia.com/bug/5238599][fix] Normalize example path in accuracy tests by @syuoni in #3805
- fix: Set default prompts and media for multimodal quickstart example by @qixiang-99 in #3792
- Fix: Revert commit 25f9669 by @Shixiaowei02 in #3832
- chore: bump version to 0.20.0rc1 by @ZhanruiSunCh in #3834
- feat(part 2): Enhance the integrated robustness of scaffolding with init_.py #3305 by @WeiHaocheng in #3731
- fix: fix lora case failure by @HuiGao-NV in #3838
- Added NemotronH to PyTorch supported models by @vegaluisjose in #3663
- Adding local paths to the datasets to make them loadable in offline mode by @rakib-hasan in #3750
- fix: [Deepseek] Pass hidden_states_fp4 to shared_experts by @hlu1 in #3819
- chore: increase A30 for cpp test by @QiJune in #3811
- feat: Return logits in PyTorch flow by @tongyuantongyu in #3221
- feat: large-scale EP(part 1: Add MNNVL MoE A2A support) by @dongxuy04 in #3504
- [infra] Waive L0 tests by @yiqingy0 in #3853
- [chore] Add Llama 4 Maverick to quickstart README by @mikeiovine in #3848
- fix:[AutoDeploy] Patch for torch load_state_dict() by @sugunav14 in #3847
- feat: Add head size 72 support for QKV Preprocessing kernel by @qixiang-99 in #3743
- chore: update pytorch only change file list by @QiJune in #3873
- Test: Split C++ unit tests for CI granularity by @DomBrown in #3868
- TRTLLM-4875 feat: Add version switcher to doc by @kaiyux in #3846
Full Changelog: v0.20.0rc0...v0.20.0rc1
v0.20.0rc0
Highlights
- Model Support
- Features
- Added stream generation task scaffolding examples (#3527)
- Added unfused RoPE support in MLA (#3610)
- Multimodal models
- [Experimental] The TensorRT-LLM Triton backend has supported the LLM API (triton-inference-server/tensorrtllm_backend#742)
- Performance
- Optimized Large Embedding Tables in Multimodal Models (#3380)
- Infra
- Dependent `datasets` version was upgraded to 3.1.0 (#3490)
What's Changed
- chore: Unify Python NVTX call by @kaiyux in #3450
- doc: genai-perf benchmark & slurm multi-node for trtllm-serve doc by @LinPoly in #3407
- fix: disable KV cache reuse if using attention sink by @Funatiq in #3021
- doc: Minor fixes for documents by @kaiyux in #3577
- fix: nvbugs/5075538: fix cross attention mask when decoder input len > 1 by @VALLIS-NERIA in #3585
- chore: Mass integration of release/0.18 by @dcampora in #3421
- fix: LLM API _hf_model_dir for non-cached case by @syuoni in #3562
- ci: waive test_llm_multi_node_pytorch by @Superjomn in #3592
- fix: amend trtllm-bench command in the test by @Superjomn in #3563
- feat: Add stream generation task scaffolding examples by @narutolhy in #3527
- chore: bump version to 0.20.0rc0 by @ZhanruiSunCh in #3561
- chore: Add comments to modifications that fix TP size of DeepSeek-V3/R1 when using more than 16 GPUs by @jinyangyuan-nvidia in #3572
- chore: waive test_llm_phi_quantization_1gpu by @QiJune in #3603
- feat: Support cos_sin_cache in all cases. by @yuxianq in #3517
- fix: add SM90 guard for FP8 Blockscale GEMM by @lucifer1004 in #3575
- infra: Update user list by @niukuo in #3614
- feat: Adding FP8 BMM from Codegen by @evezhier in #3541
- waive test_llm_multi_node_with_postproc by @QiJune in #3628
- fix: Use hmac authentication for pickle encryption by @yibinl-nvidia in #3384
- Clean up linear.py, mlp.py, gated_mlp.py by @hlu1 in #3553
- feat: Support CUDA graphs for EAGLE3 by @mikeiovine in #3176
- feat: Nemotron-H model support by @vegaluisjose in #3430
- waive test_fp8_scaled_mm by @QiJune in #3637
- disable ib for ucx test by @chuangz0 in #3613
- tests: change qa perf test to trtllm-bench by @ruodil in #3189
- test: add quickstart test for nemotron-ultra by @crazydemo in #3596
- feat: Add support for smaller hidden_dim in AR fusion kernel by @yilin-void in #3609
- Fix rotary_emb param in NemotronH attention by @vegaluisjose in #3646
- chore: Use ellipsis as default value to detect whether residual argument is provided by @yuxianq in #3626
- feat/loraOp by @danielafrimi in #3455
- Cherry-pick: update fp8 doc (#3647) by @litaotju in #3650
- test: [CI] Add failed cases into waives.txt by @xinhe-nv in #3627
- Waive L0 tests by @yiqingy0 in #3651
- feat: allocate minimal blocks per window size by @netanel-haber in #3028
- test: remove benchmark test list on main branch by @crazydemo in #3644
- feat: Support unfused rope in MLA. by @yuxianq in #3610
- fix: Fix fused_moe cache fallback issue. by @hyukn in #3652
- fix: Correct reporting of text dtype for Llama 4. by @FrankD412 in #3494
- chore: waive test_llm_multi_node by @QiJune in #3664
- chore: update multi gpu trigger file list by @QiJune in #3665
- feat: adding multimodal (only image for now) support in trtllm-bench by @rakib-hasan in #3490
- fix sage attention headsize check error in bertAttentionPlugin.cpp by @Jackch-NV in #3660
- fix: llama4: address couple of issues in llama4 attention module by @chang-l in #3491
- chore: Refactor test_disaggregated.py by @Tabrizian in #3154
- test: Add llama 4 to ci by @dongfengy in #3520
- chore : Split more tests out of gpt tests by @peaceh-nv in #3524
- infra: Add step to generate new duration file by @EmmaQiaoCh in #3298
- refactor: Clean up CMakeLists.txt by @tongyuantongyu in #3479
- test: Unwaive test for nvbug_5150466 by @hchings in #3552
- feat: Add Dynasor-CoT in scaffolding examples by @Fsanic in #3501
- feat: Integrate GPUDirect Storage (GDS) into Executor API by @DomBrown in #3582
- Remove dummy forward path by @HuiGao-NV in #3669
- fix: hmac in remote mpi session by @Superjomn in #3649
- test: add kv cache event tests for disagg workers by @zhengd-nv in #3602
- chore: enable test_ptp_quickstart_advanced_mixed_precision back by @QiJune in #3667
- feat: Disaggregated router class by @pcastonguay in #3584
- Updating the run.py to make the draft target model run with the LLaMa 3 1B/8B by @mayani-nv in #3615
- feat: trtllm-serve multimodal support by @yechank-nvidia in #3590
- chore: Waive disaggregated load balance by @Tabrizian in #3687
- Clean up modeling_deepseek.py by @hlu1 in #3640
- fix: Fix disaggregated load balance test by @Tabrizian in #3689
- feat: Introduce feature properties for attention backend. by @yuxianq in #3659
- test:update waives.txt for nvbug 5219532 by @nv-guomingz in #3672
- test: Get Eagle tests working by @brb-nv in #3593
- move the reset models into `examples/models/core` directory by @QiJune in #3555
- fix: Refactor Deepseek tp_size calculation by @hlu1 in #3695
- Update Nemotron Super and Ultra in Supported Models and add an example by @Naveassaf in #3632
- infra: Add test list name check by @EmmaQiaoCh in #3097
- feat: [Deepseek] Add trtllm-gen MOE FP4 MOE backend by @hlu1 in #3387
- Waive L0 tests by @yiqingy0 in #3709
- fix: update test_user_buffers_mm_add_prologue atol (#3711) by @liji-nv in #3713
- fix: Support TLLM_OVERRIDE_LAYER_NUM for llama4. by @yuxianq in #3679
- Report number of context tokens in one iteration by @HuiGao-NV in #3691
- fix: Remove ParallelConfig. by @yuxianq in #3678
- feat: Offloading Multimodal embedding table to CPU in Chunked Prefill Mode by @katec846 in #3380
- fix: fix cublas_scaled_mm by @dc3671 in #3600
- chore: update FMHA cubin files by @jinyangyuan-nvidia in #3680
- test: add llama3.2 ptp test case by @StanleySun639 in #3363
- bug: Fix hang bug when context server doesn't have enough capacity for KV Cache by @Tabrizian in #3095
- refact: use pybind block key and hasher in disagg worker test by @zhengd-nv in #3712
- Fix: nvbugs/5232457 ModelOpt Mixtral AWQ OOM by @Barry-Delaney in #3714
- ci: unwaive multi_node test by @Superjomn in https://github.com/...
v0.19.0rc0
- Model Support
- Features
- Added FP8 support for SM120 architecture (#3248)
- Registered `ENABLE_MULTI_DEVICE` and `ENABLE_UCX` as CMake options (#3343)
- Made the scaffolding Controller more generic (#3416)
- Breaking change: Added individual gatherContext support for each additional output (#3374)
- Added trtllm‑gen FP4 GEMM for the PyTorch workflow (#3423)
- Added Qwen2 MoE support for PyTorch flow (#3369)
- Enabled `PyExecutor` inference flow to estimate `max_num_tokens` for `kv_cache_manager` (#3092)
- Added `TLLM_OVERRIDE_LAYER_NUM` and `TLLM_TRACE_MODEL_FORWARD` environment variables for debugging (#3417)
- Applied the PyTorch workflow compatible `AutoTuner` to both Fused MoE and NVFP4 Linear operators (#3151)
- Introduced a `UserBuffers` allocator for PyTorch flow (#3257)
- Supported aborting disconnected requests (#3214)
- Added support for FP8 MLA on NVIDIA Hopper and Blackwell GPUs (#3190)
- Added an option to run disaggregated serving without context servers (#3243)
- Enhanced RoPE support in AutoDeploy (#3115)
- Fixed and improved allreduce and fusion kernels (#3064)
- Added DeepSeek-V3 support in AutoDeploy (#3281)
- Enhanced the integrated robustness of scaffolding via `__init__.py` (#3312)
- API
- Bug fixes
- Fixed a wrong import of `KvCacheConfig` in `examples/gpqa_llmapi.py` (#3369)
- Fixed the test name (#3534)
- Fixed `max_seq_len` in `executor_config` (#3487)
- Removed a duplicated line of code (#3523)
- Disabled kv cache reuse for the prompt tuning test (#3474)
- Fixed the issue of a first‑generation token being returned twice in streaming (#3427)
- Added kv memory size per token calculation in the draft model (#3497)
- Switched ZMQ from a file socket to a TCP socket in RemoteMpiCommSession (#3462)
- Fixed PP for Llama (#3449)
- Updated the default excluded_modules value for the fp8rowwise recipe (#3477)
- Fixed disaggregation MTP with overlap (#3406)
- Stopped memory estimation in start_attention (#3485)
- Allowed the `context_and_generation` request type in disaggregated overlap (#3489)
- Fixed the partial match issue (#3413)
- Fixed Eagle decoding (#3456)
- Fixed the `py_decoding_iter` update in the decoder (#3297)
- Fixed the beam search diversity issue (#3375)
- Updated ucxx to avoid occasional segfaults when profiling (#3420)
- Fixed redrafter sampling (#3278)
- Fixed mllama end‑to‑end PyTorch flow (#3397)
- Reverted an extra CMake variable (#3351)
- Fixed issues with the fused MoE path (#3435)
- Fixed conflicting test names (#3316)
- Fixed failing DeepSeek-V3 unit tests (#3385)
- Fixed missing bias addition for `FP4Linear` (#3361)
- Fixed the runtime error in `test_deepseek_allreduce.py` (#3226)
- Fixed speculative decoding and multimodal input support (#3276)
- Fixed PyTorch nvsmall via `PyExecutor` and improved TP support (#3238)
- Fixed the p-tuning test bug (#3326)
- Performance
- Cached sin and cos in the model instead of using a global LRU cache (#3378)
- Deallocated tensors after use in MLA (#3286)
- Enabled DeepGEMM by default (#3341)
- Added a thread leak check and fixed thread/memory leak issues (#3270)
- Used cudaMalloc to allocate kvCache (#3303)
- Made ipc_periodically the default responses_handler (breaking change) (#3102)
- Used NVRTC for DeepGEMM JIT compilation (#3239)
- Optimized quantization kernels used in DeepSeek on Hopper (#3466)
- Documentation
- Added an example section for the multi‑node DeepSeek R1 benchmark on GB200 (#3519)
- Documented disaggregation performance tuning (#3516)
- Updated the perf‑benchmarking documentation for GPU configuration (#3458)
- Updated the README and added a benchmarking blog for DeepSeek‑R1 (#3232)
- Updated the documentation for using Draft‑Target‑Model (DTM) (#3366)
- Updated the README for disaggregated serving (#3323)
- Updated instructions to enable FP8 MLA for Deepseek. (#3488)
Full change log: 5aeef6d...258ae9c.
TensorRT-LLM Release 0.18.2
Key Features and Enhancements
- This update addresses known security issues. For the latest NVIDIA Vulnerability Disclosure Information visit https://www.nvidia.com/en-us/security/.
TensorRT-LLM Release 0.18.1
Key Features and Enhancements
- The 0.18.x series of releases builds upon the 0.17.0 release, focusing exclusively on dependency updates without incorporating features from the previous 0.18.0.dev pre-releases. These features will be included in future stable releases.
Infrastructure Changes
- The dependent `transformers` package version is updated to 4.48.3.