
Releases: NVIDIA/TensorRT-LLM

v0.21.0rc1 (Pre-release)

11 Jun 05:27 · 9c012d5

Highlights

  • Model Support
    • Add HyperCLOVAX-SEED-Vision support for PyTorch flow (#4799)
  • Features
    • Support generation logits in TRTLLM Sampler (#4819); a usage sketch follows this list
    • Support for large-scale EP (#4818)
    • Support XQA-based MLA on SM120 (#4858)
    • Add PositionEmbeddingType=0 support to XQA (#4934)
    • Add cache reuse support (selective cache transfer) in the MLA cache formatter (#4749)
    • Update DeepSeek FP8 TRT-LLM Gen cubins (#4643)
    • Add heuristics for checkpoint files prefetching (#4765)
    • Enable NVFP4 output for TRTLLM attention kernels (#4737)
    • Refactor Fused MoE (#4790)
    • Add integration of etcd (#3738)
    • Memoize weight shuffle index to speed up weight preprocessing in moe_backend=TRTLLM (#4826)
    • Enable disaggregated serving for Qwen3 (#4929)
  • API
    • Set _AutoDeployLlmArgs as primary config object (#4891)
  • Bug Fixes
    • Fix warmup phase batch size out of range (#4986)
    • Fix buffer count (#5007)
    • Fix nvbug 5324252: broken test_resource_manager.py (#4925)
    • Fix nvbug 5280806: 2-model spec decode flow (#4807)
    • Fix nvbug 5324248: broken test_pytorch_model_engine.py (#4973)
    • Fix cuda graph padding for spec decoding (#4853)
    • Correct the order of llm request state (#4781)
    • Handle OOMs during KV cache estimation (#4690)
    • Only pass fast_build=true to non-pytorch backend (#4920)
    • Fix hang in the no-fusion allreduce path (#4594)
    • Deprecate AutoDeploy CI post-merge tests and keep them for local testing (#4892)
    • Fix nvbug 5302895: test_trtllm_bench_llmapi_launch failure (#4835)
    • Fix Llama 4 long-context issue (#4809)
    • Fix nvbug 5300080: the bug of setting attention_chunk_size, and enable chunked attention in the generation phase by default (#4693)
    • Fix nvbug 5294316: queued request stats (#4714)
    • Fix max_num_sequences calculation with overlap scheduling (#4532)
    • Fix trtllm-bench hang issue due to LLM API IPC (#4798)
    • Fix a pd+mtp accuracy issue (#4536)
  • Benchmark
    • Add beam width to the low-latency benchmark (#4812)
    • Fix trtllm-bench iter_stats and cuda_graph_batch_sizes errors (#4827)
  • Performance
  • Infrastructure
    • The TensorRT-LLM team now formally releases a Docker image on NGC.
    • Update jnlp version in container image (#4944)
    • Upgrade ModelOpt to 0.31.0 (#5003)
    • Upgrade FlashInfer to 0.2.5 (#5004)
  • Documentation
    • Document the Docker release image on NGC (#4705)
    • Fix readme for disaggregated serving (#4846)
    • Fix draft target README and set exclude_input_in_output to False (#4882)
    • Blog: Scaling Expert Parallelism in TensorRT-LLM (Part 1: Design and Implementation of Large-scale EP) (#4958)
  • Known Issues
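
As referenced in the Features list above, the TRTLLM sampler can now return generation logits. The snippet below is a minimal, hedged sketch of requesting them through the LLM API; the return_generation_logits field and the generation_logits output attribute are assumptions carried over from earlier LLM API releases, so verify them against the current API reference.

```python
# Hedged sketch: requesting generation logits through the LLM API.
# Assumes SamplingParams exposes a `return_generation_logits` flag and that the
# per-sequence output carries a `generation_logits` attribute, as in earlier
# LLM API releases; verify both against the current API reference.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # any supported HF id or local path

params = SamplingParams(max_tokens=8, return_generation_logits=True)
for output in llm.generate(["Hello, my name is"], params):
    completion = output.outputs[0]
    print(completion.text)
    # Expected to hold one row of logits per generated token when requested.
    print(getattr(completion, "generation_logits", None))
```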


v0.21.0rc0 (Pre-release)

04 Jun 02:48 · 9ae2ce6

Highlights

  • Model Support
  • Features
    • Support for large-scale EP (#4384, #4495, #4615)
    • Added chunked attention kernels (#4291, #4394); a conceptual mask sketch follows this list
    • ScaffoldingLLM now supports MCP (#4410)
    • Integrated NIXL into the communication layer of the disaggregated service (#3934, #4125)
    • Integrated Hopper chunked attention kernels (#4330)
    • Enabled TRT backend for Python runtime in disaggregated service (#4243)
    • Added FP8 block-scale GEMM support on SM89 (#4481)
    • Added Qwen3 FP4 MoE TRTLLM backend for low latency (#4530)
    • Introduced sliding-window attention kernels for the generation phase on Blackwell (#4564)
    • Added vanilla MoE (#4682)
    • Integrated fused QKNorm + RoPE (#4611)
    • Added Fabric Memory support for KV cache transfer (#4717)
  • API
  • Bug Fixes
    • Resolved Torch compile issue for DeepSeek V3 (#3952)
    • Fixed trtllm-llmapi-launch for single-node, single-GPU setups (#4428)
    • Removed duplicate tokenization in generation server (#4492)
    • Fixed cancel request handling for attentionDP (#4648)
    • Fixed disaggregated service hang when MNNVL two-shot AllReduce is enabled (#4678)
    • Fixed queued request statistics (#4806)
    • Fixed EP load balancer with MTP layer and route offset by EP rank (#4767)
    • Resolved accuracy and illegal memory access issues with MTP + attention DP (#4379)
  • Benchmark
    • Added all_reduce.py benchmark script for testing (#4537)
  • Performance
  • Infrastructure
    • Integrated NGC image into Makefile automation and documentation (#4400)
    • Built Triton for ARM architecture (#4456)
    • Added triton release container (#4455)
    • Refactored Docker build image (Groovy) and added NGC image support (#4294)
    • Upgraded Cutlass to version 4.0 (#4794)
  • Documentation
    • Updated descriptions for NGC Docker images (#4702, #4705)
  • Known Issues
    • Two important fixes are NOT included in this release but are already on the main branch:
      • Fix for setting attention_chunk_size and enabling chunked attention in the generation phase by default (#4693)
      • Fix for the LLM API benchmark failure caused by a serialization issue (#4835)
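
The chunked-attention kernels referenced above ship as fused Hopper and Blackwell kernels. For orientation only, the sketch below builds the boolean mask that chunked causal attention implies: each query position attends only to non-future positions inside its own fixed-size chunk. It is a conceptual reference, not the kernel implementation.

```python
# Conceptual sketch only: the boolean mask implied by chunked causal attention.
# Each query position may attend to key positions in the same fixed-size chunk,
# and never to future positions. The shipped feature is a set of fused kernels.
import torch

def chunked_causal_mask(seq_len: int, chunk_size: int) -> torch.Tensor:
    pos = torch.arange(seq_len)
    same_chunk = (pos[:, None] // chunk_size) == (pos[None, :] // chunk_size)
    causal = pos[:, None] >= pos[None, :]
    return same_chunk & causal  # True where attention is allowed

print(chunked_causal_mask(seq_len=8, chunk_size=4).int())
```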

What's Changed

  • Refine doc by @juney-nvidia in #4420
  • Refine doc by @juney-nvidia in #4421
  • refine doc by @juney-nvidia in #4422
  • Remove vila test by @Tabrizian in #4376
  • [TRTLLM-4618][feat] Add Nemotron Super 49B FP8 test on RTX6000 Pro (SM120) by @farazkh80 in #4363
  • tests: add qa test mentioned in docs by @crazydemo in #4357
  • [Infra] - Always push the release images in the post-merge job by @chzblych in #4426
  • tests: Add test cases for rcca cases by @crazydemo in #4347
  • chore: cleanup perf_evaluator code by @Superjomn in #3833
  • feat: Add pp support for hybrid attn/mamba model by @yuxianq in #4358
  • fix: wrong argument name enable_overlap_scheduler by @kaiyux in #4433
  • Update "Roadmap" link under README.md to the issues with Roadmap label by @AdamzNV in #4425
  • fix potential issues in allreduce fusion kernel and ut by @yilin-void in #4226
  • [TRTLLM-4638] feat(scaffolding): update Reward Controller to PRM specific controller with step split by @dc3671 in #4337
  • feat: NIXL interface integration by @Shixiaowei02 in #3934
  • Downgrade the logger level for fallback tactic warning. by @hyukn in #4440
  • Test: Improve model re-use in C++ DGX tests for CI stability by @DomBrown in #4263
  • fix: temp disable the problem test by @Shixiaowei02 in #4445
  • Add llama4 disagg accuracy tests by @Tabrizian in #4336
  • [https://nvbugs/5123103][fix] Fix torch compile for DeepSeekV3 by @liji-nv in #3952
  • [Docs] - Reapply #4220 by @chzblych in #4434
  • [TRTLLM-4618][feat] Fix cutlass MoE GEMM fallback failure on FP8 + add e2e test for Mixtral 8x7B FP8 on RTX6000 Pro (SM120) by @farazkh80 in #4335
  • [Feat] add chunked-attention kernels on Hopper (for llama4) by @PerkzZheng in #4291
  • test(perf): Add some Llama-3_3-Nemotron-Super-49B-v1 integration-perf-tests (TRT flow, trtllm-bench) by @venkywonka in #4128
  • fix: [nvbugs/5287097] Align PP layer distribution between pytorch and TRT flow. by @yuxianq in #4399
  • feat: Low Precision Allreduce for PCIe based GPU by @kanghui0204 in #4344
  • test: [CI] Add failed cases into waives.txt by @xinhe-nv in #4429
  • [TRTLLM-4932] Add CLI accuracy tests for Llama-3.3-70B-Instruct and LLM API BF16 variant by @moraxu in #4362
  • test: update test filter in perf test yml file to select cases by gpu name and add cases for RTX 6000 pro by @ruodil in #4282
  • [AutoDeploy] HF factory improvements by @lucaslie in #4371
  • chore: bump version to 0.21.0rc0 by @ZhanruiSunCh in #4465
  • doc: [TRTLLM-325]Integrate the NGC image in Makefile automation and document by @MartinMarciniszyn in #4400
  • infra: Add qwen3 235B tests into QA by @byshiue in #4483
  • feat: large-scale EP(part 2: MoE Load Balancer - core utilities) by @dongxuy04 in #4384
  • [TRTLLM-5085][fix] Nemotron H correctness test by @tomeras91 in #4444
  • [Docs] - Add date and commit info by @chzblych in #4448
  • fix[nvbug/5286515]: trtllm-llmapi-launch on single node single gpu by @Superjomn in #4428
  • fix: replace the image links in the blog by @Shixiaowei02 in #4489
  • fix: Fix TRTLLMSampler beam width bug. by @dcampora in #4473
  • refactor: Unify request order in TRT and PyTorch workflow by @Funatiq in #4096
  • [TRTLLM-5273]feat/Use full attention mask if Llama3 is used as encoder and fix EarlyStopDecoder unsqueeze bug by @nvrohanv in #4290
  • Build Triton for arm by @Tabrizian in #4456
  • test: [CI] remove closed bugs by @xinhe-nv in #4417
  • test(perf): Add remaining Phi-4-mini-instruct perf tests by @venkywonka in #4443
  • feat: conditional disaggregation in disagg server by @zhengd-nv in #3974
  • perf: Fuse gemm setup function for SM90/SM100 MOE plugin path by @djns99 in #4146
  • fix: skip weights defined in create_weights for pp. by @yuxianq in #4447
  • Feat: add chunked-attention kernels on Blackwell by @PerkzZheng in #4394
  • fix [nvbug/5220766]: llmapi-launch add add trtllm-bench test with engine building by @Superjomn in #4091
  • [TRTLLM-5000][feat] Pytorch implementation of ngram drafter by @thorjohnsen in #3936
  • test: NIXL single process test by @Shixiaowei02 in #4486
  • Chore: waive torch compile test cases of deepseek v3 lite by @QiJune in #4508
  • Feat: add deep_gemm swapab Kernel by @ruoqianguo in #4430
  • unwaive some disagg test by @chuangz0 in #4476
  • Clean: fmha codes by @PerkzZheng in #4496
  • tests: add llama 3.3 70b 2 nodes tests by @xinhe-nv in #4391
  • CI: waive test_fp8_block_scales_4gpus of deepseek v3 lite by @QiJune in #4520
  • test: remove enable_overlap_schedule in pytorch config and set enable_chunked prefill to be true for isl>2048 cases by @ruodil in #4285
  • docs: update the introduction for scaffolding by @WeiHaocheng in #4360
  • test: add failed case in waive list and fix some test script issue for perf test by @ruodil in #4527
  • tests: add qwene fp4 tests into QA test list & update sanity test list by @xinhe-nv in #4478
  • feat: large-scale EP(part 3: refactor - FusedMoe for redundant expert) by @dongxuy04 in #4495
  • refactor: DisaggExecutorTest by @Funatiq in #4398
  • chore: clean ucx and nixl mirror. by @nv-guomingz in h...

v0.20.0rc3 (Pre-release)

20 May 09:42 · 039f7e3

Highlights

  • Model Support
    • Support Mistral Small 3.1 24B VLM in TRT workflow (#4183)
    • Support Gemma3-1b-it in PyTorch workflow (#3999)
  • Features
    • Adopt new logprob definition in PyTorch flow (#4057)
    • Support multiple LoRA adapters and TP (#3885)
    • Add Piecewise CUDA Graph support (#3804)
    • Add KV cache-aware router for disaggregated serving (#3831)
    • Enable per-request stats with PyTorch backend (#4156)
    • Support DeepSeek-R1 W4A8 on Hopper (#4123)
    • Enable chunked context for FlashInfer (#4132)
    • Support KV cache reuse for MLA (#3571)
  • API
    • Allow overriding CLI arguments with a YAML file in trtllm-serve (#4164); a configuration sketch follows this list
    • Remove deprecated GptSession/V1 from TRT workflow (#4092)
  • Bug Fixes
    • Fix attention DP bug on Qwen3 MoE model (#4141)
    • Fix illegal memory access when running LLaMA 4 with CUDA Graph enabled (#4101)
  • Benchmark
    • Remove deprecated Python runtime benchmark (#4171)
    • Add benchmark support for scaffolding (#4286)
  • Performance
  • Infrastructure
    • The base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:25.04-py3 (#4049)
    • The dependent TensorRT version is updated to 10.10.0 (#4049)
    • The dependent CUDA version is updated to 12.9.0 (#4049)
    • The dependent public PyTorch version is updated to 2.7.0.
    • The pre-built TensorRT-LLM wheel on PyPI is now linked against PyTorch 2.7.0, which uses the CXX11 ABI (#4235)
  • Documentation
  • Known Issues
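
For the YAML override noted under API above, the sketch below writes an override file and shows how it might be passed to trtllm-serve. The --extra_llm_api_options flag and the keys used here are assumptions based on the documented pattern; check the 0.20 documentation for the exact flag name and schema.

```python
# Hedged sketch: producing a YAML override file for trtllm-serve.
# The flag name (--extra_llm_api_options) and the keys below are assumptions;
# consult the release documentation for the supported schema.
import yaml  # requires pyyaml

overrides = {
    "kv_cache_config": {"free_gpu_memory_fraction": 0.8},
    "max_batch_size": 16,
}

with open("serve_overrides.yaml", "w") as f:
    yaml.safe_dump(overrides, f)

# Then launch the server with the overrides applied, for example:
#   trtllm-serve <model> --extra_llm_api_options serve_overrides.yaml
```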


v0.19.0

09 May 12:55 · c6f7d42

TensorRT-LLM Release 0.19.0

Key Features and Enhancements

  • The C++ runtime is now open sourced.
  • PyTorch workflow
    • Added DeepSeek V3/R1 support. Refer to examples/deepseek_v3/README.md and to the blog docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md.
    • Added Llava-Next support.
    • Added BERT support.
    • Added a C++ based decoder, which added support for:
      • TopK / TopP.
      • Bad words.
      • Stop words.
      • Embedding bias.
    • Added Autotuner for custom-op-compatible tuning process.
      • Added a Python-based Autotuner core framework for kernel tuning.
      • Applied the Autotuner to fused MoE and NVFP4 linear operators for concept and performance evaluations.
    • Added guided decoding support (XGrammar integration).
    • Added pipeline parallelism support for the overlap scheduler in PyExecutor.
    • Added Qwen2VL model support.
    • Added mixed precision quantization support.
    • Added pipeline parallelism with attention DP support.
    • Added no-cache attention support.
    • Added PeftCacheManager support.
    • Added Qwen2.5‑VL support and refactored Qwen2‑VL.
    • Added trtllm‑gen FP4 GEMM support.
    • Added Qwen2 MoE support.
    • Applied AutoTuner to both Fused MoE and NVFP4 Linear operators.
    • Introduced a UserBuffers allocator.
    • Added Deepseek eager mode AllReduce fusion support.
    • Added Multi-Token Prediction (MTP) support. Refer to the “Multi-Token Prediction (MTP)” section of examples/deepseek_v3/README.md.
    • Added FlashMLA support for SM90.
    • Added support for enabling MTP with CUDA graph padding.
    • Added initial EAGLE-3 implementation.
    • Added support for FP8 MLA on NVIDIA Hopper and Blackwell GPUs.
  • AutoDeploy for PyTorch workflow.
    • The AutoDeploy for PyTorch workflow is an experimental feature in tensorrt_llm._torch.auto_deploy.
    • AutoDeploy provides an automated path from off-the-shelf models to optimized deployment in the TensorRT-LLM runtime.
    • Check out examples/auto_deploy/README.md for more details.
  • LLM API
    • [BREAKING CHANGE] Added dynamic logits processor support, and deprecated static logits processor.
    • Added batched logits processor support.
    • Added EAGLE support.
    • Added abort request support.
    • Added get_stats support.
    • Added multi-node support for Slurm-based clusters, refer to examples/llm-api/llm_mgmn_*.sh.
  • Added InternLM-XComposer2 support. Refer to “InternLM-XComposer2” section in examples/multimodal/README.md.
  • Added INT4-AWQ support for MoE models. Refer to the “AWQ Quantization” section in examples/mixtral/README.md.
  • Added Qwen2-Audio support. Refer to examples/qwen2audio/README.md.
  • Added Language-Adapter support. Refer to examples/language_adapter/README.md.
  • Added STDiT for OpenSoRA text-to-video support. Refer to examples/stdit/README.md.
  • Added vision encoders with tensor parallelism and context parallelism support. Refer to examples/vit/README.md.
  • Added EXAONE-Deep support. Refer to examples/exaone/README.md.
  • Added support for Phi-4-mini and Phi‑4‑MM.
  • Added Gemma3 text‑only model support. Refer to "Run Gemma 3" section at examples/gemma/README.md.
  • Added FP8 quantization support for Qwen2-VL.
  • Added batched inference support for the LLM API MMLU example examples/mmlu_llmapi.py.
  • Added FP4 quantization-layernorm fusion plugin support (Llama models only).
  • Added Mamba-Hybrid support.
  • Added NVILA video support, including 1-prompt/N-media and N-prompt/N-media batching modes.
  • Added a --quantize_lm_head option to examples/quantization/quantize.py to support lm_head quantization.
  • Added batched tensor FP4 quantization support.
  • Added a /metrics endpoint for trtllm-serve to log iteration statistics; a polling sketch follows this list.
  • Added LoRA support for Phi-2 model.
  • Added returning context logits support for trtllm-serve.
  • Added one-shot version for UserBuffer AllReduce-Normalization on FP16/BF16.
  • Added request BW metric measurement for disaggServerBenchmark.
  • Updated logits bitmask kernel to v3.
  • Enabled CUDA graphs when attention DP is used and the number of active requests differs across GPUs.
  • Added iteration log support for trtllm-bench.
  • fp8_blockscale_gemm is now open-sourced.
  • Added AWQ support for ModelOpt checkpoints.
  • Added Linear block scale layout support in FP4 quantization.
  • Added pre-quantized FP8 checkpoint support for Nemotron-mini-4b-instruct.
  • Added Variable-Beam-Width-Search (VBWS) support (part2).
  • Added LoRA support for Gemma.
  • Refactored scaffolding worker, added OpenAI API worker support.
  • Optionally split MoE inputs into chunks to reduce GPU memory usage.
  • Added UCX IP interface support.
  • [BREAKING CHANGE] Added output of first token to additional generation outputs.
  • Added FP8 support for SM120 architecture.
  • Registered ENABLE_MULTI_DEVICE and ENABLE_UCX as CMake options.
  • Made the scaffolding Controller more generic.
  • [BREAKING CHANGE] Added individual gatherContext support for each additional output.
  • Enabled PyExecutor inference flow to estimate max_num_tokens for kv_cache_manager.
  • Added TLLM_OVERRIDE_LAYER_NUM and TLLM_TRACE_MODEL_FORWARD environment variables for debugging.
  • Supported aborting disconnected requests.
  • Added an option to run disaggregated serving without context servers.
  • Fixed and improved allreduce and fusion kernels.
  • Enhanced the integrated robustness of scaffolding via __init__.py.
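
The /metrics endpoint mentioned above can be polled like any HTTP endpoint once trtllm-serve is running. The host and port below assume a default local launch and are placeholders; adjust them to your deployment.

```python
# Hedged sketch: polling the trtllm-serve /metrics endpoint for iteration statistics.
# Assumes a server is already running locally; the host and port are placeholders.
import requests

resp = requests.get("http://localhost:8000/metrics", timeout=5)
resp.raise_for_status()
print(resp.text)  # iteration statistics reported by the server
```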

API Changes

  • Exposed kv_cache_retention_config from the C++ executor API to the LLM API.
  • Moved BuildConfig arguments to LlmArgs.
  • Removed speculative decoding parameters from stateful decoders.
  • Exposed DecoderState via bindings and integrated it in decoder.
  • Refactored the LlmArgs with Pydantic and migrated remaining pybinding configurations to Python.
  • Refactored disaggregated serving scripts.
  • Added numNodes to ParallelConfig.
  • Redesigned the multi‑stream API for DeepSeek.

Fixed Issues

  • Fixed misused length argument of PluginField. This also fixes #2685.
  • Fixed a Llama-3.2 SmoothQuant convert checkpoint issue. (#2677)
  • Fixed a bug when loading an engine using LoRA through the LLM API. (#2782)
  • Fixed incorrect batch slot usage in addCumLogProbs kernel.
  • Fixed incorrect output for Llama-3.2-11B-Vision-Instruct. (#2796)
  • Removed the need to pass --extra-index-url https://pypi.nvidia.com when running pip install tensorrt-llm.

Infrastructure Changes

  • The dependent NVIDIA ModelOpt version is updated to 0.27.

Known Issues

  • The PyTorch workflow on SBSA is incompatible with bare metal environments like Ubuntu 24.04. Please use the PyTorch NGC Container for optimal support on SBSA platforms.

v0.20.0rc2 (Pre-release)

13 May 09:27 · 74df12b

Highlights

  • Model Support
    • Added support for Qwen3 (#4010)
  • Features
    • Integrated Llama4 input processor (#3383)
    • Added CGA reduction FHMA kernels on Blackwell (#3763)
    • Implemented LogitsProcessor in PyTorch backend (#3145)
    • Added unfused attention for native support (#3668)
    • Added group_rms_norm kernel to normalize multiple inputs in a single operator (#3438); a reference sketch follows this list
    • Supported multiple LoRA adapters and TP (#3885)
  • API
    • Introduced multimodal embedding field in LlmRequest (#3855)
    • Enabled overriding CLI arguments with YAML file in trtllm-serve (#4164)
  • Bug Fixes
    • Fixed a bug where a CUDA stream created as a default argument was initialized at import time (#3764)
  • Benchmark
  • Performance
  • Infra
    • Open-sourced XQA kernels (#3762)
  • Documentation
  • Known Issues
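
The group_rms_norm item above adds a fused kernel that normalizes several inputs in one operator. As a reference for the semantics only (the real operator is a single fused kernel whose Python binding may differ), a plain PyTorch equivalent might look like this:

```python
# Reference semantics only: RMSNorm applied to a group of inputs in one call.
# The shipped group_rms_norm is a single fused kernel; this loop is not it.
import torch

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    variance = x.float().pow(2).mean(dim=-1, keepdim=True)
    return (x.float() * torch.rsqrt(variance + eps)).to(x.dtype) * weight

def group_rms_norm_reference(inputs, weights, eps: float = 1e-6):
    # Normalize each input with its own weight; the fused kernel does this in one launch.
    return [rms_norm(x, w, eps) for x, w in zip(inputs, weights)]

xs = [torch.randn(2, 4096), torch.randn(2, 1024)]
ws = [torch.ones(4096), torch.ones(1024)]
print([out.shape for out in group_rms_norm_reference(xs, ws)])
```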


v0.20.0rc1 (Pre-release)

29 Apr 08:54 · d747223

Highlights

  • Features
    • PyTorch workflow
      • Part 1 of large-scale EP: Added MNNVL MoE A2A support (#3504)
      • Added smart router for the MoE module (#3641)
      • Added head size 72 support for QKV preprocessing kernel (#3743)

What's Changed

Full Changelog: v0.20.0rc0...v0.20.0rc1

v0.20.0rc0 (Pre-release)

23 Apr 15:42 · b16a127

Highlights

  • Model Support
    • Added Nemotron-H model support (#3430)
    • Added Dynasor-CoT in scaffolding examples (#3501)
  • Features
    • Added stream generation task scaffolding examples (#3527)
    • Added unfused RoPE support in MLA (#3610)
    • Multimodal models
      • Added support in trtllm-serve (#3590)
      • Added support in trtllm-bench; currently limited to image inputs only (#3490)
    • [Experimental] The TensorRT-LLM Triton backend now supports the LLM API (triton-inference-server/tensorrtllm_backend#742)
  • Performance
    • Optimized Large Embedding Tables in Multimodal Models (#3380)
  • Infra
    • The dependent datasets package version was upgraded to 3.1.0 (#3490)


v0.19.0rc0 (Pre-release)

18 Apr 23:19 · 258ae9c

Highlights

  • Model Support
    • Added Llama 4 support. (#3302)
    • Added support for Phi‑4‑MM (#3296)
    • Added Gemma3 text‑only model support. Refer to "Run Gemma 3" section at examples/gemma/README.md. (#3247)
    • Added Qwen2.5‑VL support for PyTorch workflow and refactored Qwen2‑VL (#3156)
  • Features
    • Added FP8 support for SM120 architecture (#3248)
    • Registered ENABLE_MULTI_DEVICE and ENABLE_UCX as CMake options (#3343)
    • Made the scaffolding Controller more generic (#3416)
    • Breaking change: Added individual gatherContext support for each additional output (#3374)
    • Added trtllm‑gen FP4 GEMM for the PyTorch workflow (#3423)
    • Added Qwen2 MoE support for PyTorch flow (#3369)
    • Enabled PyExecutor inference flow to estimate max_num_tokens for kv_cache_manager (#3092)
    • Added TLLM_OVERRIDE_LAYER_NUM and TLLM_TRACE_MODEL_FORWARD environment variables for debugging (#3417); a usage sketch follows this list
    • Applied the PyTorch workflow compatible AutoTuner to both Fused MoE and NVFP4 Linear operators (#3151)
    • Introduced a UserBuffers allocator for PyTorch flow (#3257)
    • Supported aborting disconnected requests (#3214)
    • Added support for FP8 MLA on NVIDIA Hopper and Blackwell GPUs (#3190)
    • Added an option to run disaggregated serving without context servers (#3243)
    • Enhanced RoPE support in AutoDeploy (#3115)
    • Fixed and improved allreduce and fusion kernels (#3064)
    • Added DeepSeek-V3 support in AutoDeploy (#3281)
    • Enhanced the integrated robustness of scaffolding via __init__.py (#3312)
  • API
    • Added numNodes to ParallelConfig (#3346)
    • Redesigned the multi‑stream API for DeepSeek (#3459)
  • Bug fixes
    • Fixed a wrong import of KvCacheConfig in examples/gpqa_llmapi.py (#3369)
    • Fixed the test name (#3534)
    • Fixed max_seq_len in executor_config (#3487)
    • Removed a duplicated line of code (#3523)
    • Disabled kv cache reuse for the prompt tuning test (#3474)
    • Fixed the issue of a first‑generation token being returned twice in streaming (#3427)
    • Added kv memory size per token calculation in the draft model (#3497)
    • Switched ZMQ from a file socket to a TCP socket in RemoteMpiCommSession (#3462)
    • Fixed PP for Llama (#3449)
    • Updated the default excluded_modules value for the fp8rowwise recipe (#3477)
    • Fixed disaggregation MTP with overlap (#3406)
    • Stopped memory estimation in start_attention (#3485)
    • Allowed the context_and_generation request type in disaggregated overlap (#3489)
    • Fixed the partial match issue (#3413)
    • Fixed Eagle decoding (#3456)
    • Fixed the py_decoding_iter update in the decoder (#3297)
    • Fixed the beam search diversity issue (#3375)
    • Updated ucxx to avoid occasional segfaults when profiling (#3420)
    • Fixed redrafter sampling (#3278)
    • Fixed mllama end‑to‑end PyTorch flow (#3397)
    • Reverted an extra CMake variable (#3351)
    • Fixed issues with the fused MoE path (#3435)
    • Fixed conflicting test names (#3316)
    • Fixed failing DeepSeek-V3 unit tests (#3385)
    • Fixed missing bias addition for FP4Linear (#3361)
    • Fixed the runtime error in test_deepseek_allreduce.py (#3226)
    • Fixed speculative decoding and multimodal input support (#3276)
    • Fixed PyTorch nvsmall via PyExecutor and improved TP support (#3238)
    • Fixed the p‑tuning test bug (#3326)
  • Performance
    • Cached sin and cos in the model instead of using a global LRU cache (#3378)
    • Deallocated tensors after use in MLA (#3286)
    • Enabled DeepGEMM by default (#3341)
    • Added a thread leak check and fixed thread/memory leak issues (#3270)
    • Used cudaMalloc to allocate kvCache (#3303)
    • Made ipc_periodically the default responses_handler (breaking change) (#3102)
    • Used NVRTC for DeepGEMM JIT compilation (#3239)
    • Optimized quantization kernels used in DeepSeek on Hopper (#3466)
  • Documentation
    • Added an example section for the multi‑node DeepSeek R1 benchmark on GB200 (#3519)
    • Documented disaggregation performance tuning (#3516)
    • Updated the perf‑benchmarking documentation for GPU configuration (#3458)
    • Updated the README and added a benchmarking blog for DeepSeek‑R1 (#3232)
    • Updated the documentation for using Draft‑Target‑Model (DTM) (#3366)
    • Updated the README for disaggregated serving (#3323)
    • Updated instructions to enable FP8 MLA for DeepSeek (#3488)
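
The TLLM_OVERRIDE_LAYER_NUM and TLLM_TRACE_MODEL_FORWARD variables listed under Features are debugging aids. The sketch below only shows setting them before the library is imported; the values are placeholders, and the accepted formats should be confirmed in the TensorRT-LLM source.

```python
# Hedged sketch: enabling the debugging environment variables named in this release.
# The values below are placeholders; confirm the accepted formats in the source tree.
import os

os.environ["TLLM_OVERRIDE_LAYER_NUM"] = "4"   # placeholder: run a reduced number of layers
os.environ["TLLM_TRACE_MODEL_FORWARD"] = "1"  # placeholder: trace the model forward pass

# Import after setting the variables so they take effect at initialization.
import tensorrt_llm  # noqa: E402

print(tensorrt_llm.__version__)
```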

Full change log: 5aeef6d...258ae9c.

TensorRT-LLM Release 0.18.2

16 Apr 06:47 · 5aec7af

Key Features and Enhancements

TensorRT-LLM Release 0.18.1

09 Apr 01:11 · 62f3c95

Key Features and Enhancements

  • The 0.18.x series of releases builds upon the 0.17.0 release, focusing exclusively on dependency updates without incorporating features from the previous 0.18.0.dev pre-releases. These features will be included in future stable releases.

Infrastructure Changes

  • The dependent transformers package version is updated to 4.48.3.