v0.21.0rc0

Pre-release
@Shixiaowei02 released this 04 Jun 02:48
· 200 commits to main since this release
9ae2ce6

Highlights

  • Model Support
  • Features
    • Support for large-scale EP (#4384, #4495, #4615)
    • Added chunked attention kernels (#4291, #4394)
    • ScaffoldingLLM now supports MCP (#4410)
    • Integrated NIXL into the communication layer of the disaggregated service (#3934, #4125)
    • Integrated Hopper chunked attention kernels (#4330)
    • Enabled TRT backend for Python runtime in disaggregated service (#4243)
    • Added FP8 block-scale GEMM support on SM89 (#4481)
    • Added Qwen3 FP4 MoE TRTLLM backend for low latency (#4530)
    • Introduced sliding-window attention kernels for the generation phase on Blackwell (#4564)
    • Added vanilla MoE (#4682)
    • Integrated fused QKNorm + RoPE (#4611)
    • Added Fabric Memory support for KV cache transfer (#4717)
  • API
  • Bug Fixes
    • Resolved Torch compile issue for DeepSeek V3 (#3952)
    • Fixed trtllm-llmapi-launch for single-node, single-GPU setups (#4428)
    • Removed duplicate tokenization in generation server (#4492)
    • Fixed cancel request handling for attention DP (#4648)
    • Fixed disaggregated service hang when MNNVL two-shot AllReduce is enabled (#4678)
    • Fixed queued request statistics (#4806)
    • Fixed EP load balancer with MTP layer and route offset by EP rank (#4767)
    • Resolved accuracy and illegal memory access issues with MTP + attention DP (#4379)
  • Benchmark
    • Added all_reduce.py benchmark script for testing (#4537)
  • Performance
  • Infrastructure
    • Integrated NGC image into Makefile automation and documentation (#4400)
    • Built Triton for ARM architecture (#4456)
    • Added triton release container (#4455)
    • Refactored Docker build image (Groovy) and added NGC image support (#4294)
    • Upgraded Cutlass to version 4.0 (#4794)
  • Documentation
    • Updated descriptions for NGC Docker images (#4702, #4705)
  • Known Issues
    • Two important fixes are NOT included in this release but are already on the main branch:
      • Fixed a bug when setting attention_chunk_size, and enabled chunked attention in the generation phase by default (#4693)
      • Fixed an LLMAPI benchmark failure caused by a serialization issue (#4835)
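This release adds sliding-window attention kernels for the generation phase on Blackwell (#4564) and chunked attention kernels (#4291, #4394). As a language-agnostic illustration of the masking idea behind sliding-window attention (not the release's kernel code or API; the function name below is hypothetical), each query token attends only to itself and the previous `window - 1` key positions:

```python
# Illustrative sketch only, not TensorRT-LLM code: a causal sliding-window
# attention mask. mask[q][k] is True when query position q may attend to
# key position k, i.e. k is causal (k <= q) and within the window.

def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Boolean seq_len x seq_len mask for causal sliding-window attention."""
    return [
        [q - window < k <= q for k in range(seq_len)]
        for q in range(seq_len)
    ]

mask = sliding_window_mask(seq_len=5, window=3)
# Query at position 4 attends only to keys 2, 3, and 4.
assert [k for k in range(5) if mask[4][k]] == [2, 3, 4]
```

The real kernels apply this restriction inside the fused attention computation rather than materializing a mask, which is what makes a dedicated generation-phase kernel worthwhile.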

What's Changed

New Contributors

Full Changelog: v0.20.0rc3...v0.21.0rc0