v0.21.0rc1

Pre-release
Released by @nv-guomingz on 11 Jun 05:27 · 106 commits to main since this release · 9c012d5

Highlights

  • Model Support
    • Add HyperCLOVAX-SEED-Vision support for PyTorch flow (#4799)
  • Features
    • Support generation logits in the TRTLLM Sampler (#4819) (see the sketch after this list)
    • Support large-scale EP (#4818)
    • Support XQA-based MLA on SM120 (#4858)
    • Add PositionEmbeddingType=0 to XQA support (#4934)
    • Add cache-reuse support (selective cache transfer) to the MLA cache formatter (#4749)
    • Update DeepSeek FP8 TRT-LLM Gen cubins (#4643)
    • Add heuristics for checkpoint files prefetching (#4765)
    • Enable NVFP4 output for TRTLLM attention kernels (#4737)
    • Refactor Fused MoE (#4790)
    • Add integration of etcd (#3738)
    • Memoize weight shuffle index to speed up weight preproc in moe_backend=TRTLLM (#4826)
    • Enable disaggregated serving for Qwen3 (#4929)
  • API
    • Set _AutoDeployLlmArgs as primary config object (#4891)
  • Bug Fixes
    • Fix out-of-range batch sizes in the warmup phase (#4986)
    • Fix buffer count (#5007)
    • Fix broken test_resource_manager.py (nvbug 5324252) (#4925)
    • Fix the two-model speculative decoding flow (nvbug 5280806) (#4807)
    • Fix broken test_pytorch_model_engine.py (nvbug 5324248) (#4973)
    • Fix CUDA graph padding for speculative decoding (#4853)
    • Correct the ordering of LLM request states (#4781)
    • Handle OOMs during KV cache estimation (#4690)
    • Only pass fast_build=true to non-PyTorch backends (#4920)
    • Fix a hang in the no-fusion all-reduce path (#4594)
    • Deprecate AutoDeploy CI post-merge tests and keep them for local testing (#4892)
    • Fix failing test_trtllm_bench_llmapi_launch (nvbug 5302895) (#4835)
    • Fix a Llama 4 long-context issue (#4809)
    • Fix the setting of attention_chunk_size and enable chunked attention in the generation phase by default (nvbug 5300080) (#4693)
    • Fix queued request stats (nvbug 5294316) (#4714)
    • Fix max_num_sequences calculation with overlap scheduling (#4532)
    • Fix trtllm-bench hang issue due to LLM API IPC (#4798)
    • Fix an accuracy issue with PD (disaggregated prefill/decode) + MTP (#4536)
  • Benchmark
    • Add beam width to the low-latency benchmark (#4812) (see the beam-search sketch after this list)
    • Fix trtllm-bench iter_stats and cuda_graph_batch_sizes errors (#4827)
  • Performance
  • Infrastructure
    • The TRT-LLM team now formally releases a Docker image on NGC.
    • Update jnlp version in container image (#4944)
    • Upgrade ModelOpt to 0.31.0 (#5003)
    • Upgrade FlashInfer to 0.2.5 (#5004)
  • Documentation
    • Document the Docker release image on NGC (#4705)
    • Fix the README for disaggregated serving (#4846)
    • Fix draft target README and set exclude_input_in_output to False (#4882)
    • Blog: Scaling Expert Parallelism in TensorRT-LLM (Part 1: Design and Implementation of Large-scale EP) (#4958)
  • Known Issues
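
For the generation-logits support in the TRTLLM Sampler (#4819), here is a minimal sketch of how logits can be requested through the LLM API. The parameter and field names (`return_generation_logits`, `generation_logits`) are assumptions based on the existing `SamplingParams`/`CompletionOutput` surface, not something this release note confirms; check them against your installed version.

```python
# Hedged sketch: requesting per-token generation logits through the LLM API.
# `return_generation_logits` and `generation_logits` are assumed names taken
# from the existing SamplingParams/CompletionOutput surface.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # any supported HF model

params = SamplingParams(
    max_tokens=8,
    return_generation_logits=True,  # ask the sampler to return logits for each generated token
)

for output in llm.generate(["The capital of France is"], params):
    completion = output.outputs[0]
    print(completion.text)
    # Assumed layout: [num_generated_tokens, vocab_size]
    print(completion.generation_logits.shape)
```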
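Beam width in the low-latency benchmark (#4812) is a trtllm-bench CLI knob; for orientation, the equivalent LLM API control is beam search in `SamplingParams`. A minimal sketch, assuming `use_beam_search`, `best_of`, and `n` behave as in the current LLM API:

```python
# Hedged sketch: beam search via the LLM API. `use_beam_search`, `best_of`,
# and `n` are assumed to behave as in the current SamplingParams; the
# trtllm-bench flag added by #4812 is a separate, CLI-level setting.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

params = SamplingParams(
    max_tokens=16,
    use_beam_search=True,  # beam width is derived from best_of (assumed)
    best_of=4,             # keep 4 beams alive during decoding
    n=2,                   # return the top 2 finished beams
)

for output in llm.generate(["Write a haiku about GPUs:"], params):
    for i, beam in enumerate(output.outputs):
        print(f"beam {i}: {beam.text}")
```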

What's Changed

New Contributors

Full Changelog: v0.21.0rc0...v0.21.0rc1