Description
This issue tracks the workflow for the next release, 0.9.0rc1.
PR merged
Need merge first: #1016
- [CI] Add accuracy test for Qwen2.5-VL-3B-Instruct #766
- [Bugfix][Worker] Clear NPU memory between test profiling #989
- [CI] Re-enable sleep mode test and skip failure breaking CI #990
- [CI] remove old quantization model #1003
- [Patch] Remove spec_decode.metrics patch #1016
- [Misc] Refactor additional_config #1029
- [ModelRunner] Support embedding inputs #916
Need review and merge:
- Dense and multimodal:
- Add qwen2.5 vl multimodal feature for vllm-ascend v1 #736
- [V1][Structured Output] Enable Speculative Decoding with Structured Outputs #751
- DeepSeek:
- optimize the function of computing topk and topp in sampler. #970
- MLA layer eliminates redundant index operators #993
- feat: support data parallel for deepseek #1012
- [MTP][V1] Adapt mtp with graph mode in v1. #1023
- [Bugfix] Fix deepseek precision issue and add acc ci for it #905
- [Performance] Add EPLB expert map import capabilities #919
- [perf]: support dual-batch overlap(dbo) for deepseek #941
- RL:
- [Kernel] Remove cumsum in groupedmatmul #987
- Support multistream of shared experts in FusedMoE #997
- [CI]Moe alltoall communication optimization #1067
- [ModelRunner]Add profile execute duration observation #1013
Pending (will NOT be included in 0.9.0rc1):
- [perf] Improve Prefill Performance by Optimizing Alltoall Communication #978
- [ModelRunner][MultiModal] Automatically cast multi-modal input dtype #1002
- Revert the modifications of cache engine for npu graph mode #875
- [Feature] Synchronize vLLM mrope modifications, support Qwen2.5-Omni-7B Thinker. #973 (needs V1 support)
- [Platform] Add support for Ascend 310P #914 (needs a new public torch_npu release)
Requirement
- PTA + CANN upgrade
Functional Test (V1)
- Qwen3/Qwen2.5: aclgraph + Qwen2.5/Qwen3 @MengqingCao
- Qwen2.5 VL: eager mode @shen-shanshan @ChenTaoyu-SJTU
- DeepSeek: torchair + deepseek @zzzzwwjj
- Quantization (w8a8) + Qwen2.5/Qwen3/DeepSeek @22dimensions
- ModelScope download
- w8a8 e2e test: refresh to the new test
- doc: User Guide - quantization
- Spec decode + MTP @mengwei805
- Performance @Potabk
- V0 Qwen3
- V0 Qwen2.5
- V0 Qwen2.5 VL
- V1 Qwen3 (needs bugfix)
- V1 Qwen2.5 (needs bugfix)
- V1 Qwen2.5 VL
- Accuracy @zhangxinyuehfad
-
- dp+eager
- dp+torchair
- EP @wangxiyuan
- EPLB
- PD @wangxiyuan
- 1P1D E2E - simple connector
- 1P1D E2E - pyhccl
- RL-related @leo-pony
- sleep mode
- VL pad @shen-shanshan
- (V0) Input embedding @Potabk
- (V1) Structured output @shen-shanshan
- (V1) AscendScheduler @MengqingCao
- (V1) Scheduler CP/APC @wangxiyuan
Documentation
- additional config
- environment
- release note
- graph mode
Release
- tag, binary, image, PyPI