
[cherry-pick][0.9.1] rebase main #1250


Merged: 28 commits merged on Jun 17, 2025
Conversation

@momo609 (Contributor) commented Jun 17, 2025

What this PR does / why we need it?

rebase main

Does this PR introduce any user-facing change?

How was this patch tested?

@Yikun (Collaborator) commented Jun 17, 2025

CI failed due to missing #884

@momo609 changed the title from 091dev to [cherry-pick][0.9.1] rebase main on Jun 17, 2025
22dimensions and others added 21 commits June 17, 2025 17:27
Remove the old quantization model; new models will be added to the test case
later.

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
### What this PR does / why we need it?
Update 0.9.0rc1 contributors info

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
### What this PR does / why we need it?
Make accuracy CI and report work

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually reviewed

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
…t#1152)

1. Add `__init__.py` for vllm_ascend/compilation to make sure it's a
python module
2. Fix a model runner bug to keep it consistent with vllm
3. Add the release note for 0.9.0rc2

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
Make sure the lint test passes before starting the e2e test, to save compute
resources.

Updated the patch doc to make sure the CI works as expected.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
### What this PR does / why we need it?
Improve the assertion on graph mode with MLA.

When running deepseek with graph mode, the fused MLA op only supports
`numHeads / numKvHeads ∈ {32, 64, 128}`, so we improve the assertion
message here to avoid confusing users.
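A minimal sketch of the kind of check described above, using hypothetical helper and constant names (not the actual vllm-ascend code):

```python
# Illustrative only: fail fast with an actionable message when the fused MLA
# op's head-ratio constraint is not met in graph mode.
SUPPORTED_HEAD_RATIOS = (32, 64, 128)

def check_mla_graph_mode(num_heads: int, num_kv_heads: int) -> None:
    ratio = num_heads // num_kv_heads
    assert ratio in SUPPORTED_HEAD_RATIOS, (
        f"Fused MLA in graph mode only supports numHeads / numKvHeads in "
        f"{SUPPORTED_HEAD_RATIOS}, got {ratio}; please adjust the tensor "
        f"parallel size accordingly."
    )
```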

### Does this PR introduce _any_ user-facing change?
Adjusting tp size is required when running deepseek-v3/r1 with graph
mode. deepseek-v2-lite is not supported in graph mode.

### How was this patch tested?
Test locally as the CI machine could not run V3 due to the HBM limits.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
1. rename vllm-ascend/Qwen2.5-0.5B-Instruct-W8A8-new to
vllm-ascend/Qwen2.5-0.5B-Instruct-W8A8

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
…term CI pass (vllm-project#1163)

[CI] Skip test_v1_spec_decode.py::test_ngram_correctness to make
longterm CI pass

Related: vllm-project#1162

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
Contains vllm-project#1111 for completeness.

### What this PR does / why we need it?
Implement multi-stream parallelism for MoE layers with shared experts,
where computation of the shared experts is overlapped with expert token
dispatch and combine. Also, when multi-stream is enabled, the weights of
the shared experts are forced to replicate across all cards, regardless
of any tensor parallelism configuration, to avoid AllReduce operations.

With the expected overlapping being:
```
| shared gate_up | shared act |              | shared down |
|    dispatch    | routed gate_up, act, down |   combine   |
```
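A minimal sketch of this overlap pattern, assuming a torch_npu-style secondary stream (illustrative only, not the actual vllm-ascend fused MoE code):

```python
import torch
import torch_npu  # noqa: F401  # registers the torch.npu namespace

second_stream = torch.npu.Stream()

def moe_forward(hidden_states, shared_experts, routed_experts):
    # Launch shared-expert computation on the secondary stream so it overlaps
    # with the routed experts' dispatch / compute / combine on the main stream.
    second_stream.wait_stream(torch.npu.current_stream())
    with torch.npu.stream(second_stream):
        shared_out = shared_experts(hidden_states)
    routed_out = routed_experts(hidden_states)
    # Re-join the streams before combining the two partial results.
    torch.npu.current_stream().wait_stream(second_stream)
    return shared_out + routed_out
```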


### Does this PR introduce _any_ user-facing change?
No.


### How was this patch tested?
Tested on a 1x16 910 node with a tailored 2-layer DSKv2.

---------

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
### What this PR does / why we need it?
Provide an e2e guide for execute duration profiling.

Signed-off-by: depeng1994 <depengzhang@foxmail.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
### What this PR does / why we need it?
Best performance for single-machine, 16-card deepseek-r1: attention (tp8/dp2) / MoE (etp).

Relies on:
- vllm-ascend commit id: da9acfca6053352730fce75fb772e214755d0341
- vllm commit id: b124e1085b1bf977e3dac96d99ffd9d8ddfdb6cc
- vllm-project#910, [Reduce _npu_flash_attention mask to 128x128 for memory savings]
vllm-project#1100, and [Reduce memory usage by splitting tokens in fused_experts]

---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
…project#1159)

Fix the doc typo in graph_mode.md

Signed-off-by: yzim <43207690+yzim@users.noreply.github.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
…-project#1098)

What this PR does / why we need it?
Enable kvcache_nz for the decode process in torchair graph mode, which
reduces the time consumed by FA in long sequences.

Does this PR introduce any user-facing change?
To enable kvcache_nz, set
`additional_config.torchair_graph_config.enable_kv_nz=True`.
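A hedged example of passing this flag through vllm-ascend's `additional_config` (the key path follows the description above; the model name is a placeholder):

```python
from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",  # placeholder model
    additional_config={
        "torchair_graph_config": {
            "enabled": True,        # torchair graph mode
            "enable_kv_nz": True,   # the flag introduced by this PR
        },
    },
)
```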

How was this patch tested?
1. Tested on a deepseek model: with batch size 64 and seq_len 1k+3k, the
total FA time over 61 layers improves from 20.80 ms to 19.76 ms.
2. Operator precision test:

[aclnnFusedInferAttentionScoreV3_result.csv](https://github.com/user-attachments/files/20664138/aclnnFusedInferAttentionScoreV3_result.csv)
3. TPOT test from @ttanzhiqiang; one curl result is normal:

vllm-project#1098 (comment)

vllm-project#1098 (comment)

---------

Signed-off-by: chenwaner <861645847@qq.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
1. Upgrade vllm to 0.9.1; 0.9.0 is no longer supported on the main branch.
Keep the docs on 0.9.0 until we publish the first 0.9.1 release.
2. Disable the V0 test for PRs.
3. Move the actionlint check to the lint job.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
…uler. (vllm-project#943)

This PR adds support for speculative decoding in AscendScheduler.
It also includes part of the support for disaggregated prefill; full support
will be merged in a follow-up PR.

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
This PR adds custom AscendC kernel vocabparallelembedding support in
vllm-ascend; the related CMakeLists and setuptools changes are also included.

pytest -s benchmarks/ops/ben_vocabparallelembedding.py
pytest -s tests/ops/test_vocabparallelembedding.py

---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
…oject#1099)

### What this PR does / why we need it?
- Add qwen2.5-7b-instruct test
- Add v1 test
---------

Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
### What this PR does / why we need it?
- Add qwen2.5-7b performance benchmark; this is a sub-PR of vllm-project#1099. For the
v1 test, more verification is needed.
- Fix getting the commit time after checkout

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
### What this PR does / why we need it?
Fix a bug in the 1p1d disaggregated_prefill example.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tested by running python find_device_ips.py and the disaggregated_prefill
example.


Signed-off-by: wangyanhui-cmss <wangyanhui_yewu@cmss.chinamobile.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
…#1192)

### What this PR does / why we need it?
Fix the CANN download URL.

### Does this PR introduce _any_ user-facing change?
No, there is no user-facing change.

### How was this patch tested?
Ran the **wget** command and the CANN package is downloaded correctly.

---------

Signed-off-by: wan_danfeng <wonderful199082@126.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
### What this PR does / why we need it?
Move all vector operations to a secondary stream, with the expected
overlapping being:
```
              | q_rmsnorm |                  | kv_norm_rope_cache |       | q_rope |
| matmul W_DQ | matmul W_DKV | index | index |    matmul W_UQ     | split | matmul W_KV_T |
```

Currently, the `IndexByTensor` operators introduced by the computation of
`cos` and `sin` can't be offloaded to the secondary stream due to a
known bug in the graph fusion optimization pass. So we keep them in the
main stream instead, only requiring that they be computed before `matmul W_UQ`
to avoid hindering later overlapping. The problem may be solved by a later
optimization (vllm-project#993), which hoists the computation of `cos` and `sin` up
to the first layer.

### Does this PR introduce _any_ user-facing change?
Controlled by `torchair_graph_config.enable_multistream_mla`, which defaults
to False.
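As with the other torchair graph options, a hedged example of the config fragment (exact plumbing may differ):

```python
# Illustrative additional_config fragment; pass it the same way as the other
# torchair_graph_config flags.
additional_config = {
    "torchair_graph_config": {
        "enabled": True,
        "enable_multistream_mla": True,  # defaults to False
    },
}
```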

### How was this patch tested?
Tested on a 1x16 910 node with a tailored 2-layer DSKv2.

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
whx-sjtu and others added 7 commits June 17, 2025 17:27
…1180)

The last PR ([vllm-project#943](vllm-project#943))
wrongly enabled the AscendScheduler unit tests in the V0 CI; this PR fixes that
problem and only runs them in the V1 CI.

Signed-off-by: whx-sjtu <2952154980@qq.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
…1203)

### What this PR does / why we need it?

Add @jianzs as vLLM Ascend maintainer

@jianzs
----
I would like to nominate Shoujian Zheng (@jianzs
<https://github.com/jianzs>) as a maintainer, starting with my +1.

- He focuses on code quality and good design, with solid reviews in the P/D
disaggregation and DeepSeek improvement areas (about 30+ high-quality reviews), such
as #issuecomment-2811764833, #discussion_r2069927605 and
#pullrequestreview-2820996674. This is the most important reason why I nominated
him, because helping community developers complete PRs with high quality and
continuously ensuring the quality of the codebase is one of the important
responsibilities of a maintainer. We believe he is a great addition.
- Shoujian's main expertise is distributed inference. He has a lot of production
experience with AI infra. He has very good habits, explains all changes in great
detail (#issue-3023082580) and shares results openly
(#issuecomment-2853140443). High-quality PRs: vllm-project#706, vllm-project#774, vllm-project#852.
- Community involvement: actively involved in community discussion, he is
collaborative and helps users solve problems, participating in 30+ PRs and issues,
such as #issuecomment-2911934292 and #issuecomment-2833523571.

Reference:
[1] https://vllm-ascend.readthedocs.io/en/latest/community/contributors.html
[2] https://vllm-ascend.readthedocs.io/en/latest/community/governance.html

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
### What this PR does / why we need it?
Add ut for torchair graph mode on DeepSeekV3

### How was this patch tested?
CI passed with new added test.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
os.environ["VLLM_USE_V1"] must be assigned a str, not any other type.

![image](https://github.com/user-attachments/assets/9d337ae5-00e5-4179-832e-c6c917dd5798)
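For reference, a minimal reproduction of the constraint (plain Python behavior, not vllm-specific):

```python
import os

os.environ["VLLM_USE_V1"] = "1"  # correct: environment values must be str
# os.environ["VLLM_USE_V1"] = 1  # raises TypeError: str expected, not int
```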

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
…ect#884)

1. Remove RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES
2. Add lazy init for vllm_ascend_C (see the sketch below)
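A hedged sketch of the lazy-init pattern in item 2 (module path and helper name are assumptions, not the actual vllm-ascend code):

```python
# Hypothetical lazy loader: defer importing the compiled extension until first
# use, so importing the package itself does not fail on machines where the
# extension is unavailable.
_vllm_ascend_C = None

def _get_vllm_ascend_C():
    global _vllm_ascend_C
    if _vllm_ascend_C is None:
        import vllm_ascend.vllm_ascend_C as ext  # assumed module path
        _vllm_ascend_C = ext
    return _vllm_ascend_C
```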

Signed-off-by: zhuo97 <1103045176@qq.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
…project#1203)"

This reverts commit 70864b6.

Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
@ganyi1996ppo ganyi1996ppo merged commit 030fe89 into vllm-project:v0.9.1-dev Jun 17, 2025
17 checks passed