
[cherry-pick][0.9.1] rebase main #1250


Merged: 28 commits merged on Jun 17, 2025
Conversation

@momo609 (Contributor) commented Jun 17, 2025

What this PR does / why we need it?

rebase main

Does this PR introduce any user-facing change?

How was this patch tested?

@Yikun (Collaborator) commented Jun 17, 2025

CI failed due to missing #884

@momo609 changed the title from 091dev to [cherry-pick][0.9.1] rebase main on Jun 17, 2025
22dimensions and others added 21 commits June 17, 2025 17:27
Remove the old quantization model; new models will be added to the test case
later.

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
### What this PR does / why we need it?
Update 0.9.0rc1 contributors info

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
### What this PR does / why we need it?
Make accuracy CI and report work

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually reviewed

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
…t#1152)

1. Add `__init__.py` for vllm_ascend/compilation to make sure it's a
python module
2. Fix a model runner bug to keep it consistent with vllm
3. Add the release note for 0.9.0rc2

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
Make sure the lint test passes before starting the e2e test, to save compute
resources.

Updated the patch doc to make sure the CI works as expected.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
### What this PR does / why we need it?
Improve the assertion on graph mode with MLA.

When running deepseek with graph mode, the fused MLA op only supports
`numHeads / numKvHeads ∈ {32, 64, 128}`, so we improve the assertion
message here to avoid confusing users.
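A minimal sketch of the kind of check described above, using hypothetical helper and constant names (not the actual vllm-ascend code):

```python
# Illustrative only: fail fast with an actionable message when the fused MLA
# op's head-ratio constraint is not met in graph mode.
SUPPORTED_HEAD_RATIOS = (32, 64, 128)

def check_mla_graph_mode(num_heads: int, num_kv_heads: int) -> None:
    ratio = num_heads // num_kv_heads
    assert ratio in SUPPORTED_HEAD_RATIOS, (
        f"Fused MLA in graph mode only supports numHeads / numKvHeads in "
        f"{SUPPORTED_HEAD_RATIOS}, got {ratio}; please adjust the tensor "
        f"parallel size accordingly."
    )
```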

### Does this PR introduce _any_ user-facing change?
Adjusting tp size is required when running deepseek-v3/r1 with graph
mode. deepseek-v2-lite is not supported in graph mode.

### How was this patch tested?
Test locally as the CI machine could not run V3 due to the HBM limits.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
1. rename vllm-ascend/Qwen2.5-0.5B-Instruct-W8A8-new to
vllm-ascend/Qwen2.5-0.5B-Instruct-W8A8

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
…term CI pass (vllm-project#1163)

[CI] Skip test_v1_spec_decode.py::test_ngram_correctness to make
longterm CI pass

Related: vllm-project#1162

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
Contains vllm-project#1111 for completeness.

### What this PR does / why we need it?
Implement multi-stream parallelism for MoE layers with shared experts,
where computation of the shared experts is overlapped with expert token
dispatch and combine. Also, when multi-stream is enabled, the weights of
the shared experts are forced to replicate across all cards, regardless
of any tensor parallelism configuration, to avoid AllReduce operations.

With the expected overlapping being:
```
| shared gate_up | shared act |              | shared down |
|    dispatch    | routed gate_up, act, down |   combine   |
```
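A minimal sketch of this overlap pattern, assuming a torch_npu-style secondary stream (illustrative only, not the actual vllm-ascend fused MoE code):

```python
import torch
import torch_npu  # noqa: F401  # registers the torch.npu namespace

second_stream = torch.npu.Stream()

def moe_forward(hidden_states, shared_experts, routed_experts):
    # Launch shared-expert computation on the secondary stream so it overlaps
    # with the routed experts' dispatch / compute / combine on the main stream.
    second_stream.wait_stream(torch.npu.current_stream())
    with torch.npu.stream(second_stream):
        shared_out = shared_experts(hidden_states)
    routed_out = routed_experts(hidden_states)
    # Re-join the streams before combining the two partial results.
    torch.npu.current_stream().wait_stream(second_stream)
    return shared_out + routed_out
```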


### Does this PR introduce _any_ user-facing change?
No.


### How was this patch tested?
Tested on a 1x16 910 node with a tailored 2-layer DSKv2.

---------

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
### What this PR does / why we need it?
Provide an e2e guide for execute duration profiling.

Signed-off-by: depeng1994 <depengzhang@foxmail.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
### What this PR does / why we need it?
Best performance for single-machine, 16-card deepseek-r1: attention (tp8/dp2) / MoE (etp).

Relies on:
- vllm-ascend commit id: da9acfca6053352730fce75fb772e214755d0341
- vllm commit id: b124e1085b1bf977e3dac96d99ffd9d8ddfdb6cc
- vllm-project#910, [Reduce _npu_flash_attention mask to 128x128 for memory savings]
vllm-project#1100, and [Reduce memory usage by splitting tokens in fused_experts]

---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
…project#1159)

Fix the doc typo in graph_mode.md

Signed-off-by: yzim <43207690+yzim@users.noreply.github.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
…-project#1098)

What this PR does / why we need it?
Enable kvcache_nz for the decode process in torchair graph mode, which
reduces the time consumed by FA in long sequences.

Does this PR introduce any user-facing change?
To enable kvcache_nz, set
`additional_config.torchair_graph_config.enable_kv_nz=True`.
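A hedged example of passing this flag through vllm-ascend's `additional_config` (the key path follows the description above; the model name is a placeholder):

```python
from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",  # placeholder model
    additional_config={
        "torchair_graph_config": {
            "enabled": True,        # torchair graph mode
            "enable_kv_nz": True,   # the flag introduced by this PR
        },
    },
)
```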

How was this patch tested?
1. Tested on a deepseek model: with batch size 64 and seq_len 1k+3k, the
total FA time over 61 layers improves from 20.80 ms to 19.76 ms.
2. Operator precision test:

[aclnnFusedInferAttentionScoreV3_result.csv](https://github.com/user-attachments/files/20664138/aclnnFusedInferAttentionScoreV3_result.csv)
3. TPOT test from @ttanzhiqiang; one curl result is normal:

vllm-project#1098 (comment)

vllm-project#1098 (comment)

---------

Signed-off-by: chenwaner <861645847@qq.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
1. Upgrade vllm to 0.9.1; 0.9.0 is no longer supported on the main branch.
Keep the docs on 0.9.0 until we publish the first 0.9.1 release.
2. Disable the V0 test for PRs.
3. Move the actionlint check to the lint job.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
…uler. (vllm-project#943)

This PR adds support for speculative decoding in AscendScheduler.
It also includes part of the support for disaggregated prefill; full support
will be merged in a follow-up PR.

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
This PR adds custom AscendC kernel vocabparallelembedding support in
vllm-ascend; the related CMakeLists and setuptools changes are also included.

pytest -s benchmarks/ops/ben_vocabparallelembedding.py
pytest -s tests/ops/test_vocabparallelembedding.py

---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
…oject#1099)

### What this PR does / why we need it?
- Add qwen2.5-7b-instruct test
- Add v1 test
---------

Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
### What this PR does / why we need it?
- Add qwen2.5-7b performance benchmark; this is a sub-PR of vllm-project#1099. For the
v1 test, more verification is needed.
- Fix getting the commit time after checkout

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
### What this PR does / why we need it?
Fix a bug in the 1p1d disaggregated_prefill example.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tested by running python find_device_ips.py and the disaggregated_prefill
example.


Signed-off-by: wangyanhui-cmss <wangyanhui_yewu@cmss.chinamobile.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
…#1192)

### What this PR does / why we need it?
Fix the CANN download URL.

### Does this PR introduce _any_ user-facing change?
No, there is no user-facing change.

### How was this patch tested?
Ran the **wget** command and the CANN package is downloaded correctly.

---------

Signed-off-by: wan_danfeng <wonderful199082@126.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
### What this PR does / why we need it?
Move all vector operations to a secondary stream, with the expected
overlapping being:
```
              | q_rmsnorm |                  | kv_norm_rope_cache |       | q_rope |
| matmul W_DQ | matmul W_DKV | index | index |    matmul W_UQ     | split | matmul W_KV_T |
```

Currently, the `IndexByTensor` operators introduced by the computation of
`cos` and `sin` can't be offloaded to the secondary stream due to a
known bug in the graph fusion optimization pass. So we keep them in the
main stream instead, only requiring that they be computed before `matmul W_UQ`
to avoid hindering later overlapping. The problem may be solved by a later
optimization (vllm-project#993), which hoists the computation of `cos` and `sin` up
to the first layer.

### Does this PR introduce _any_ user-facing change?
Controlled by `torchair_graph_config.enable_multistream_mla`, which defaults
to False.
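As with the other torchair graph options, a hedged example of the config fragment (exact plumbing may differ):

```python
# Illustrative additional_config fragment; pass it the same way as the other
# torchair_graph_config flags.
additional_config = {
    "torchair_graph_config": {
        "enabled": True,
        "enable_multistream_mla": True,  # defaults to False
    },
}
```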

### How was this patch tested?
Tested on a 1x16 910 node with a tailored 2-layer DSKv2.

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
whx-sjtu and others added 7 commits June 17, 2025 17:27
…1180)

The last PR ([vllm-project#943](vllm-project#943))
wrongly enabled the AscendScheduler unit tests in the V0 CI; this PR fixes that
problem and only runs them in the V1 CI.

Signed-off-by: whx-sjtu <2952154980@qq.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
…1203)

### What this PR does / why we need it?

Add @jianzs as vLLM Ascend maintainer

@jianzs
----
I would like to nominate Shoujian Zheng (@jianzs
<https://github.com/jianzs>) as a maintainer, starting with my +1.

- He focuses on code quality and good design, with solid reviews in the P/D
disaggregation and DeepSeek improvement areas (about 30+ high-quality reviews), such
as #issuecomment-2811764833, #discussion_r2069927605 and
#pullrequestreview-2820996674. This is the most important reason why I nominated
him, because helping community developers complete PRs with high quality and
continuously ensuring the quality of the codebase is one of the important
responsibilities of a maintainer. We believe he is a great addition.
- Shoujian's main expertise is distributed inference. He has a lot of production
experience with AI infra. He has very good habits, explains all changes in great
detail (#issue-3023082580) and shares results openly
(#issuecomment-2853140443). High-quality PRs: vllm-project#706, vllm-project#774, vllm-project#852.
- Community involvement: actively involved in community discussion, he is
collaborative and helps users solve problems, participating in 30+ PRs and issues,
such as #issuecomment-2911934292 and #issuecomment-2833523571.

Reference:
[1] https://vllm-ascend.readthedocs.io/en/latest/community/contributors.html
[2] https://vllm-ascend.readthedocs.io/en/latest/community/governance.html

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
### What this PR does / why we need it?
Add ut for torchair graph mode on DeepSeekV3

### How was this patch tested?
CI passed with new added test.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
os.environ["VLLM_USE_V1"] must be assigned a str, not any other type.

![image](https://github.com/user-attachments/assets/9d337ae5-00e5-4179-832e-c6c917dd5798)
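For reference, a minimal reproduction of the constraint (plain Python behavior, not vllm-specific):

```python
import os

os.environ["VLLM_USE_V1"] = "1"  # correct: environment values must be str
# os.environ["VLLM_USE_V1"] = 1  # raises TypeError: str expected, not int
```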

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
…ect#884)

1. Remove RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES
2. Add lazy init for vllm_ascend_C (see the sketch below)
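A hedged sketch of the lazy-init pattern in item 2 (module path and helper name are assumptions, not the actual vllm-ascend code):

```python
# Hypothetical lazy loader: defer importing the compiled extension until first
# use, so importing the package itself does not fail on machines where the
# extension is unavailable.
_vllm_ascend_C = None

def _get_vllm_ascend_C():
    global _vllm_ascend_C
    if _vllm_ascend_C is None:
        import vllm_ascend.vllm_ascend_C as ext  # assumed module path
        _vllm_ascend_C = ext
    return _vllm_ascend_C
```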

Signed-off-by: zhuo97 <1103045176@qq.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
…project#1203)"

This reverts commit 70864b6.

Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
@ganyi1996ppo ganyi1996ppo merged commit 030fe89 into vllm-project:v0.9.1-dev Jun 17, 2025
17 checks passed