[cherry-pick][0.9.1] rebase main #1250
Merged
Conversation
CI failed due to missing #884.
Remove the old quantization model; new models will be added to the test cases later. Signed-off-by: 22dimensions <waitingwind@foxmail.com> Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
### What this PR does / why we need it?
Update 0.9.0rc1 contributors info
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
### What this PR does / why we need it?
Make the accuracy CI and report work
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manually reviewed
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
…t#1152)
1. Add `__init__.py` for vllm_ascend/compilation to make sure it's a Python module
2. Fix a model runner bug to keep it consistent with vllm
3. Add the release note for 0.9.0rc2
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
Make sure the lint test passes before starting the e2e test, to save compute resources. Updated the patch doc to make sure the CI works as expected. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
### What this PR does / why we need it?
Improve the assertion on graph mode with MLA. When running DeepSeek in graph mode, the fused MLA op only supports `numHeads / numKvHeads ∈ {32, 64, 128}`, so we improve the assertion message to avoid confusing users.
### Does this PR introduce _any_ user-facing change?
Adjusting the tp size is required when running deepseek-v3/r1 in graph mode. deepseek-v2-lite is not supported in graph mode.
### How was this patch tested?
Tested locally, as the CI machine cannot run V3 due to HBM limits.
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
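A minimal sketch of the kind of early check this assertion message describes (the function and constant names are illustrative, not the actual vllm-ascend code):

```python
# Illustrative sketch only: fail early with an actionable message instead of
# letting the fused MLA op error out deep inside graph mode.
SUPPORTED_HEAD_RATIOS = (32, 64, 128)

def check_mla_graph_mode(num_heads: int, num_kv_heads: int) -> None:
    ratio = num_heads // num_kv_heads
    if ratio not in SUPPORTED_HEAD_RATIOS:
        raise AssertionError(
            f"Torchair graph mode with MLA requires num_heads / num_kv_heads "
            f"to be one of {SUPPORTED_HEAD_RATIOS}, got {ratio} "
            f"({num_heads}/{num_kv_heads}). Adjust the tensor parallel size "
            f"so the ratio falls into the supported set.")
```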
1. rename vllm-ascend/Qwen2.5-0.5B-Instruct-W8A8-new to vllm-ascend/Qwen2.5-0.5B-Instruct-W8A8 Signed-off-by: 22dimensions <waitingwind@foxmail.com> Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
[CI] Skip test_v1_spec_decode.py::test_ngram_correctness to make the long-term CI pass (vllm-project#1163) Related: vllm-project#1162 Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
Contains vllm-project#1111 for completeness.
### What this PR does / why we need it?
Implement multi-stream parallelism for MoE layers with shared experts, where the computation of shared experts is overlapped with expert token dispatch and combine. Also, when multi-stream is enabled, the weights of shared experts are forced to be replicated across all cards, regardless of any tensor parallelism configuration, to avoid AllReduce operations. The expected overlapping is:
```
| shared gate_up | shared act |                             | shared down |
| dispatch       | routed gate_up, act, down   | combine    |
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Tested on a 1x16 910 node, with a tailored 2-layer DSKv2.
Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
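For illustration, a minimal sketch of the multi-stream overlap pattern described above, written against the torch.cuda stream API for brevity (on Ascend the torch_npu equivalents would apply); `shared_experts`, `dispatch`, `routed_experts` and `combine` are hypothetical stand-ins for the real MoE building blocks, not the actual vllm-ascend implementation:

```python
import torch

# Secondary stream dedicated to the shared-expert branch.
secondary = torch.cuda.Stream()

def moe_forward(hidden_states, shared_experts, dispatch, routed_experts, combine):
    # Make the secondary stream wait for hidden_states produced on the
    # default stream, then launch shared-expert compute there so it overlaps
    # with token dispatch/combine running on the default stream.
    secondary.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(secondary):
        shared_out = shared_experts(hidden_states)

    routed_in = dispatch(hidden_states)                # default stream
    routed_out = combine(routed_experts(routed_in))    # default stream

    # Join the two branches: wait for the shared-expert results before summing.
    torch.cuda.current_stream().wait_stream(secondary)
    return routed_out + shared_out
```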
### What this PR does / why we need it?
Provide an e2e guide for execution duration profiling.
Signed-off-by: depeng1994 <depengzhang@foxmail.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
### What this PR does / why we need it?
Single machine, 16 cards, DeepSeek-R1 with attention (tp8/dp2) / MoE (etp). Best performance relies on: vllm-ascend commit id da9acfca6053352730fce75fb772e214755d0341, vllm commit id b124e1085b1bf977e3dac96d99ffd9d8ddfdb6cc + vllm-project#910 + [Reduce _npu_flash_attention mask to 128x128 for memory savings] vllm-project#1100 [Reduce memory usage by splitting tokens in fused_experts]
Signed-off-by: ttanzhiqiang <389825161@qq.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
…project#1159) Fix the doc typo in graph_mode.md Signed-off-by: yzim <43207690+yzim@users.noreply.github.com> Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
…-project#1098)
### What this PR does / why we need it?
Enable kvcache_nz for the decode process in torchair graph mode, which reduces the time consumed by FA in long sequences.
### Does this PR introduce any user-facing change?
To enable kvcache_nz, set additional_config.torchair_graph_config.enable_kv_nz=True.
### How was this patch tested?
1. Tested on a DeepSeek model: with batch size 64 and seq_len 1k+3k, the total FA time across 61 layers improves from 20.80ms to 19.76ms.
2. Operator precision test: [aclnnFusedInferAttentionScoreV3_result.csv](https://github.com/user-attachments/files/20664138/aclnnFusedInferAttentionScoreV3_result.csv)
3. TPOT test from @ttanzhiqiang; one curl result is normal: vllm-project#1098 (comment), vllm-project#1098 (comment)
Signed-off-by: chenwaner <861645847@qq.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
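A hedged usage sketch of the flag named above, assuming the `additional_config` plumbing vllm-ascend exposes for torchair graph options; the model path is a placeholder:

```python
from vllm import LLM

# Sketch only: kvcache_nz takes effect with torchair graph mode enabled.
llm = LLM(
    model="deepseek-ai/DeepSeek-V3",  # placeholder model path
    additional_config={
        "torchair_graph_config": {
            "enabled": True,
            "enable_kv_nz": True,
        },
    },
)
```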
1. Upgrade vllm to 0.9.1; 0.9.0 is no longer supported on the main branch. Keep the docs on 0.9.0 until we publish the first 0.9.1 release.
2. Disable the V0 test for PRs.
3. Move the actionlint check to the lint job.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
…uler. (vllm-project#943) This PR adds support for speculative decoding in AscendScheduler. It also includes partial support for disaggregated prefill; full support will be merged in a follow-up PR. Signed-off-by: whx-sjtu <2952154980@qq.com> Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
This PR adds custom AscendC kernel vocabparallelembedding support to vllm-ascend; the related CMakeLists and setuptools changes are also included. Tested with: pytest -s benchmarks/ops/ben_vocabparallelembedding.py and pytest -s tests/ops/test_vocabparallelembedding.py Signed-off-by: ttanzhiqiang <389825161@qq.com> Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
…oject#1099)
### What this PR does / why we need it?
- Add qwen2.5-7b-instruct test
- Add v1 test
Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
### What this PR does / why we need it?
- Add the qwen2.5-7b performance benchmark; this is a sub-PR of vllm-project#1099. The v1 test needs more verification.
- Fix getting the commit time after checkout.
Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
### What this PR does / why we need it?
Fix a bug in the 1p1d disaggregated_prefill example.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Tested with python find_device_ips.py and by running the disaggregated_prefill example.
Signed-off-by: wangyanhui-cmss <wangyanhui_yewu@cmss.chinamobile.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
…#1192)
### What this PR does / why we need it?
Fix the CANN download URL.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Ran the **wget** command and the CANN package was downloaded correctly.
Signed-off-by: wan_danfeng <wonderful199082@126.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
### What this PR does / why we need it?
Move all vector operations to a secondary stream, with the expected overlapping being:
```
| q_rmsnorm |               | kv_norm_rope_cache |          | q_rope |
| matmul W_DQ | matmul W_DKV | index | index | matmul W_UQ | split | matmul W_KV_T |
```
Currently, the `IndexByTensor` operators introduced by the computation of `cos` and `sin` can't be offloaded to the secondary stream due to a known bug in the graph fusion optimization pass. So we keep them in the main stream instead, only requiring that they be computed before `matmul W_UQ` to avoid hindering later overlapping. The problem may be solved by a later optimization (vllm-project#993), which hoists the computation of `cos` and `sin` up to the first layer.
### Does this PR introduce _any_ user-facing change?
Controlled by `torchair_graph_config.enable_multistream_mla`, which defaults to False.
### How was this patch tested?
Tested on a 1x16 910 node, with a tailored 2-layer DSKv2.
Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
…1180) The previous PR (vllm-project#943) mistakenly enabled the AscendScheduler unit tests in the V0 CI; this PR fixes that and runs them only in the V1 CI. Signed-off-by: whx-sjtu <2952154980@qq.com> Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
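An illustrative sketch of gating a test to the V1 CI lane, in the spirit of the fix above; the environment-variable guard and test name are assumptions, not the actual vllm-ascend test code:

```python
import os
import pytest

# Assumed guard: only run the AscendScheduler test when the V1 engine is selected.
@pytest.mark.skipif(os.environ.get("VLLM_USE_V1") != "1",
                    reason="AscendScheduler tests only run in the V1 CI")
def test_ascend_scheduler_basic():
    ...
```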
…1203)
### What this PR does / why we need it?
Add @jianzs as a vLLM Ascend maintainer.
I would like to nominate Shoujian Zheng (@jianzs <https://github.com/jianzs>) as a maintainer, starting with my +1.
- He focuses on code quality and good design, with solid reviews in the P/D disaggregation and DeepSeek improvement areas (about 30+ high-quality reviews), such as #issuecomment-2811764833, #discussion_r2069927605 and #pullrequestreview-2820996674. This is the most important reason I nominated him: helping community developers complete PRs with high quality and continuously ensuring the quality of the codebase is one of the important responsibilities of a maintainer. We believe he is a great addition.
- Shoujian's main expertise is distributed inference. He has a lot of production experience with AI infra. He has very good habits, explains all changes in great detail (#issue-3023082580) and shares results openly (#issuecomment-2853140443). High-quality PRs: vllm-project#706, vllm-project#774, vllm-project#852.
- Community involvement: actively involved in community discussion; he is collaborative and helps users solve problems, with involvement in 30+ PRs and issues, such as #issuecomment-2911934292 and #issuecomment-2833523571.
Reference:
[1] https://vllm-ascend.readthedocs.io/en/latest/community/contributors.html
[2] https://vllm-ascend.readthedocs.io/en/latest/community/governance.html
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
### What this PR does / why we need it?
Add a unit test for torchair graph mode on DeepSeekV3.
### How was this patch tested?
CI passed with the newly added test.
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
os.environ["VLLM_USE_V1"] must be assigned with str, not other type.  Signed-off-by: 22dimensions <waitingwind@foxmail.com> Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
…ect#884) 1. Remove RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES 2. Add lazy init for vllm_ascend_C Signed-off-by: zhuo97 <1103045176@qq.com> Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
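For illustration, a minimal sketch of lazy initialization for a compiled extension, in the spirit of the vllm_ascend_C change above. The module name follows the PR text, but the loader function itself is an assumption, not the actual vllm-ascend code:

```python
import importlib

_ext = None

def _get_ext():
    """Import the C extension only on first use instead of at package import time."""
    global _ext
    if _ext is None:
        _ext = importlib.import_module("vllm_ascend_C")  # assumed module name
    return _ext
```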
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
…project#1203)" This reverts commit 70864b6. Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
ganyi1996ppo approved these changes on Jun 17, 2025
Labels
ci/build
documentation
module:core
module:ops
module:quantization
module:tests
What this PR does / why we need it?
rebase main
Does this PR introduce any user-facing change?
How was this patch tested?