New 091 #1658

Closed

wants to merge 236 commits into from

Conversation

shiyuan680

What this PR does / why we need it?

Does this PR introduce any user-facing change?

How was this patch tested?

22dimensions and others added 30 commits June 10, 2025 10:07
Remove the old quantization model; new models will be added to the test cases later.

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
### What this PR does / why we need it?
Update 0.9.0rc1 contributors info

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
Make accuracy CI and report work

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually reviewed

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
…t#1152)

1. Add `__init__.py` for vllm_ascend/compilation to make sure it's a
Python module
2. Fix a model runner bug to keep it consistent with vllm
3. Add release note for 0.9.0rc2

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Make sure the lint test passes before starting the e2e test, to save compute
resources.

Updated the patch doc to make sure the CI works as expected.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
add eplb policy and updator
implementation of VllmEplbAdaptor and D2DExpertWeightLoader
determine num_dense_layers and num_moe_layers by referring to model co…
…lm-project#1160)

### What this PR does / why we need it?
The former PR vllm-project#736
selects the valid tokens inside `input_ids` and `position_ids`, which breaks
the necessary padding required by torchair. In this PR, we move the padding
logic to after the multimodal part.


Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
### What this PR does / why we need it?
Improve assertion on Graph mode with MLA.

When running DeepSeek in graph mode, the fused MLA op only supports
`numHeads / numKvHeads ∈ {32, 64, 128}`, so we improve the assertion
message here to avoid confusing users.
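For illustration, a minimal sketch of the kind of check being improved; the helper name, wording, and signature are assumptions, not the actual vllm-ascend code:

```python
# Hypothetical helper illustrating the head-ratio constraint in graph mode.
SUPPORTED_HEAD_RATIOS = (32, 64, 128)

def assert_mla_graph_mode_supported(num_heads: int, num_kv_heads: int, tp_size: int) -> None:
    ratio = num_heads // num_kv_heads
    assert ratio in SUPPORTED_HEAD_RATIOS, (
        f"Fused MLA in graph mode requires numHeads / numKvHeads to be one of "
        f"{SUPPORTED_HEAD_RATIOS}, but got {ratio} with tp_size={tp_size}. "
        "Please adjust the tensor parallel size."
    )
```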

### Does this PR introduce _any_ user-facing change?
Adjusting the tp size is required when running deepseek-v3/r1 in graph
mode. deepseek-v2-lite is not supported in graph mode.

### How was this patch tested?
Test locally as the CI machine could not run V3 due to the HBM limits.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
1. rename vllm-ascend/Qwen2.5-0.5B-Instruct-W8A8-new to
vllm-ascend/Qwen2.5-0.5B-Instruct-W8A8

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
…term CI pass (vllm-project#1163)

[CI] Skip test_v1_spec_decode.py::test_ngram_correctness to make
longterm CI pass

Related: vllm-project#1162

Signed-off-by: MengqingCao <cmq0113@163.com>
Contains vllm-project#1111 for completeness.

### What this PR does / why we need it?
Implement multi-stream parallelism for MoE layers with shared experts,
where the computation of shared experts is overlapped with expert token
dispatch and combine. Also, when multi-stream is enabled, the weights of
shared experts are forced to be replicated across all cards, regardless
of any tensor parallelism configuration, to avoid AllReduce operations.

With the expected overlapping being:
```
| shared gate_up | shared act |              | shared down |
|    dispatch    | routed gate_up, act, down |   combine   |
```
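A minimal sketch of this kind of multi-stream overlap, assuming the torch_npu stream API (which mirrors torch.cuda); the callables and the wrapper function are placeholders, not the PR's actual implementation:

```python
import torch
import torch_npu  # noqa: F401  # registers the torch.npu namespace

# Secondary stream dedicated to the shared-expert computation.
_shared_expert_stream = torch.npu.Stream()

def moe_forward_with_overlap(hidden_states, router_logits,
                             shared_experts, dispatch, routed_experts, combine):
    # Launch shared-expert GEMMs on the secondary stream so they overlap
    # with token dispatch running on the default stream.
    _shared_expert_stream.wait_stream(torch.npu.current_stream())
    with torch.npu.stream(_shared_expert_stream):
        shared_out = shared_experts(hidden_states)

    # Default stream: all-to-all dispatch, routed experts, combine.
    dispatched = dispatch(hidden_states, router_logits)
    routed_out = routed_experts(dispatched)
    combined = combine(routed_out)

    # Join the two streams before mixing routed and shared outputs.
    torch.npu.current_stream().wait_stream(_shared_expert_stream)
    return combined + shared_out
```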


### Does this PR introduce _any_ user-facing change?
No.


### How was this patch tested?
Tested on a 1x16 910 node, with a tailored 2-layer DSKv2.

---------

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
### What this PR does / why we need it?
provide an e2e guide for execute duration profiling


Signed-off-by: depeng1994 <depengzhang@foxmail.com>
### What this PR does / why we need it?
Best performance for single-machine, 16-card deepseekr1: attention with tp8/dp2, MoE with etp.

Relies on:
- vllm-ascend commit id: da9acfca6053352730fce75fb772e214755d0341
- vllm commit id: b124e1085b1bf977e3dac96d99ffd9d8ddfdb6cc
- vllm-project#910
- [Reduce _npu_flash_attention mask to 128x128 for memory savings] vllm-project#1100
- [Reduce memory usage by splitting tokens in fused_experts]


---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
…project#1159)

Fix the doc typo in graph_mode.md

Signed-off-by: yzim <43207690+yzim@users.noreply.github.com>
…-project#1098)

What this PR does / why we need it?
Enable kvcache_nz for the decode process in torchair graph mode, which
reduces the time consumed by FA in long sequences.

Does this PR introduce any user-facing change?
To enable kvcache_nz, set
additional_config.torchair_graph_config.enable_kv_nz=True.
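As a usage illustration, the flag can be passed through additional_config when constructing the engine. A minimal sketch assuming the offline LLM API; the model name and the "enabled" key are illustrative assumptions, and only torchair_graph_config.enable_kv_nz comes from this PR:

```python
from vllm import LLM

# Sketch only: model path is illustrative; enable_kv_nz is the switch added here.
llm = LLM(
    model="deepseek-ai/DeepSeek-R1",
    additional_config={
        "torchair_graph_config": {
            "enabled": True,        # run in torchair graph mode (assumed key)
            "enable_kv_nz": True,   # store decode KV cache in NZ layout
        },
    },
)
```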

How was this patch tested?
1. Tested with the DeepSeek model: with batch size 64 and seq_len 1k+3k, the
total FA time across 61 layers improves from 20.80 ms to 19.76 ms.
2. Operator precision test:

[aclnnFusedInferAttentionScoreV3_result.csv](https://github.com/user-attachments/files/20664138/aclnnFusedInferAttentionScoreV3_result.csv)
3. TPOT test from @ttanzhiqiang; one curl result is normal.

vllm-project#1098 (comment)

vllm-project#1098 (comment)

---------

Signed-off-by: chenwaner <861645847@qq.com>
1. Upgrade vllm to 0.9.1; 0.9.0 is no longer supported on the main branch.
Keep the docs on 0.9.0 until we publish the first 0.9.1 release.
2. Disable the V0 test for PRs.
3. Move the actionlint check to the lint job.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
fix bugs in fused_experts_with_all2all
Copy link

github-actions bot commented Jul 8, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

shiyuan680 closed this Jul 12, 2025