Support multistream of shared experts in FusedMoE #997
Conversation
Force-pushed from 6ed1be1 to db76771
You can run
Force-pushed from d6286aa to 52ba3d3
Done, with mypy error code
@sdmyzlp Can you upload the profiling graph for this part, so we can have a more intuitive perspective on this PR?
vllm_ascend/ops/fused_moe.py
Outdated
```
@@ -83,11 +85,20 @@ def fused_experts_with_mc2(
        }
        kwargs.update(stage1_kwargs)

        if shared_experts is not None:
```
Have you tried launching this after the dispatch op? If dispatch blocks long enough on the first stream, launching this afterwards seems like it would better overlap the execution on the host side, right?
Fixed, with all secondary-stream operations now launched after the corresponding main-stream operation.
By the way, this patch only implements multi-stream shared experts for graph-mode decode; operations will still be executed sequentially otherwise. One may extend `npu_switch_stream` / `npu_wait_tensor` in the future to support eager-mode multi-stream functionality.
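For reference, a minimal sketch of the launch ordering discussed above, assuming torch_npu's CUDA-like stream API (`torch.npu.Stream` / `torch.npu.Event`). The helper names `dispatch`, `routed_mlp`, and `combine` are placeholders; the PR itself uses its own `npu_switch_stream` / `npu_wait_tensor` helpers and only enables this path in graph-mode decode:

```python
import torch
import torch_npu  # noqa: F401  # registers the CUDA-like torch.npu stream API

# Secondary stream reserved for shared-expert computation (illustrative).
_SHARED_EXPERTS_STREAM = torch.npu.Stream()


def mc2_with_overlapped_shared_experts(dispatch, routed_mlp, combine,
                                       shared_experts, hidden_states):
    """Launch shared experts on a secondary stream *after* the dispatch op has
    been issued on the main stream, so that on the device they overlap the
    dispatch/combine communication instead of delaying its host-side launch."""
    main_stream = torch.npu.current_stream()

    # Mark the point where hidden_states is ready, before dispatch is issued,
    # so the secondary stream waits only for hidden_states and not for the
    # dispatch itself (the PR expresses this dependency via npu_wait_tensor).
    hidden_ready = torch.npu.Event()
    hidden_ready.record()

    dispatched = dispatch(hidden_states)              # main stream

    with torch.npu.stream(_SHARED_EXPERTS_STREAM):    # secondary stream
        _SHARED_EXPERTS_STREAM.wait_event(hidden_ready)
        shared_output = shared_experts(hidden_states)

    routed_output = combine(routed_mlp(dispatched))   # main stream

    # The main stream must not consume shared_output before it is ready.
    main_stream.wait_stream(_SHARED_EXPERTS_STREAM)
    return routed_output, shared_output
```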
Force-pushed from e2542aa to 7b28633
Described the expected overlap using ASCII art in the commit message; I have trouble uploading a screenshot of the profiling.
Force-pushed from ec0553e to 1ef0f68
Force-pushed from 41af4d9 to e4fe832
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Force-pushed from dbfdfe8 to fb0cdf2
This pull request has conflicts, please resolve those before we can evaluate the pull request.
AscendW8A8DynamicLinearMethod is integrated into CustomDeepseekV2MLP in a very awkward way, causing quantization operations to be scattered all over the model scripts. Refactor to solve this problem. Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
The model chosen is vllm-ascend/DeepSeek-V2-Lite-W8A8. Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
1. Concentrate the usage of `enable_multistream_moe` in one single place, and sink the computation of shared experts into `self.experts()` when multistream MoE is enabled, regardless of decode or prefill. 2. Move the computation of shared experts out of `apply_mlp`. Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
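A rough sketch of the control flow this commit describes; the class and argument names below (`DeepseekV2MoEBlock`, the `shared_experts=` keyword, etc.) are illustrative stand-ins, not the actual vllm-ascend signatures:

```python
import torch


class DeepseekV2MoEBlock(torch.nn.Module):  # illustrative, not the real class
    def __init__(self, experts, shared_experts, enable_multistream_moe: bool):
        super().__init__()
        self.experts = experts                # routed FusedMoE layer
        self.shared_experts = shared_experts  # dense shared-expert MLP
        self.enable_multistream_moe = enable_multistream_moe

    def forward(self, hidden_states, router_logits):
        if self.enable_multistream_moe:
            # Shared experts are sunk into self.experts(), which can overlap
            # them with dispatch/combine, for both decode and prefill.
            routed_out, shared_out = self.experts(
                hidden_states, router_logits,
                shared_experts=self.shared_experts)
        else:
            # Otherwise shared experts run here, outside of apply_mlp.
            shared_out = self.shared_experts(hidden_states)
            routed_out = self.experts(hidden_states, router_logits)
        return routed_out + shared_out
```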
Helps to unify the paths where multistream is turned on or off. Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
As the replicated version of MergedColumnParallelLinear, aiming to remove the TP communication of DeepSeek-V2's `gate_up_proj` linear. Also, with replicated weights, the chunked input hidden_states can be used by the shared experts. Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
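A minimal sketch of the replicated merged linear idea described here; the class below is illustrative only and does not reflect how the PR actually integrates with vLLM's linear layers:

```python
import torch
import torch.nn.functional as F


class ReplicatedMergedColumnLinear(torch.nn.Module):
    """Fused gate/up projection whose full weight is kept on every rank.

    Unlike MergedColumnParallelLinear, the weight is not column-sharded, so
    no tensor-parallel communication is needed and each rank can feed its own
    chunk of hidden_states directly into the shared experts."""

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        # gate_proj and up_proj merged into a single [2 * inter, hidden] weight.
        self.weight = torch.nn.Parameter(
            torch.empty(2 * intermediate_size, hidden_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Plain matmul, no all-gather/all-reduce across TP ranks.
        return F.linear(x, self.weight)
```

Downstream, the fused output would be split into its gate and up halves and passed through the activation, as in the standard DeepSeek-V2 MLP.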
With the expected overlapping being:
```
| shared gate_up | shared act |                           | shared down |
| dispatch                    | routed gate_up, act, down | combine     |
```
Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
This reverts commit 7bdc606.
Contains #1111 for completeness.
What this PR does / why we need it?
Implement multi-stream parallelism for MoE layers with shared experts, where the computation of shared experts is overlapped with expert token dispatch and combine. Also, when multi-stream is enabled, the weights of shared experts are forced to be replicated across all cards, regardless of any tensor parallelism configuration, to avoid AllReduce operations.
With the expected overlapping being:
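```
| shared gate_up | shared act |                           | shared down |
| dispatch                    | routed gate_up, act, down | combine     |
```
(the same overlap diagram given in the commit message above)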
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Tested on a 1x16 910 node, with a tailored 2-layer DSKv2.