[Refactor] Collect scattered w8a8-dynamic quantization operations #1111
Conversation
This pull request has conflicts, please resolve those before we can evaluate the pull request.
@Yikun @wangxiyuan @ganyi1996ppo Please take a look at this.
There is a w8a8 weight already: https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V2-Lite-W8A8. Can you give it a try?
E2E test case added. @wangxiyuan
AscendW8A8DynamicLinearMethod is integrated into CustomDeepseekV2MLP in a very awkward way, causing scattered quantization operations all over the model scripts. Refactor to solve this problem. Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
The chosen model is vllm-ascend/DeepSeek-V2-Lite-W8A8. Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
Contains #1111 for completeness.

### What this PR does / why we need it?
Implement multi-stream parallelism for MoE layers with shared experts, where the computation of shared experts is overlapped with expert token dispatch and combine. Also, when multi-stream is enabled, the weights of shared experts are forced to be replicated across all cards, regardless of any tensor parallelism configuration, to avoid AllReduce operations. The expected overlap is:

```
| shared gate_up | shared act |                            | shared down |
| dispatch                    | routed gate_up, act, down  | combine     |
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Tested on a 1x16 910 node, with a tailored 2-layer DSKv2.

---------

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
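Below is a minimal, self-contained sketch of this overlap pattern. It uses CUDA stream APIs and tiny `nn.Linear` layers purely as stand-ins: the actual implementation targets Ascend NPU streams via `torch_npu`, and the real shared/routed expert kernels (and the cross-device dispatch/combine) are far more involved than what is shown here.

```python
# Illustrative only: CUDA streams stand in for the NPU streams used by
# vllm-ascend; the Linear modules stand in for the real expert kernels.
import torch
import torch.nn as nn

hidden = 64
shared_experts = nn.Sequential(nn.Linear(hidden, 4 * hidden),
                               nn.SiLU(),
                               nn.Linear(4 * hidden, hidden)).cuda()
routed_experts = nn.Linear(hidden, hidden).cuda()  # stand-in for MoE experts

side_stream = torch.cuda.Stream()

def moe_forward(x: torch.Tensor) -> torch.Tensor:
    # Shared experts run on a side stream, overlapping with the work below.
    side_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side_stream):
        shared_out = shared_experts(x)  # shared gate_up -> act -> down

    # Main stream: token dispatch, routed experts, combine. The real code
    # performs cross-device all-to-all dispatch/combine; it is elided here.
    routed_out = routed_experts(x)

    # Join the streams before mixing the two partial results.
    torch.cuda.current_stream().wait_stream(side_stream)
    shared_out.record_stream(torch.cuda.current_stream())
    return routed_out + shared_out

out = moe_forward(torch.randn(8, hidden, device="cuda"))
```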
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Closed since its superset #997 has been merged.
What this PR does / why we need it?
The current integration of `AscendW8A8DynamicLinearMethod` into `CustomDeepseekV2MLP` is fragile. Refactor to properly use the `quant_method` by calling the `apply` functions of the various vLLM-predefined `LinearBase` subclasses.
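As a conceptual illustration of that direction, here is a small self-contained sketch in plain PyTorch. The class names are hypothetical stand-ins, not the vLLM or vllm-ascend code: the point is that all w8a8-dynamic logic sits in one linear-method object, and the MLP only ever calls its linear layers, mirroring how a vLLM `LinearBase` subclass delegates to `quant_method.apply`. The real refactor uses vLLM's parallel linear classes together with `AscendW8A8DynamicLinearMethod`.

```python
# Hypothetical sketch of the "quant_method" pattern; not the actual classes.
import torch
import torch.nn.functional as F

class W8A8DynamicLinearMethodSketch:
    """Stand-in for a w8a8-dynamic linear method: per-token dynamic
    activation quantization plus the (de)quantized matmul, in one place."""

    def apply(self, layer: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
        # Dynamic per-token symmetric int8 scale for the activations.
        x_scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
        x_q = torch.clamp(torch.round(x / x_scale), -128, 127)
        # Dequantize and matmul; a real NPU kernel would stay in int8.
        w_deq = layer.weight_q.to(x.dtype) * layer.weight_scale
        return (x_q * x_scale) @ w_deq.t()

class QuantLinearSketch(torch.nn.Module):
    """Stand-in for a LinearBase subclass whose forward() only delegates
    to quant_method.apply()."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        w = torch.randn(out_features, in_features)
        self.weight_scale = w.abs().amax(dim=-1, keepdim=True) / 127.0
        self.weight_q = torch.clamp(torch.round(w / self.weight_scale),
                                    -128, 127)
        self.quant_method = W8A8DynamicLinearMethodSketch()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.quant_method.apply(self, x)

class MLPSketch(torch.nn.Module):
    """The MLP itself contains no quantization code at all."""

    def __init__(self, hidden: int, intermediate: int):
        super().__init__()
        self.gate_up = QuantLinearSketch(hidden, 2 * intermediate)
        self.down = QuantLinearSketch(intermediate, hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, up = self.gate_up(x).chunk(2, dim=-1)
        return self.down(F.silu(gate) * up)

print(MLPSketch(16, 32)(torch.randn(4, 16)).shape)  # torch.Size([4, 16])
```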
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added an e2e test `test_models_distributed_DeepSeek_W8A8`, using `vllm-ascend/DeepSeek-V2-Lite-W8A8` for inference.
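For reference, a hypothetical sketch of what such a test might look like. The actual test lives in the vllm-ascend e2e suite; the tensor-parallel size, prompts, sampling settings, and model-load arguments below are assumptions, and a multi-card Ascend environment is required to run it.

```python
# Hypothetical sketch of the distributed W8A8 e2e test; not the actual test.
import os

import pytest
from vllm import LLM, SamplingParams

MODEL = "vllm-ascend/DeepSeek-V2-Lite-W8A8"

@pytest.mark.parametrize("tp_size", [4])
def test_models_distributed_DeepSeek_W8A8(tp_size: int) -> None:
    # Pull the weight from ModelScope rather than the HuggingFace Hub.
    os.environ["VLLM_USE_MODELSCOPE"] = "true"
    llm = LLM(model=MODEL,
              tensor_parallel_size=tp_size,
              max_model_len=4096,
              enforce_eager=True,
              trust_remote_code=True)
    prompts = ["Hello, my name is", "The capital of France is"]
    outputs = llm.generate(prompts, SamplingParams(temperature=0.0,
                                                   max_tokens=16))
    # Smoke check: every prompt produced a non-empty completion.
    assert len(outputs) == len(prompts)
    assert all(o.outputs[0].text for o in outputs)
```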