
[0.9.1][Feature] MoE alltoallv communication optimization for unquantized RL training scenario & alltoallv support dpo #1547


Open
wants to merge 41 commits into base: v0.9.1-dev

Conversation

weijinqian0 (Contributor) commented Jul 1, 2025

[Feature] MoE alltoallv communication optimization for unquantized RL training scenario & alltoallv support dpo

Introduction

This PR introduces two key optimizations for MoE model performance:

  1. Efficient Token Dispatcher:

    • Implements an optimized alltoallv_seq token dispatcher (adapted from NVIDIA Megatron and Ascend MindSpeed)
    • Significantly more efficient than the current alltoall implementation when used with token_permute/unpermute fusion
    • Enable with: VLLM_ASCEND_ENABLE_MOE_ALL2ALL_SEQ=1
  2. DBO Support for alltoallv_seq:

    • Builds upon the alltoallv_seq dispatcher to support DBO (Dual Batch Overlap)
    • Enables overlapping of alltoallv communication during the prefill stage
    • Enable with both of the following (see the usage sketch below):
      • VLLM_ASCEND_ENABLE_MOE_ALL2ALL_SEQ=1
      • VLLM_ASCEND_ENABLE_DBO=1
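
As a usage illustration, the sketch below shows one way to set these flags from Python before the engine is built. The environment variable names come from this PR; the model name, parallelism settings, and `LLM(...)` arguments are illustrative assumptions only.

```python
import os

# Enable the alltoallv_seq token dispatcher introduced by this PR.
# Set these before the engine is created so the Ascend platform code picks them up.
os.environ["VLLM_ASCEND_ENABLE_MOE_ALL2ALL_SEQ"] = "1"
# Optionally also enable Dual Batch Overlap so alltoallv communication
# can be overlapped during the prefill stage.
os.environ["VLLM_ASCEND_ENABLE_DBO"] = "1"

from vllm import LLM, SamplingParams

# Model and parallelism settings below are illustrative, not prescribed by the PR.
llm = LLM(model="Qwen/Qwen3-30B-A3B",
          tensor_parallel_size=4,
          enable_expert_parallel=True)
outputs = llm.generate(["An MoE model is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```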

Performance Improvements

Testing on Qwen3-30B-A3B shows a nearly 2x throughput improvement compared to the original alltoall implementation.

weijinqian_v1 added 12 commits July 1, 2025 09:51
weijinqian_v1 added 3 commits July 1, 2025 14:03

github-actions bot commented Jul 3, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.


weijinqian_v1 and others added 5 commits July 9, 2025 16:25
harygo22 added 6 commits July 9, 2025 16:28
weijinqian_v1 added 2 commits July 9, 2025 16:41
weijinqian_v1 added 2 commits July 9, 2025 23:48
@wangxiyuan changed the title from "[Feature] MoE alltoallv communication optimization for unquantized RL training scenario & alltoallv support dpo" to "[0.9.1][Feature] MoE alltoallv communication optimization for unquantized RL training scenario & alltoallv support dpo" on Jul 10, 2025
elif envs_ascend.VLLM_ASCEND_ENABLE_MOE_ALL2ALL_SEQ:
    # MC2 Dispatch/Combine performs better than alltoall_seq in the decoding
    # stage, so only use alltoall_seq for small EP groups or during prefill.
    return FusedMoEState.All2AllSeq if (
        ep_size < 16 or with_prefill) else FusedMoEState.MC2
Collaborator

Why is there this restriction ep_size < 16?

@harygo22 commented Jul 10, 2025

MC2 Dispatch/Combine is still faster than alltoall_seq in the decoding stage, so when ep_size >= 16 we use MC2 for better performance.
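
To make the rule concrete, here is a minimal standalone sketch of the selection logic discussed above. The helper function and the enum values are hypothetical reconstructions for illustration; only the `ep_size < 16 or with_prefill` condition mirrors the quoted hunk.

```python
from enum import Enum


class FusedMoEState(Enum):
    # Values are placeholders; the real enum in vllm-ascend may differ.
    All2AllSeq = "all2all_seq"
    MC2 = "mc2"


def pick_dispatcher(ep_size: int, with_prefill: bool) -> FusedMoEState:
    """Hypothetical helper: use alltoall_seq for small EP groups or during
    prefill; otherwise fall back to MC2, which is faster in decoding when
    ep_size >= 16."""
    return FusedMoEState.All2AllSeq if (ep_size < 16
                                        or with_prefill) else FusedMoEState.MC2


# Decoding with a 16-way expert-parallel group falls back to MC2,
# while an 8-way group keeps the alltoall_seq dispatcher.
assert pick_dispatcher(ep_size=16, with_prefill=False) is FusedMoEState.MC2
assert pick_dispatcher(ep_size=8, with_prefill=False) is FusedMoEState.All2AllSeq
```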
