[Kernels] MoE refactor #19636
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run full CI to test the changes comprehensively before merging.
Summary of Changes
Hello @bnellnm, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly refactors the internal structure and testing of the fused MoE layers. The primary goal is to improve the organization and consistency of configuration parameters, especially those related to quantization and parallel execution strategies. By introducing dedicated config objects and enhancing the modular kernel interface, the changes aim to make the MoE implementation more maintainable and easier to extend with new kernels and quantization methods. The accompanying test refactor and expansion ensure broader coverage and better validation of these complex kernels.
Highlights
- Configuration Refactor: Introduced dedicated dataclasses (`FusedMoEConfig`, `FusedMoEParallelConfig`, `FusedMoEQuantConfig`) to centralize and standardize MoE configuration parameters, including parallelism and quantization settings. These structures were moved to a new `vllm/model_executor/layers/fused_moe/config.py` file. (A rough sketch of these configs appears after this list.)
- Modular Kernel Enhancement: Enhanced the modular kernel framework (`FusedMoEModularKernel`) by adding a `FusedMoEActivationFormat` enum and corresponding properties to the `FusedMoEPrepareAndFinalize` and `FusedMoEPermuteExpertsUnpermute` base classes. This allows kernels to declare their expected input/output activation formats and enables verification within the modular kernel wrapper.
- Quantization Parameter Handling: Updated various MoE kernel implementations (Triton, DeepGemm, Cutlass, PPLX, DeepEP) and their associated prepare/finalize steps to accept and utilize the new `FusedMoEQuantConfig` object, ensuring consistent handling of quantization parameters like `quant_dtype`, `per_act_token_quant`, and `block_shape`.
- Test Refactoring and Expansion: Consolidated MoE-related tests into a dedicated `tests/kernels/moe` directory. Added new test files (`test_block_fp8.py`, `test_block_int8.py`, `utils.py`) and expanded existing tests (`test_batched_moe.py`, `test_cutlass_moe.py`, `test_pplx_moe.py`) to improve coverage, particularly for block-wise FP8 and INT8 quantization and DeepGemm integration.
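As a rough illustration of the configuration objects named above, here is a minimal sketch; the actual definitions live in `vllm/model_executor/layers/fused_moe/config.py`, and any fields or defaults not mentioned in the summary (e.g. `num_experts`, `experts_per_token`, `hidden_dim`) are assumptions for illustration only.

```python
# Hypothetical sketch of the centralized MoE configs described above; the
# real dataclasses in fused_moe/config.py may have different fields/defaults.
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class FusedMoEQuantConfig:
    # Quantized activation dtype (e.g. torch.float8_e4m3fn); None means unquantized.
    quant_dtype: Optional[torch.dtype] = None
    # True -> one scale per token; False -> one scale per tensor.
    per_act_token_quant: bool = False
    # Block shape for block-wise quantization (e.g. [128, 128]); None if unused.
    block_shape: Optional[list[int]] = None


@dataclass
class FusedMoEParallelConfig:
    tp_size: int
    dp_size: int
    ep_size: int
    tp_rank: int
    dp_rank: int
    ep_rank: int
    world_size: int


@dataclass
class FusedMoEConfig:
    # Field names below the parallel/quant configs are illustrative.
    num_experts: int
    experts_per_token: int
    hidden_dim: int
    moe_parallel_config: FusedMoEParallelConfig
    quant_config: Optional[FusedMoEQuantConfig] = None
```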
Code Review
This pull request introduces a significant refactoring of MoE (Mixture of Experts) layers, primarily focusing on centralizing quantization parameters into a new `FusedMoEQuantConfig` data structure. This change aims to improve the modularity and clarity of MoE configurations.
Key changes include:
- Introduction of `FusedMoEQuantConfig`, `FusedMoEParallelConfig`, and `FusedMoEConfig` in a new `config.py` file to manage MoE parameters.
- Updates to various MoE expert implementations (Triton, Cutlass, DeepGEMM, DeepEP) and their corresponding `PrepareAndFinalize` classes to utilize these new config objects.
- Refactoring of the modular kernel interface (`FusedMoEPermuteExpertsUnpermute`, `FusedMoEPrepareAndFinalize`) to better integrate with the new configuration system and to introduce an `activation_format` concept for verifying compatibility between the prepare/finalize and expert execution steps (see the sketch after this review).
- Expansion and refactoring of MoE tests, particularly to cover FP8 and INT8 block quantization and to test PPLX kernels with quantization. Test utilities have been centralized.
Overall, the refactoring appears to be well-structured and moves towards a more maintainable MoE implementation. The main feedback points revolve around ensuring all parts of the code correctly adapt to the new configuration system, addressing potential dead code, and clarifying paths that might not yet be fully implemented (NYI assertions).
A critical issue was found in the new `config.py` due to a missing logger import, which would cause a runtime error. Other points relate to code clarity, potential redundancies, and ensuring all quantization paths are correctly handled or explicitly marked as not yet implemented.
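To make the `activation_format` idea concrete, here is a minimal sketch of the compatibility check described above. The enum members, property names, and the exact assertion are assumptions based on this summary; the real interface in `fused_moe/modular_kernel.py` is richer.

```python
# Illustrative sketch only: how a modular kernel wrapper could verify that the
# prepare/finalize step produces the activation layout the experts expect.
from enum import Enum


class FusedMoEActivationFormat(Enum):
    Standard = "standard"        # e.g. (num_tokens, hidden_dim)
    BatchedExperts = "batched"   # e.g. (num_experts, max_tokens_per_expert, hidden_dim)


class FusedMoEPrepareAndFinalize:
    @property
    def activation_format(self) -> FusedMoEActivationFormat:
        raise NotImplementedError


class FusedMoEPermuteExpertsUnpermute:
    @property
    def activation_formats(
            self) -> tuple[FusedMoEActivationFormat, FusedMoEActivationFormat]:
        """(expected input format, produced output format)."""
        raise NotImplementedError


class FusedMoEModularKernel:
    def __init__(self, prepare_finalize: FusedMoEPrepareAndFinalize,
                 fused_experts: FusedMoEPermuteExpertsUnpermute):
        # Fail fast if the two halves of the pipeline disagree on layout.
        assert (prepare_finalize.activation_format ==
                fused_experts.activation_formats[0]), (
            f"{prepare_finalize.activation_format} != "
            f"{fused_experts.activation_formats[0]}")
        self.prepare_finalize = prepare_finalize
        self.fused_experts = fused_experts
```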
```python
class FusedMoEParallelConfig:
    tp_size: int
    dp_size: int
    ep_size: int
    tp_rank: int
    dp_rank: int
    ep_rank: int
    world_size: int
```
I think tp_size and dp_size are actually properties of the Attn layers rather than the MoE layers?
This is potentially confusing, as we will likely have both TP+EP in the MoE layer itself in the future. See #20037.
Maybe we should break things down into an AttnParallelConfig and a FusedMoEParallelConfig? (A hypothetical sketch of such a split follows.)
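A purely hypothetical shape for that split; these classes do not exist in vLLM, and the grouping of fields is an assumption meant only to illustrate the reviewer's suggestion.

```python
# Illustrative only: separating attention-side parallelism from MoE-side
# parallelism, as suggested in the comment above.
from dataclasses import dataclass


@dataclass
class AttnParallelConfig:
    # TP/DP as seen by the attention layers.
    tp_size: int
    tp_rank: int
    dp_size: int
    dp_rank: int


@dataclass
class FusedMoEParallelConfig:
    # EP (and any MoE-local TP) as seen by the MoE layers.
    ep_size: int
    ep_rank: int
    tp_size: int
    tp_rank: int
    world_size: int
```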
vllm/model_executor/layers/fused_moe/deepep_ll_prepare_finalize.py
```python
# per-tensor
# ?
hidden_scale_bytes = round_up(elem_size, align)
```
This must be per-token plus some tweaks for alignment issues -- @ElizaWszola, could you fill in what this comment should say? (An illustrative sketch of the per-tensor vs. per-token sizing follows.)
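For context, a hedged sketch of the distinction being discussed; the `round_up` helper and the exact alignment rules used by the DeepEP low-latency buffers are assumptions, so treat this as illustration rather than the actual sizing code.

```python
# Illustrative only: scale-buffer size under per-tensor vs. per-token
# activation quantization.
def round_up(x: int, align: int) -> int:
    # Round x up to the nearest multiple of align.
    return (x + align - 1) // align * align


def hidden_scale_bytes(num_tokens: int, elem_size: int, align: int,
                       per_act_token_quant: bool) -> int:
    if per_act_token_quant:
        # One scale per token, padded to the required alignment.
        return round_up(num_tokens * elem_size, align)
    # Per-tensor: a single scale regardless of the number of tokens.
    return round_up(elem_size, align)
```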
@bnellnm I'm starting to see this error.
I rebased #20166 and fixed the same issue. Should be good now if you patch it.
Just another note that this failure also manifests on ROCm (https://github.com/pytorch/pytorch-integration-testing/actions/runs/16063072569/job/45332643703#step:14:13108) after 78fe775.
```
@@ -330,22 +355,18 @@ def cutlass_moe_fp8(
     Returns:
     - torch.Tensor: The fp16 output tensor after applying the MoE layer.
     """
     per_act_token = a1_scale.numel() != 1 if a1_scale is not None else (
```
When we remove this, there is no default value of per_act_token assigned, and the correct value is also not being passed from callers such as vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py, lines 934 to 940 at 7e90870:
```python
return self.fused_experts(
    x,
    layer.w13_weight,
    layer.w2_weight,
    topk_weights,
    topk_ids,
    activation=activation,
```
The issue has been fixed after #20509.
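For reference, a hedged sketch of the contrast under discussion: inferring per_act_token from the scale tensor's shape (the old behavior quoted in the diff above) versus stating it explicitly via the quant config. The exact signatures in `cutlass_moe_fp8` and `compressed_tensors_moe.py` may differ, and the commented call at the bottom uses assumed names.

```python
# Illustrative only: shape-based inference vs. an explicit flag.
from typing import Optional

import torch


def infer_per_act_token(a1_scale: Optional[torch.Tensor],
                        a2_scale: Optional[torch.Tensor]) -> bool:
    # Old behavior: a multi-element activation scale implies per-token quant.
    if a1_scale is not None:
        return a1_scale.numel() != 1
    return a2_scale is not None and a2_scale.numel() != 1


# New style (assumed): the caller states the intent explicitly instead of the
# kernel guessing from tensor shapes, e.g.
# quant_config = FusedMoEQuantConfig(quant_dtype=torch.float8_e4m3fn,
#                                    per_act_token_quant=True)
# and the experts implementation reads quant_config.per_act_token_quant.
```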
Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Co-authored-by: ElizaWszola <ewszola@redhat.com>
Some lm-eval results:
- DeepGemm + DeepEP high throughput
- DeepGemm + DeepEP low latency
- Triton + PPLX

Note: PPLX doesn't currently work with the DeepGemm kernels. This should be (mostly) addressed in #18864.