
[Kernels] MoE refactor #19636


Merged

72 commits merged into vllm-project:main on Jul 2, 2025

Conversation

bnellnm
Contributor

@bnellnm bnellnm commented Jun 14, 2025

  • add FusedMoEQuantConfig for all quantization parameters
  • move MoE config data structures to config.py
  • add activation format method + enum, so that PrepareAndFinalize objects can verify that they work with a particular Experts object (see the illustrative sketch below).
  • refactor tests so all MoE tests are under tests/kernels/moe
  • add more tests
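
For reference, here is a minimal illustrative sketch of the two new structures described above. The enum members and default values shown are assumptions for illustration; the actual definitions in vLLM may differ.

# Illustrative sketch only -- enum members and defaults are assumptions,
# not necessarily vLLM's actual definitions.
from dataclasses import dataclass
from enum import Enum
from typing import Optional

import torch


class FusedMoEActivationFormat(Enum):
    # Layout of the activations exchanged between a PrepareAndFinalize
    # object and an Experts object.
    Standard = "standard"          # (num_tokens, hidden_dim), contiguous
    BatchedExperts = "batched"     # (num_experts, max_tokens_per_expert, hidden_dim)


@dataclass
class FusedMoEQuantConfig:
    # Collects the quantization parameters that used to be passed individually.
    quant_dtype: Optional[torch.dtype] = None   # e.g. torch.float8_e4m3fn
    per_act_token_quant: bool = False           # per-token vs. per-tensor activation scales
    block_shape: Optional[list[int]] = None     # e.g. [128, 128] for block-wise FP8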

Some lm-eval results

DeepGemm + DeepEP high throughput

lm_eval --model local-completions --tasks gsm8k --model_args model=Qwen/Qwen3-30B-A3B-FP8,base_url=http://127.0.0.1:9010/v1/completions,num_concurrent=30,max_retries=3,tokenized_requests=False --limit 100

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.85|±  |0.0359|
|     |       |strict-match    |     5|exact_match|↑  | 0.92|±  |0.0273|

DeepGemm + DeepEP low latency

lm_eval --model local-completions --tasks gsm8k --model_args model=Qwen/Qwen3-30B-A3B-FP8,base_url=http://127.0.0.1:9010/v1/completions,num_concurrent=30,max_retries=3,tokenized_requests=False --limit 100

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.85|±  |0.0359|
|     |       |strict-match    |     5|exact_match|↑  | 0.93|±  |0.0256|

Triton + PPLX

lm_eval --model local-completions --tasks gsm8k --model_args model=deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct,base_url=http://127.0.0.1:9010/v1/completions,num_concurrent=30,max_retries=3,tokenized_requests=False --limit 100

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.75|±  |0.0435|
|     |       |strict-match    |     5|exact_match|↑  | 0.73|±  |0.0446|

Note: PPLX doesn't currently work with the DeepGemm kernels. This should be (mostly) addressed in #18864


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @bnellnm, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the internal structure and testing of the fused MoE layers. The primary goal is to improve the organization and consistency of configuration parameters, especially those related to quantization and parallel execution strategies. By introducing dedicated config objects and enhancing the modular kernel interface, the changes aim to make the MoE implementation more maintainable and easier to extend with new kernels and quantization methods. The accompanying test refactor and expansion ensure broader coverage and better validation of these complex kernels.

Highlights

  • Configuration Refactor: Introduced dedicated dataclasses (FusedMoEConfig, FusedMoEParallelConfig, FusedMoEQuantConfig) to centralize and standardize MoE configuration parameters, including parallelism and quantization settings. These structures were moved to a new vllm/model_executor/layers/fused_moe/config.py file.
  • Modular Kernel Enhancement: Enhanced the modular kernel framework (FusedMoEModularKernel) by adding a FusedMoEActivationFormat enum and corresponding properties to the FusedMoEPrepareAndFinalize and FusedMoEPermuteExpertsUnpermute base classes. This allows kernels to declare their expected input/output activation formats and enables verification within the modular kernel wrapper (see the sketch below).
  • Quantization Parameter Handling: Updated various MoE kernel implementations (Triton, DeepGemm, Cutlass, PPLX, DeepEP) and their associated prepare/finalize steps to accept and utilize the new FusedMoEQuantConfig object, ensuring consistent handling of quantization parameters like quant_dtype, per_act_token_quant, and block_shape.
  • Test Refactoring and Expansion: Consolidated MoE-related tests into a dedicated tests/kernels/moe directory. Added new test files (test_block_fp8.py, test_block_int8.py, utils.py) and expanded existing tests (test_batched_moe.py, test_cutlass_moe.py, test_pplx_moe.py) to improve coverage, particularly for block-wise FP8 and INT8 quantization and DeepGemm integration.
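
To make the activation-format verification concrete, here is a hypothetical sketch of the kind of check described above; the property names and the exact assertion are assumptions, not necessarily the real FusedMoEModularKernel code.

# Hypothetical sketch of the activation-format compatibility check;
# property names here are assumptions.
class FusedMoEPrepareAndFinalize:
    @property
    def activation_format(self) -> "FusedMoEActivationFormat":
        # Format of the activations this object hands to the experts.
        raise NotImplementedError


class FusedMoEPermuteExpertsUnpermute:
    @property
    def activation_formats(self) -> tuple["FusedMoEActivationFormat", "FusedMoEActivationFormat"]:
        # (expected input format, produced output format)
        raise NotImplementedError


class FusedMoEModularKernel:
    def __init__(self, prepare_finalize: FusedMoEPrepareAndFinalize,
                 fused_experts: FusedMoEPermuteExpertsUnpermute):
        # Fail fast if the prepare/finalize step produces a layout the
        # experts implementation does not accept.
        assert prepare_finalize.activation_format == fused_experts.activation_formats[0]
        self.prepare_finalize = prepare_finalize
        self.fused_experts = fused_experts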

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant refactoring of MoE (Mixture of Experts) layers, primarily focusing on centralizing quantization parameters into a new FusedMoEQuantConfig data structure. This change aims to improve the modularity and clarity of MoE configurations.

Key changes include:

  • Introduction of FusedMoEQuantConfig, FusedMoEParallelConfig, and FusedMoEConfig in a new config.py file to manage MoE parameters.
  • Updates to various MoE expert implementations (Triton, Cutlass, DeepGEMM, DeepEP) and their corresponding PrepareAndFinalize classes to utilize these new config objects (a usage sketch follows this list).
  • Refactoring of the modular kernel interface (FusedMoEPermuteExpertsUnpermute, FusedMoEPrepareAndFinalize) to better integrate with the new configuration system and to introduce an activation_format concept for verifying compatibility between prepare/finalize and expert execution steps.
  • Expansion and refactoring of MoE tests, particularly to cover FP8 and INT8 block quantization, and to test PPLX kernels with quantization. Test utilities have been centralized.
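
As a hypothetical usage sketch of that consolidation: ExampleExperts and its arguments are placeholders invented for illustration, and FusedMoEQuantConfig is repeated from the earlier sketch so the snippet stands alone; none of this is vLLM's exact API.

# Hypothetical before/after -- ExampleExperts is a placeholder, not a vLLM class.
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class FusedMoEQuantConfig:
    quant_dtype: Optional[torch.dtype] = None
    per_act_token_quant: bool = False
    block_shape: Optional[list[int]] = None


class ExampleExperts:
    # Before the refactor, constructors took quantization settings as loose
    # keyword arguments; after it, a single config object carries them.
    def __init__(self, quant_config: FusedMoEQuantConfig):
        self.quant_config = quant_config


# Block-wise FP8, the case exercised by tests/kernels/moe/test_block_fp8.py.
experts = ExampleExperts(
    FusedMoEQuantConfig(
        quant_dtype=torch.float8_e4m3fn,
        per_act_token_quant=False,
        block_shape=[128, 128],
    ))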

Overall, the refactoring appears to be well-structured and moves towards a more maintainable MoE implementation. The main feedback points revolve around ensuring all parts of the code correctly adapt to the new configuration system, addressing potential dead code, and clarifying paths that might not yet be fully implemented (NYI assertions).

A critical issue was found in the new config.py due to a missing logger import, which would cause a runtime error. Other points relate to code clarity, potential redundancies, and ensuring all quantization paths are correctly handled or explicitly marked as not yet implemented.


mergify bot commented Jun 24, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @bnellnm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jun 24, 2025
Comment on lines +94 to +112
class FusedMoEParallelConfig:
    tp_size: int
    dp_size: int
    ep_size: int
    tp_rank: int
    dp_rank: int
    ep_rank: int
    world_size: int
Collaborator


I think the tp_size and the dp_size are actually properties of the Attn layers rather than the MoE layers?

This is potentially confusing as we will likely in the future have both TP+EP in the MoE layer itself. See #20037

Maybe we should break things down to have an AttnParallelConfig and a FusedMoE parallel config?
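
A rough sketch of the split being suggested here; the names and field groupings are hypothetical, purely to illustrate the idea (in this PR all of these fields live in the single FusedMoEParallelConfig quoted above).

# Hypothetical sketch of the suggested split -- not code from this PR.
from dataclasses import dataclass


@dataclass
class AttnParallelConfig:
    # tp/dp describe how the attention layers are sharded and replicated.
    tp_size: int
    tp_rank: int
    dp_size: int
    dp_rank: int


@dataclass
class FusedMoEParallelConfig:
    # The MoE layer itself may combine TP and EP in the future (see #20037).
    tp_size: int
    tp_rank: int
    ep_size: int
    ep_rank: int
    world_size: int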

Comment on lines 42 to 44
# per-tensor
# ?
hidden_scale_bytes = round_up(elem_size, align)
Collaborator


This must be per token + some tweaks for alignment issues -- @ElizaWszola could you fill in what this comment should say?


mergify bot commented Jun 26, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @bnellnm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jun 26, 2025
@mergify mergify bot added the performance (Performance-related issues) label and removed the needs-rebase label Jun 27, 2025
@bnellnm bnellnm requested review from mgoin and tlrmchlsmth June 27, 2025 20:38
bnellnm added 14 commits July 2, 2025 02:27, each signed off by Bill Nell <bnell@redhat.com>
auto-merge was automatically disabled July 2, 2025 02:31

Head branch was pushed to by a user without write access

@tlrmchlsmth tlrmchlsmth enabled auto-merge (squash) July 2, 2025 02:52
Signed-off-by: Bill Nell <bnell@redhat.com>
auto-merge was automatically disabled July 2, 2025 03:10

Head branch was pushed to by a user without write access

Signed-off-by: Bill Nell <bnell@redhat.com>
@vllm-bot vllm-bot merged commit c1909e7 into vllm-project:main Jul 2, 2025
74 of 78 checks passed
@huydhn
Contributor

huydhn commented Jul 3, 2025

@bnellnm I'm starting to see this error, AttributeError: 'CompressedTensorsW8A8Fp8MoECutlassMethod' object has no attribute 'topk_indices_dtype', after this was merged in c1909e7 when trying to serve meta-llama/llama-4-maverick-17b-128e-instruct-fp8. For example, https://github.com/pytorch/pytorch-integration-testing/actions/runs/16038022394/job/45253967329#step:14:2856. Any thoughts? I could create an issue for this if needed.

cc @houseroad @yeqcharlotte

@mgoin
Member

mgoin commented Jul 3, 2025

@huydhn please see if this PR fixes your issue #20381

EDIT: sorry I meant to link #20166

@minosfuture
Contributor

> @bnellnm I'm starting to see this error, AttributeError: 'CompressedTensorsW8A8Fp8MoECutlassMethod' object has no attribute 'topk_indices_dtype', after this was merged in c1909e7 when trying to serve meta-llama/llama-4-maverick-17b-128e-instruct-fp8. For example, https://github.com/pytorch/pytorch-integration-testing/actions/runs/16038022394/job/45253967329#step:14:2856. Any thoughts? I could create an issue for this if needed.
>
> cc @houseroad

I rebased #20166 and fixed the same issue. Should be good now if you patch it.

@huydhn
Contributor

huydhn commented Jul 4, 2025

Just another note that this failure also manifests on ROCm https://github.com/pytorch/pytorch-integration-testing/actions/runs/16063072569/job/45332643703#step:14:13108 after 78fe775

@@ -330,22 +355,18 @@ def cutlass_moe_fp8(
Returns:
- torch.Tensor: The fp16 output tensor after applying the MoE layer.
"""
per_act_token = a1_scale.numel() != 1 if a1_scale is not None else (
Collaborator

@luccafong luccafong Jul 5, 2025


When we remove this, there is no default value of per_act_token assigned, and the caller doesn't assign the correct value either, e.g. in

return self.fused_experts(
    x,
    layer.w13_weight,
    layer.w2_weight,
    topk_weights,
    topk_ids,
    activation=activation,

Could you please fix it?

@huydhn
Contributor

huydhn commented Jul 7, 2025

> @bnellnm I'm starting to see this error, AttributeError: 'CompressedTensorsW8A8Fp8MoECutlassMethod' object has no attribute 'topk_indices_dtype', after this was merged in c1909e7 when trying to serve meta-llama/llama-4-maverick-17b-128e-instruct-fp8. For example, https://github.com/pytorch/pytorch-integration-testing/actions/runs/16038022394/job/45253967329#step:14:2856. Any thoughts? I could create an issue for this if needed.

The issue has been fixed after #20509

huydhn pushed a commit to huydhn/vllm that referenced this pull request Jul 8, 2025
Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Co-authored-by: ElizaWszola <ewszola@redhat.com>
Labels
ci/build · performance (Performance-related issues) · ready (ONLY add when PR is ready to merge/full CI is needed)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants