
[Kernels] MoE refactor #19636


Merged

72 commits merged into vllm-project:main on Jul 2, 2025

Conversation

bnellnm
Contributor

@bnellnm bnellnm commented Jun 14, 2025

  • add FusedMoEQuantConfig for all quantization parameters
  • move MoE config data structures to config.py
  • add activation format method + enum, so that PrepareAndFinalize objects can verify that they work with a particular Experts object (see the illustrative sketch below).
  • refactor tests so all MoE tests are under tests/kernels/moe
  • add more tests
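
For reference, here is a minimal illustrative sketch of the two new structures described above. The enum members and default values shown are assumptions for illustration; the actual definitions in vLLM may differ.

# Illustrative sketch only -- enum members and defaults are assumptions,
# not necessarily vLLM's actual definitions.
from dataclasses import dataclass
from enum import Enum
from typing import Optional

import torch


class FusedMoEActivationFormat(Enum):
    # Layout of the activations exchanged between a PrepareAndFinalize
    # object and an Experts object.
    Standard = "standard"          # (num_tokens, hidden_dim), contiguous
    BatchedExperts = "batched"     # (num_experts, max_tokens_per_expert, hidden_dim)


@dataclass
class FusedMoEQuantConfig:
    # Collects the quantization parameters that used to be passed individually.
    quant_dtype: Optional[torch.dtype] = None   # e.g. torch.float8_e4m3fn
    per_act_token_quant: bool = False           # per-token vs. per-tensor activation scales
    block_shape: Optional[list[int]] = None     # e.g. [128, 128] for block-wise FP8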

Some lm-eval results

DeepGemm + DeepEP high throughput

lm_eval --model local-completions --tasks gsm8k --model_args model=Qwen/Qwen3-30B-A3B-FP8,base_url=http://127.0.0.1:9010/v1/completions,num_concurrent=30,max_retries=3,tokenized_requests=False --limit 100

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.85|±  |0.0359|
|     |       |strict-match    |     5|exact_match|↑  | 0.92|±  |0.0273|

DeepGemm + DeepEP low latency

lm_eval --model local-completions --tasks gsm8k --model_args model=Qwen/Qwen3-30B-A3B-FP8,base_url=http://127.0.0.1:9010/v1/completions,num_concurrent=30,max_retries=3,tokenized_requests=False --limit 100

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.85|±  |0.0359|
|     |       |strict-match    |     5|exact_match|↑  | 0.93|±  |0.0256|

Triton + PPLX

lm_eval --model local-completions --tasks gsm8k --model_args model=deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct,base_url=http://127.0.0.1:9010/v1/completions,num_concurrent=30,max_retries=3,tokenized_requests=False --limit 100

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.75|±  |0.0435|
|     |       |strict-match    |     5|exact_match|↑  | 0.73|±  |0.0446|

Note: PPLX doesn't currently work with the DeepGemm kernels. This should be (mostly) addressed in #18864


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @bnellnm, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the internal structure and testing of the fused MoE layers. The primary goal is to improve the organization and consistency of configuration parameters, especially those related to quantization and parallel execution strategies. By introducing dedicated config objects and enhancing the modular kernel interface, the changes aim to make the MoE implementation more maintainable and easier to extend with new kernels and quantization methods. The accompanying test refactor and expansion ensure broader coverage and better validation of these complex kernels.

Highlights

  • Configuration Refactor: Introduced dedicated dataclasses (FusedMoEConfig, FusedMoEParallelConfig, FusedMoEQuantConfig) to centralize and standardize MoE configuration parameters, including parallelism and quantization settings. These structures were moved to a new vllm/model_executor/layers/fused_moe/config.py file.
  • Modular Kernel Enhancement: Enhanced the modular kernel framework (FusedMoEModularKernel) by adding a FusedMoEActivationFormat enum and corresponding properties to the FusedMoEPrepareAndFinalize and FusedMoEPermuteExpertsUnpermute base classes. This allows kernels to declare their expected input/output activation formats and enables verification within the modular kernel wrapper (see the sketch below).
  • Quantization Parameter Handling: Updated various MoE kernel implementations (Triton, DeepGemm, Cutlass, PPLX, DeepEP) and their associated prepare/finalize steps to accept and utilize the new FusedMoEQuantConfig object, ensuring consistent handling of quantization parameters like quant_dtype, per_act_token_quant, and block_shape.
  • Test Refactoring and Expansion: Consolidated MoE-related tests into a dedicated tests/kernels/moe directory. Added new test files (test_block_fp8.py, test_block_int8.py, utils.py) and expanded existing tests (test_batched_moe.py, test_cutlass_moe.py, test_pplx_moe.py) to improve coverage, particularly for block-wise FP8 and INT8 quantization and DeepGemm integration.
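
To make the activation-format verification concrete, here is a hypothetical sketch of the kind of check described above; the property names and the exact assertion are assumptions, not necessarily the real FusedMoEModularKernel code.

# Hypothetical sketch of the activation-format compatibility check;
# property names here are assumptions.
class FusedMoEPrepareAndFinalize:
    @property
    def activation_format(self) -> "FusedMoEActivationFormat":
        # Format of the activations this object hands to the experts.
        raise NotImplementedError


class FusedMoEPermuteExpertsUnpermute:
    @property
    def activation_formats(self) -> tuple["FusedMoEActivationFormat", "FusedMoEActivationFormat"]:
        # (expected input format, produced output format)
        raise NotImplementedError


class FusedMoEModularKernel:
    def __init__(self, prepare_finalize: FusedMoEPrepareAndFinalize,
                 fused_experts: FusedMoEPermuteExpertsUnpermute):
        # Fail fast if the prepare/finalize step produces a layout the
        # experts implementation does not accept.
        assert prepare_finalize.activation_format == fused_experts.activation_formats[0]
        self.prepare_finalize = prepare_finalize
        self.fused_experts = fused_experts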

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant refactoring of MoE (Mixture of Experts) layers, primarily focusing on centralizing quantization parameters into a new FusedMoEQuantConfig data structure. This change aims to improve the modularity and clarity of MoE configurations.

Key changes include:

  • Introduction of FusedMoEQuantConfig, FusedMoEParallelConfig, and FusedMoEConfig in a new config.py file to manage MoE parameters.
  • Updates to various MoE expert implementations (Triton, Cutlass, DeepGEMM, DeepEP) and their corresponding PrepareAndFinalize classes to utilize these new config objects (a usage sketch follows this list).
  • Refactoring of the modular kernel interface (FusedMoEPermuteExpertsUnpermute, FusedMoEPrepareAndFinalize) to better integrate with the new configuration system and to introduce an activation_format concept for verifying compatibility between prepare/finalize and expert execution steps.
  • Expansion and refactoring of MoE tests, particularly to cover FP8 and INT8 block quantization, and to test PPLX kernels with quantization. Test utilities have been centralized.
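
As a hypothetical usage sketch of that consolidation: ExampleExperts and its arguments are placeholders invented for illustration, and FusedMoEQuantConfig is repeated from the earlier sketch so the snippet stands alone; none of this is vLLM's exact API.

# Hypothetical before/after -- ExampleExperts is a placeholder, not a vLLM class.
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class FusedMoEQuantConfig:
    quant_dtype: Optional[torch.dtype] = None
    per_act_token_quant: bool = False
    block_shape: Optional[list[int]] = None


class ExampleExperts:
    # Before the refactor, constructors took quantization settings as loose
    # keyword arguments; after it, a single config object carries them.
    def __init__(self, quant_config: FusedMoEQuantConfig):
        self.quant_config = quant_config


# Block-wise FP8, the case exercised by tests/kernels/moe/test_block_fp8.py.
experts = ExampleExperts(
    FusedMoEQuantConfig(
        quant_dtype=torch.float8_e4m3fn,
        per_act_token_quant=False,
        block_shape=[128, 128],
    ))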

Overall, the refactoring appears to be well-structured and moves towards a more maintainable MoE implementation. The main feedback points revolve around ensuring all parts of the code correctly adapt to the new configuration system, addressing potential dead code, and clarifying paths that might not yet be fully implemented (NYI assertions).

A critical issue was found in the new config.py due to a missing logger import, which would cause a runtime error. Other points relate to code clarity, potential redundancies, and ensuring all quantization paths are correctly handled or explicitly marked as not yet implemented.


mergify bot commented Jun 24, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @bnellnm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jun 24, 2025
Comment on lines +94 to +112
class FusedMoEParallelConfig:
    tp_size: int
    dp_size: int
    ep_size: int
    tp_rank: int
    dp_rank: int
    ep_rank: int
    world_size: int
Collaborator


I think the tp_size and the dp_size are actually properties of the Attn layers rather than the MoE layers?

This is potentially confusing as we will likely in the future have both TP+EP in the MoE layer itself. See #20037

Maybe we should break things down to have an AttnParallelConfig and a FusedMoE parallel config?
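
A rough sketch of the split being suggested here; the names and field groupings are hypothetical, purely to illustrate the idea (in this PR all of these fields live in the single FusedMoEParallelConfig quoted above).

# Hypothetical sketch of the suggested split -- not code from this PR.
from dataclasses import dataclass


@dataclass
class AttnParallelConfig:
    # tp/dp describe how the attention layers are sharded and replicated.
    tp_size: int
    tp_rank: int
    dp_size: int
    dp_rank: int


@dataclass
class FusedMoEParallelConfig:
    # The MoE layer itself may combine TP and EP in the future (see #20037).
    tp_size: int
    tp_rank: int
    ep_size: int
    ep_rank: int
    world_size: int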

Comment on lines 42 to 44
# per-tensor
# ?
hidden_scale_bytes = round_up(elem_size, align)
Collaborator


This must be per token + some tweaks for alignment issues -- @ElizaWszola could you fill in what this comment should say?


mergify bot commented Jun 26, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @bnellnm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jun 26, 2025
@mergify mergify bot added the performance (Performance-related issues) label and removed the needs-rebase label Jun 27, 2025
@bnellnm bnellnm requested review from mgoin and tlrmchlsmth June 27, 2025 20:38
bnellnm added 14 commits July 2, 2025 02:27, each signed off by Bill Nell <bnell@redhat.com>
auto-merge was automatically disabled July 2, 2025 02:31

Head branch was pushed to by a user without write access

@tlrmchlsmth tlrmchlsmth enabled auto-merge (squash) July 2, 2025 02:52
Signed-off-by: Bill Nell <bnell@redhat.com>
auto-merge was automatically disabled July 2, 2025 03:10

Head branch was pushed to by a user without write access

Signed-off-by: Bill Nell <bnell@redhat.com>
@vllm-bot vllm-bot merged commit c1909e7 into vllm-project:main Jul 2, 2025
74 of 78 checks passed
@huydhn
Contributor

huydhn commented Jul 3, 2025

@bnellnm I'm starting to see this error, AttributeError: 'CompressedTensorsW8A8Fp8MoECutlassMethod' object has no attribute 'topk_indices_dtype', after this was merged in c1909e7 when trying to serve meta-llama/llama-4-maverick-17b-128e-instruct-fp8. For example, https://github.com/pytorch/pytorch-integration-testing/actions/runs/16038022394/job/45253967329#step:14:2856. Any thoughts? I could create an issue for this if needed.

cc @houseroad @yeqcharlotte

@mgoin
Member

mgoin commented Jul 3, 2025

@huydhn please see if this PR fixes your issue #20381

EDIT: sorry I meant to link #20166

@minosfuture
Contributor

> @bnellnm I'm starting to see this error, AttributeError: 'CompressedTensorsW8A8Fp8MoECutlassMethod' object has no attribute 'topk_indices_dtype', after this was merged in c1909e7 when trying to serve meta-llama/llama-4-maverick-17b-128e-instruct-fp8. For example, https://github.com/pytorch/pytorch-integration-testing/actions/runs/16038022394/job/45253967329#step:14:2856. Any thoughts? I could create an issue for this if needed.
>
> cc @houseroad

I rebased #20166 and fixed the same issue. Should be good now if you patch it.

@huydhn
Contributor

huydhn commented Jul 4, 2025

Just another note that this failure also manifests on ROCm https://github.com/pytorch/pytorch-integration-testing/actions/runs/16063072569/job/45332643703#step:14:13108 after 78fe775

@@ -330,22 +355,18 @@ def cutlass_moe_fp8(
Returns:
- torch.Tensor: The fp16 output tensor after applying the MoE layer.
"""
per_act_token = a1_scale.numel() != 1 if a1_scale is not None else (
Collaborator

@luccafong luccafong Jul 5, 2025


When we remove this, there is no default value of per_act_token assigned, and the caller doesn't assign the correct value either, e.g. in

return self.fused_experts(
    x,
    layer.w13_weight,
    layer.w2_weight,
    topk_weights,
    topk_ids,
    activation=activation,

Could you please fix it?

@huydhn
Contributor

huydhn commented Jul 7, 2025

> @bnellnm I'm starting to see this error, AttributeError: 'CompressedTensorsW8A8Fp8MoECutlassMethod' object has no attribute 'topk_indices_dtype', after this was merged in c1909e7 when trying to serve meta-llama/llama-4-maverick-17b-128e-instruct-fp8. For example, https://github.com/pytorch/pytorch-integration-testing/actions/runs/16038022394/job/45253967329#step:14:2856. Any thoughts? I could create an issue for this if needed.

The issue has been fixed after #20509

huydhn pushed a commit to huydhn/vllm that referenced this pull request Jul 8, 2025
Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Co-authored-by: ElizaWszola <ewszola@redhat.com>
Labels
ci/build · performance (Performance-related issues) · ready (ONLY add when PR is ready to merge/full CI is needed)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants