[Bugfix] Fix a couple PPLX+CUTLASS MoE bugs #20825
Conversation
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Summary of Changes
Hello @ElizaWszola, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request addresses critical compatibility issues between PPLX (Parallel Processing Layer eXecutor) and CUTLASS Mixture-of-Experts (MoE) implementations. The changes primarily focus on resolving data type mismatches for topk_ids and establishing a more robust mechanism for handling different MoE kernel implementations, ensuring seamless operation for FP8 MoE models within the PPLX framework.
Highlights
- Type Assertion for topk_ids: Ensured topk_ids are correctly cast to torch.int32 for CUTLASS MoE operations and torch.uint32 for PPLX prepare and combine functions, resolving type assertion errors.
- Flexible MoE Kernel Execution: Introduced a conditional execution path in CompressedTensorsMoEMethod.apply. It now checks if a modular kernel is explicitly provided; if not, it defaults to dynamically importing and utilizing the cutlass_moe_fp8 function, allowing for both modular and straightforward MoE kernel calls.
- Input Scale Parameter Adjustment: Adjusted the a1_scale and a2_scale parameters passed to the modular kernel path to use layer.w13_input_scale and layer.w2_input_scale respectively, aligning with the expected input scales for the modular kernel.
Code Review
This pull request fixes bugs for CUTLASS MoE with PPLX, including casting topk_ids to the correct integer types and creating separate code paths for modular and non-modular kernels. The review identified a critical issue where torch.Tensor.view(dtype=...) is used for type casting, which leads to corrupted indices. The suggestion is to use .to(dtype=...) instead. A refactoring is also suggested to reduce code duplication in compressed_tensors_moe.py.
@@ -320,6 +320,7 @@ def apply(

     activation_callable = lambda o, i: self.activation(activation, o, i)

+    topk_ids = topk_ids.view(dtype=torch.int32)
Using view(dtype=...) to change the data type of a tensor is incorrect for casting numerical values. It reinterprets the underlying bytes of the tensor, which will lead to incorrect indices. To correctly cast the tensor while preserving its values, you should use the .to() method.
topk_ids = topk_ids.to(dtype=torch.int32)
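For context, a tiny standalone demonstration (not from the PR) of the difference the reviewer is pointing at: .to(dtype=...) converts values, while .view(dtype=...) reinterprets the underlying bits.

    import torch

    # Hypothetical example, not vLLM code: value conversion vs. bit reinterpretation.
    x = torch.tensor([1.0, 2.0], dtype=torch.float32)

    print(x.to(dtype=torch.int32))    # tensor([1, 2], dtype=torch.int32) -> values converted
    print(x.view(dtype=torch.int32))  # tensor([1065353216, 1073741824], dtype=torch.int32) -> raw IEEE-754 bits

In this PR the reinterpretation appears to be deliberate: as the description further below explains, topk_ids are never negative under PPLX, so the int32 and uint32 bit patterns coincide and a zero-copy view is safe.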
Signed-off-by: ElizaWszola <ewszola@redhat.com>
The changes to pplx_prepare_finalize make sense. Could you explain the changes to compressed_tensors_moe.py? And please update the PR description with a thorough explanation of what's going on
@tlrmchlsmth I've updated the PR description to explain both fixes.
     self.topk_indices_dtype = None
-    self.fused_experts = cutlass_moe_fp8  # type: ignore
+    self.fused_experts = None  # type: ignore
So how does this get set now?
It is set in the init_prepare_finalize() method in layer.py:
    self.fused_experts = FusedMoEModularKernel(
        prepare_finalize,
        experts,
    )
This function is called for EP runs (e.g. with PPLX). If it's never called, self.fused_experts is never set and the condition in CompressedTensorsW8A8Fp8MoECutlassMethod's apply() function results in importing and calling cutlass_moe_fp8().
Before this PR, init_prepare_finalize() would overwrite an existing cutlass_moe_fp8() function, and CompressedTensorsW8A8Fp8MoECutlassMethod's apply() would call whatever self.fused_experts was at the time of the call. That was convenient because cutlass_moe_fp8() and FusedMoEModularKernel's experts.apply() were called with the same arguments. This changed in one of the recent PRs, resulting in errors in PPLX runs, so an if-else condition is now required to decide which arguments self.fused_experts should be called with.
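A rough, standalone sketch of the dispatch being described (toy stubs, not the real vLLM signatures): apply() uses the modular kernel if init_prepare_finalize() installed one, and otherwise falls back to a direct cutlass_moe_fp8() call.

    # Toy sketch of the control flow; names mirror the discussion above but the
    # signatures are illustrative only.
    def cutlass_moe_fp8(hidden_states, topk_ids):
        # Stand-in for the non-modular CUTLASS MoE entry point.
        return f"cutlass_moe_fp8 over {len(topk_ids)} routed tokens"

    class CutlassMoEMethodSketch:
        def __init__(self):
            # Stays None unless init_prepare_finalize() builds a modular kernel
            # (e.g. a FusedMoEModularKernel wrapping PPLX prepare/finalize).
            self.fused_experts = None

        def apply(self, hidden_states, topk_ids):
            if self.fused_experts is not None:
                # PPLX / modular-kernel path: different argument conventions.
                return self.fused_experts(hidden_states, topk_ids)
            # Non-PPLX path: call CUTLASS MoE directly.
            return cutlass_moe_fp8(hidden_states, topk_ids)

    method = CutlassMoEMethodSketch()
    print(method.apply([0.1, 0.2], [3, 1]))                # cutlass_moe_fp8 path
    method.fused_experts = lambda h, t: "modular kernel path"
    print(method.apply([0.1, 0.2], [3, 1]))                # modular kernel path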
We should leave a comment for this tbh as it is difficult to know
+1, and we should revisit this as well - we need to keep the control flow as simple as possible in the MoE layers given how complicated they are.
Fix two bugs that prevented CUTLASS MoE from running with PPLX:
1. PPLX's prepare and combine functions expect topk_ids to have the type uint32. Since we never expect elements of topk_ids to be negative when running with PPLX, this can be resolved by reinterpret-casting the current signed int32 type to unsigned.
2. The CUTLASS MoE code in compressed_tensors.py can run either with PPLX or without it. If we run with PPLX, we end up running CutlassExpertsFp8's forward function, which takes different arguments than a straightforward non-PPLX cutlass_moe_fp8 call. On the current main, this results in errors when running with PPLX. Fixing this issue also makes it easier to bring back pre-computed strides in PR #20762.
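As a small, hypothetical illustration of the first fix (not the PR's code), reinterpreting the bits of a non-negative int32 index buffer as uint32 is value-preserving, so a zero-copy view suffices. This assumes a PyTorch build that exposes torch.uint32 (2.3 or newer).

    import torch

    # Hypothetical illustration of the reinterpret cast: for non-negative values,
    # int32 and uint32 share the same bit pattern, so a view (not a conversion)
    # keeps the indices intact. Assumes torch.uint32 is available (PyTorch >= 2.3).
    topk_ids = torch.tensor([0, 3, 7, 42], dtype=torch.int32)

    as_uint32 = topk_ids.view(dtype=torch.uint32)   # layout PPLX prepare/combine expect
    back = as_uint32.view(dtype=torch.int32)        # layout the CUTLASS kernels expect

    assert torch.equal(topk_ids, back)              # zero-copy round trip, values unchanged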
Testing:
- PPLX run:
- Non-PPLX run: run inference with LLM(model="nm-testing/DeepSeek-Coder-V2-Lite-Instruct-FP8")
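For reference, a minimal offline-inference smoke test for the non-PPLX path might look like this (prompt and sampling settings are arbitrary, not from the PR):

    from vllm import LLM, SamplingParams

    # Minimal smoke test of the non-PPLX CUTLASS MoE path via the offline LLM API.
    llm = LLM(model="nm-testing/DeepSeek-Coder-V2-Lite-Instruct-FP8")
    outputs = llm.generate(["Write a haiku about mixture-of-experts kernels."],
                           SamplingParams(max_tokens=64))
    print(outputs[0].outputs[0].text)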