[Bugfix] Fix a couple PPLX+CUTLASS MoE bugs #20825


Merged: 2 commits into vllm-project:main on Jul 13, 2025

Conversation

@ElizaWszola (Contributor) commented on Jul 11, 2025

Fix two bugs that prevented CUTLASS MoE from running with PPLX:

  1. PPLX's All-to-All dispatch and combine functions require topk_ids to have the type uint32. Since we never expect elements of topk_ids to be negative when running with PPLX, this can be resolved by reinterpret-casting the current signed int32 type to unsigned (see the sketch after this list).
  2. CUTLASS MoE can be run from compressed_tensors.py either with PPLX or without it. If we run with PPLX, we end up running CutlassExpertsFp8's forward function, which takes different arguments than a straightforward non-PPLX cutlass_moe_fp8 call. On the current main, this results in errors when running with PPLX. Fixing this issue also makes it easier to bring back the pre-computed strides from PR #20762.
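For illustration, a minimal sketch of the reinterpret cast from point 1 (the tensor values are made up, and torch.uint32 requires a recent PyTorch; this is not the exact code in the PR):

import torch

# Reinterpret the bits of the int32 indices as uint32. No copy or value
# conversion happens; since the indices are never negative, they keep the
# same values when read as uint32.
topk_ids = torch.tensor([[0, 3], [7, 1]], dtype=torch.int32)
topk_ids_u32 = topk_ids.view(dtype=torch.uint32)  # no copy, same storage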

Testing:

PPLX run:

export MASTER_ADDR=127.0.0.1
export MASTER_PORT=29500
export VLLM_ALL2ALL_BACKEND=pplx
python3 examples/offline_inference/data_parallel.py \
        --model="nm-testing/DeepSeek-Coder-V2-Lite-Instruct-FP8" \
        --dp-size=2 \
        --tp-size=1 \
        --trust-remote-code

Non-PPLX run: run inference with LLM(model="nm-testing/DeepSeek-Coder-V2-Lite-Instruct-FP8")
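For reference, a hedged sketch of such a non-PPLX run (the prompt and sampling parameters are illustrative and not taken from the PR; trust_remote_code mirrors the flag used in the PPLX command above):

from vllm import LLM, SamplingParams

# Default all-to-all backend (VLLM_ALL2ALL_BACKEND unset), so the plain
# cutlass_moe_fp8 path is exercised.
llm = LLM(model="nm-testing/DeepSeek-Coder-V2-Lite-Instruct-FP8",
          trust_remote_code=True)
outputs = llm.generate(["Write a quicksort function in Python."],
                       SamplingParams(temperature=0.0, max_tokens=128))
print(outputs[0].outputs[0].text)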

Signed-off-by: ElizaWszola <ewszola@redhat.com>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist bot left a comment

Summary of Changes

Hello @ElizaWszola, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses critical compatibility issues between PPLX (the pplx-kernels all-to-all backend) and the CUTLASS Mixture-of-Experts (MoE) implementation. The changes primarily focus on resolving data type mismatches for topk_ids and establishing a more robust mechanism for handling different MoE kernel implementations, ensuring seamless operation for FP8 MoE models within the PPLX framework.

Highlights

  • Type Assertion for topk_ids: Ensured topk_ids are correctly cast to torch.int32 for CUTLASS MoE operations and torch.uint32 for PPLX prepare and combine functions, resolving type assertion errors.
  • Flexible MoE Kernel Execution: Introduced a conditional execution path in CompressedTensorsMoEMethod.apply. It now checks if a modular kernel is explicitly provided; if not, it defaults to dynamically importing and utilizing the cutlass_moe_fp8 function, allowing for both modular and straightforward MoE kernel calls.
  • Input Scale Parameter Adjustment: Adjusted the a1_scale and a2_scale parameters passed to the modular kernel path to use layer.w13_input_scale and layer.w2_input_scale respectively, aligning with the expected input scales for the modular kernel.

@gemini-code-assist bot left a comment

Code Review

This pull request fixes bugs for CUTLASS MoE with PPLX, including casting topk_ids to the correct integer types and creating separate code paths for modular and non-modular kernels. The review identified a critical issue where torch.Tensor.view(dtype=...) is used for type casting, which leads to corrupted indices. The suggestion is to use .to(dtype=...) instead. A refactoring is also suggested to reduce code duplication in compressed_tensors_moe.py.

@@ -320,6 +320,7 @@ def apply(

activation_callable = lambda o, i: self.activation(activation, o, i)

topk_ids = topk_ids.view(dtype=torch.int32)

gemini-code-assist bot commented (critical):

Using view(dtype=...) to change the data type of a tensor is incorrect for casting numerical values. It reinterprets the underlying bytes of the tensor, which will lead to incorrect indices. To correctly cast the tensor while preserving its values, you should use the .to() method.

topk_ids = topk_ids.to(dtype=torch.int32)
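
To make the distinction concrete, a small illustration (not part of the PR) of how the two methods differ. Note that for the int32-to-uint32 cast in this PR the indices are known to be non-negative, so the bit-level reinterpretation preserves the values, which is why the PR description calls it a reinterpret cast.

import torch

t = torch.tensor([-1, 2, 3], dtype=torch.int32)
# .view(dtype=...) reinterprets the raw bytes: -1 (0xFFFFFFFF) reads as NaN
# in float32, while 2 and 3 become tiny denormals.
print(t.view(dtype=torch.float32))
# .to(dtype=...) converts the values: tensor([-1., 2., 3.])
print(t.to(dtype=torch.float32))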

Signed-off-by: ElizaWszola <ewszola@redhat.com>

@tlrmchlsmth (Collaborator) left a comment

The changes to pplx_prepare_finalize make sense. Could you explain the changes to compressed_tensors_moe.py? And please update the PR description with a thorough explanation of what's going on

@ElizaWszola (Contributor, Author) commented:

@tlrmchlsmth I've updated the PR description to explain both fixes.

@ElizaWszola changed the title from "[Bugfix] Fix a couple PPLX+CUTLASS bugs" to "[Bugfix] Fix a couple PPLX+CUTLASS MoE bugs" on Jul 11, 2025
self.topk_indices_dtype = None
- self.fused_experts = cutlass_moe_fp8 # type: ignore
+ self.fused_experts = None # type: ignore

A Collaborator commented:
So how does this get set now?

@ElizaWszola (Author) replied:

It is set in the init_prepare_finalize() method in layer.py:

self.fused_experts = FusedMoEModularKernel(
                prepare_finalize,
                experts,
            )

This function is called for non-EP parallel runs. If it's never called, self.fused_experts is never set and the condition in CompressedTensorsW8A8Fp8MoECutlassMethod's apply() function results in importing and calling cutlass_moe_fp8().

Before this PR, init_prepare_finalize() would overwrite an existing cutlass_moe_fp8() function, and CompressedTensorsW8A8Fp8MoECutlassMethod's apply() would call whatever self.fused_experts was at the time of the call. It was convenient to do so because cutlass_moe_fp8() and FusedMoEModularKernel's experts.apply() were called with the same arguments. This changed in one of the recent PRs, resulting in errors in PPLX runs, so an if-else condition is now needed to decide which arguments self.fused_experts should be called with.
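
A simplified, hypothetical sketch of the control flow described above; the class and function names mirror the vLLM ones, but this is not the actual implementation:

class CutlassMoEMethodSketch:
    def __init__(self):
        # Stays None unless init_prepare_finalize() installs a modular kernel.
        self.fused_experts = None

    def init_prepare_finalize(self, prepare_finalize, experts):
        # In vLLM this builds FusedMoEModularKernel(prepare_finalize, experts);
        # a plain closure keeps the sketch self-contained.
        self.fused_experts = lambda hidden: experts(prepare_finalize(hidden))

    def apply(self, hidden):
        if self.fused_experts is not None:
            # Modular-kernel path (e.g. PPLX), with its own call signature.
            return self.fused_experts(hidden)
        # Fallback path: call the plain CUTLASS kernel (a stub here).
        return cutlass_moe_fp8_stub(hidden)


def cutlass_moe_fp8_stub(hidden):
    # Stand-in for vllm's cutlass_moe_fp8(); the real call takes many more args.
    return hidden


method = CutlassMoEMethodSketch()
method.apply("activations")  # no modular kernel installed: cutlass_moe_fp8 path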

A Member commented:

We should leave a comment for this tbh, as it is difficult to know otherwise.

A Collaborator replied:

+1, and we should revisit this as well - we need to keep the control flow as simple as possible in the MoE layers given how complicated they are.

@mgoin added the bug (Something isn't working) and ready (ONLY add when PR is ready to merge/full CI is needed) labels on Jul 12, 2025
@vllm-bot vllm-bot merged commit 3b3b778 into vllm-project:main Jul 13, 2025
81 of 83 checks passed