[Bugfix] Fix a couple PPLX+CUTLASS MoE bugs #20825
Conversation
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Summary of Changes
Hello @ElizaWszola, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request addresses critical compatibility issues between PPLX (Parallel Processing Layer eXecutor) and CUTLASS Mixture-of-Experts (MoE) implementations. The changes primarily focus on resolving data type mismatches for topk_ids and establishing a more robust mechanism for handling different MoE kernel implementations, ensuring seamless operation for FP8 MoE models within the PPLX framework.
Highlights
- Type Assertion for topk_ids: Ensured topk_ids are correctly cast to torch.int32 for CUTLASS MoE operations and torch.uint32 for PPLX prepare and combine functions, resolving type assertion errors.
- Flexible MoE Kernel Execution: Introduced a conditional execution path in CompressedTensorsMoEMethod.apply. It now checks if a modular kernel is explicitly provided; if not, it defaults to dynamically importing and utilizing the cutlass_moe_fp8 function, allowing for both modular and straightforward MoE kernel calls.
- Input Scale Parameter Adjustment: Adjusted the a1_scale and a2_scale parameters passed to the modular kernel path to use layer.w13_input_scale and layer.w2_input_scale respectively, aligning with the expected input scales for the modular kernel.
Code Review
This pull request fixes bugs for CUTLASS MoE with PPLX, including casting topk_ids to the correct integer types and creating separate code paths for modular and non-modular kernels. The review identified a critical issue where torch.Tensor.view(dtype=...) is used for type casting, which leads to corrupted indices. The suggestion is to use .to(dtype=...) instead. A refactoring is also suggested to reduce code duplication in compressed_tensors_moe.py.
@@ -320,6 +320,7 @@ def apply(

     activation_callable = lambda o, i: self.activation(activation, o, i)

+    topk_ids = topk_ids.view(dtype=torch.int32)
Using view(dtype=...) to change the data type of a tensor is incorrect for casting numerical values. It reinterprets the underlying bytes of the tensor, which will lead to incorrect indices. To correctly cast the tensor while preserving its values, you should use the .to() method.
topk_ids = topk_ids.to(dtype=torch.int32)
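For context, a tiny standalone demonstration (not from the PR) of the difference the reviewer is pointing at: .to(dtype=...) converts values, while .view(dtype=...) reinterprets the underlying bits.

    import torch

    # Hypothetical example, not vLLM code: value conversion vs. bit reinterpretation.
    x = torch.tensor([1.0, 2.0], dtype=torch.float32)

    print(x.to(dtype=torch.int32))    # tensor([1, 2], dtype=torch.int32) -> values converted
    print(x.view(dtype=torch.int32))  # tensor([1065353216, 1073741824], dtype=torch.int32) -> raw IEEE-754 bits

In this PR the reinterpretation appears to be deliberate: as the description further below explains, topk_ids are never negative under PPLX, so the int32 and uint32 bit patterns coincide and a zero-copy view is safe.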
Signed-off-by: ElizaWszola <ewszola@redhat.com>
The changes to pplx_prepare_finalize make sense. Could you explain the changes to compressed_tensors_moe.py? And please update the PR description with a thorough explanation of what's going on
@tlrmchlsmth I've updated the PR description to explain both fixes.
     self.topk_indices_dtype = None
-    self.fused_experts = cutlass_moe_fp8  # type: ignore
+    self.fused_experts = None  # type: ignore
So how does this get set now?
It is set in the init_prepare_finalize() method in layer.py:
    self.fused_experts = FusedMoEModularKernel(
        prepare_finalize,
        experts,
    )
This function is called for EP runs (e.g. with PPLX). If it's never called, self.fused_experts is never set and the condition in CompressedTensorsW8A8Fp8MoECutlassMethod's apply() function results in importing and calling cutlass_moe_fp8().
Before this PR, init_prepare_finalize() would overwrite an existing cutlass_moe_fp8() function, and CompressedTensorsW8A8Fp8MoECutlassMethod's apply() would call whatever self.fused_experts was at the time of the call. That was convenient because cutlass_moe_fp8() and FusedMoEModularKernel's experts.apply() were called with the same arguments. This changed in one of the recent PRs, resulting in errors in PPLX runs, so an if-else condition is now required to decide which arguments self.fused_experts should be called with.
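A rough, standalone sketch of the dispatch being described (toy stubs, not the real vLLM signatures): apply() uses the modular kernel if init_prepare_finalize() installed one, and otherwise falls back to a direct cutlass_moe_fp8() call.

    # Toy sketch of the control flow; names mirror the discussion above but the
    # signatures are illustrative only.
    def cutlass_moe_fp8(hidden_states, topk_ids):
        # Stand-in for the non-modular CUTLASS MoE entry point.
        return f"cutlass_moe_fp8 over {len(topk_ids)} routed tokens"

    class CutlassMoEMethodSketch:
        def __init__(self):
            # Stays None unless init_prepare_finalize() builds a modular kernel
            # (e.g. a FusedMoEModularKernel wrapping PPLX prepare/finalize).
            self.fused_experts = None

        def apply(self, hidden_states, topk_ids):
            if self.fused_experts is not None:
                # PPLX / modular-kernel path: different argument conventions.
                return self.fused_experts(hidden_states, topk_ids)
            # Non-PPLX path: call CUTLASS MoE directly.
            return cutlass_moe_fp8(hidden_states, topk_ids)

    method = CutlassMoEMethodSketch()
    print(method.apply([0.1, 0.2], [3, 1]))                # cutlass_moe_fp8 path
    method.fused_experts = lambda h, t: "modular kernel path"
    print(method.apply([0.1, 0.2], [3, 1]))                # modular kernel path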
We should leave a comment for this tbh as it is difficult to know
+1, and we should revisit this as well - we need to keep the control flow as simple as possible in the MoE layers given how complicated they are.
Fix two bugs that prevented CUTLASS MoE from running with PPLX:
1. PPLX's prepare and combine functions expect topk_ids to have the type uint32. Since we never expect elements of topk_ids to be negative when running with PPLX, this can be resolved by reinterpret-casting the current signed int32 type to unsigned.
2. The CUTLASS MoE code in compressed_tensors.py can run either with PPLX or without it. If we run with PPLX, we end up running CutlassExpertsFp8's forward function, which takes different arguments than a straightforward non-PPLX cutlass_moe_fp8 call. On the current main, this results in errors when running with PPLX. Fixing this issue also makes it easier to bring back pre-computed strides in PR #20762.
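As a small, hypothetical illustration of the first fix (not the PR's code), reinterpreting the bits of a non-negative int32 index buffer as uint32 is value-preserving, so a zero-copy view suffices. This assumes a PyTorch build that exposes torch.uint32 (2.3 or newer).

    import torch

    # Hypothetical illustration of the reinterpret cast: for non-negative values,
    # int32 and uint32 share the same bit pattern, so a view (not a conversion)
    # keeps the indices intact. Assumes torch.uint32 is available (PyTorch >= 2.3).
    topk_ids = torch.tensor([0, 3, 7, 42], dtype=torch.int32)

    as_uint32 = topk_ids.view(dtype=torch.uint32)   # layout PPLX prepare/combine expect
    back = as_uint32.view(dtype=torch.int32)        # layout the CUTLASS kernels expect

    assert torch.equal(topk_ids, back)              # zero-copy round trip, values unchanged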
Testing:
- PPLX run:
- Non-PPLX run: run inference with LLM(model="nm-testing/DeepSeek-Coder-V2-Lite-Instruct-FP8")
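For reference, a minimal offline-inference smoke test for the non-PPLX path might look like this (prompt and sampling settings are arbitrary, not from the PR):

    from vllm import LLM, SamplingParams

    # Minimal smoke test of the non-PPLX CUTLASS MoE path via the offline LLM API.
    llm = LLM(model="nm-testing/DeepSeek-Coder-V2-Lite-Instruct-FP8")
    outputs = llm.generate(["Write a haiku about mixture-of-experts kernels."],
                           SamplingParams(max_tokens=64))
    print(outputs[0].outputs[0].text)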