[Bug] Fix Compressed Tensor NVFP4 `cutlass_fp4_group_mm` illegal memory access #21465
Conversation
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Code Review
This pull request addresses a critical "illegal memory access" bug in the NVFP4 MoE kernel. The root cause appears to be an optimization (`SWAP_AB`) that swaps matrix dimensions for performance, which is not supported on the FP4 path.
The fix is implemented in `csrc/quantization/cutlass_w8a8/moe/moe_data.cu` and involves:
- Introducing a `may_swap_ab` flag that explicitly disables the `SWAP_AB` optimization when using the FP4 path (identified by the presence of `blockscale_offsets`).
- Refactoring the CUDA kernels (`compute_expert_offsets`, `compute_expert_blockscale_offsets`) to accept this explicit boolean flag instead of inferring the logic from `topk_length`.
The changes are logical, well-contained, and directly address the reported crash. The refactoring also improves code clarity by making the dependency on the `SWAP_AB` optimization explicit. The fix appears correct and robust.
bool may_swap_ab = (!blockscale_offsets.has_value()) &&
                   (topk_ids.numel() <= SWAP_AB_THRESHOLD);
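For context, here is a minimal sketch of why the swap matters downstream. The function name and field layout are illustrative assumptions, not the PR's actual code:

```cpp
#include <cstdint>

// Illustrative only: with SWAP_AB, an expert's GEMM problem is recorded
// with M and N exchanged. A consumer that expects the plain (M, N, K)
// layout, as the FP4 grouped GEMM does, would then read the wrong
// dimensions, which can surface as an illegal memory access.
void write_problem_size(int32_t* problem_sizes, int expert_id, int32_t m,
                        int32_t n, int32_t k, bool swap_ab) {
  int32_t* p = problem_sizes + 3 * expert_id;
  p[0] = swap_ab ? n : m;  // swapped layout for the small-batch fast path
  p[1] = swap_ab ? m : n;
  p[2] = k;
}
```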
Would it make sense to rather add a boolean argument to `get_cutlass_moe_mm_data()` that forces no swap? It looks like disabling the swap will also be needed for the fp8 blockwise CUTLASS path, and that path doesn't pass `blockscale_offsets` to this function.
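A rough sketch of that suggestion, assuming a hypothetical `force_no_swap` parameter (the name, the simplified signature, and the threshold value below are all assumptions, not actual vLLM code):

```cpp
#include <cstdint>

constexpr int64_t SWAP_AB_THRESHOLD = 64;  // placeholder, not the real value

// Hypothetical helper: an explicit force_no_swap argument would let the
// fp8 blockwise path opt out of SWAP_AB even though it never passes
// blockscale_offsets to get_cutlass_moe_mm_data().
bool may_swap_ab(bool has_blockscale_offsets, int64_t topk_numel,
                 bool force_no_swap) {
  if (force_no_swap || has_blockscale_offsets) {
    return false;  // FP4 path, or the caller explicitly disabled the swap
  }
  return topk_numel <= SWAP_AB_THRESHOLD;  // small-batch fast path
}
```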
`run_cutlass_moe_fp8` or `run_cutlass_block_scaled_fused_experts`, which path are you talking about?
I don't have enough context, so I am thinking we can do that in a follow-up PR.
I mean a `get_cutlass_moe_mm_data()` call in `run_cutlass_block_scaled_fused_experts()` :) But I can add that change to a separate PR.
Can you check if this fails with modelopt fp4 as well, since it should use the same kernel?
Purpose
Fixes #21399
Test
lm_eval --model vllm --model_args "pretrained=nm-testing/Qwen3-30B-A3B-NVFP4,max_model_len=32768,enable_expert_parallel=True,enforce_eager=True" --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
Origin:
Now: