[Bug] Fix Compressed Tensor NVFP4 cutlass_fp4_group_mm illegal memory access #21465


Conversation

@yewentao256 (Contributor) commented on Jul 23, 2025:

Purpose

Fixes #21399

Test

lm_eval --model vllm --model_args "pretrained=nm-testing/Qwen3-30B-A3B-NVFP4,max_model_len=32768,enable_expert_parallel=True,enforce_eager=True" --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto

Before the fix:

  File "/home/wentao/vllm-source/vllm/model_executor/layers/fused_moe/layer.py", line 1489, in forward_impl
    final_hidden_states = self.quant_method.apply(
                          ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/wentao/vllm-source/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py", line 340, in apply
    return cutlass_moe_fp4(
           ^^^^^^^^^^^^^^^^
  File "/home/wentao/vllm-source/vllm/model_executor/layers/fused_moe/cutlass_moe.py", line 719, in cutlass_moe_fp4
    return fn(
           ^^^
  File "/home/wentao/.wentao_env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/wentao/.wentao_env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/wentao/vllm-source/vllm/model_executor/layers/fused_moe/modular_kernel.py", line 770, in forward
    fused_out = self._maybe_chunk_fused_experts(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/wentao/vllm-source/vllm/model_executor/layers/fused_moe/modular_kernel.py", line 545, in _maybe_chunk_fused_experts
    return self._do_fused_experts(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/wentao/vllm-source/vllm/model_executor/layers/fused_moe/modular_kernel.py", line 492, in _do_fused_experts
    self.fused_experts.apply(
  File "/home/wentao/vllm-source/vllm/model_executor/layers/fused_moe/cutlass_moe.py", line 640, in apply
    run_cutlass_moe_fp4(
  File "/home/wentao/vllm-source/vllm/model_executor/layers/fused_moe/cutlass_moe.py", line 527, in run_cutlass_moe_fp4
    ops.cutlass_fp4_moe_mm(c1, rep_a_fp4, w1_fp4, rep_a_blockscale,
  File "/home/wentao/vllm-source/vllm/_custom_ops.py", line 980, in cutlass_fp4_moe_mm
    return torch.ops._C.cutlass_fp4_group_mm(out_tensors, a_tensors, b_tensors,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/wentao/.wentao_env/lib/python3.12/site-packages/torch/_ops.py", line 1158, in __call__
    return self._op(*args, **(kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered

After the fix:

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|-----|------:|----------------|-----:|-----------|----:|------:|
|gsm8k|3|flexible-extract|5|exact_match|0.887|±0.0087|
|gsm8k|3|strict-match|5|exact_match|0.884|±0.0088|

Signed-off-by: yewentao256 <zhyanwentao@126.com>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist bot left a comment:


Code Review

This pull request addresses a critical "illegal memory access" bug in the NVFP4 MoE kernel. The root cause appears to be an optimization (SWAP_AB) that swaps matrix dimensions for performance, which is not supported on the FP4 path.

The fix is implemented in csrc/quantization/cutlass_w8a8/moe/moe_data.cu and involves:

  1. Introducing a may_swap_ab flag that explicitly disables the SWAP_AB optimization when using the FP4 path (identified by the presence of blockscale_offsets).
  2. Refactoring the CUDA kernels (compute_expert_offsets, compute_expert_blockscale_offsets) to accept this explicit boolean flag instead of inferring the logic from topk_length.

The changes are logical, well-contained, and directly address the reported crash. The refactoring also improves code clarity by making the dependency on the SWAP_AB optimization explicit. The fix appears correct and robust.
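
To make the gating logic concrete, here is a minimal, self-contained sketch of the decision described above. The constant value, function shape, and parameter names are illustrative only; the actual code lives in csrc/quantization/cutlass_w8a8/moe/moe_data.cu and operates on torch tensors rather than plain integers.

```cpp
// Minimal sketch of the SWAP_AB gating described in the review above.
// Assumptions: the SWAP_AB_THRESHOLD value and this function's signature are
// illustrative, not the real moe_data.cu code.
#include <cstdint>
#include <iostream>

constexpr int64_t SWAP_AB_THRESHOLD = 64;  // placeholder value

// The FP4 path is identified by the presence of blockscale offsets; it must
// never use the swapped (SWAP_AB) problem layout. Other paths may still swap
// when the token*topk count is small enough to benefit.
bool may_swap_ab(bool has_blockscale_offsets, int64_t topk_ids_numel) {
  return !has_blockscale_offsets && (topk_ids_numel <= SWAP_AB_THRESHOLD);
}

int main() {
  std::cout << may_swap_ab(/*has_blockscale_offsets=*/false, 32) << "\n";  // 1: no blockscales, small batch
  std::cout << may_swap_ab(/*has_blockscale_offsets=*/true, 32) << "\n";   // 0: FP4 path never swaps
}
```

Per the review, the kernels compute_expert_offsets and compute_expert_blockscale_offsets then receive this boolean directly instead of re-deriving the swap decision from topk_length, which is what removes the illegal memory access on the FP4 path.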

Comment on lines +121 to +122:

    bool may_swap_ab = (!blockscale_offsets.has_value()) &&
                       (topk_ids.numel() <= SWAP_AB_THRESHOLD);
@ElizaWszola (Contributor) commented:

Would it make sense to instead add a boolean argument to get_cutlass_moe_mm_data() that forces no swap? It looks like disabling the swap will also be needed for the fp8 blockwise CUTLASS path, and that path doesn't pass blockscale_offsets to this function.

@yewentao256 (Contributor, Author) replied:

Which path are you talking about: run_cutlass_moe_fp8 or run_cutlass_block_scaled_fused_experts?
I don't have enough context here, so I'm thinking we can do that in a follow-up PR.

@ElizaWszola (Contributor) commented on Jul 24, 2025:

I mean a get_cutlass_moe_mm_data() call in run_cutlass_block_scaled_fused_experts() :) But I can add that change to a separate PR
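
For readers following this thread, a hypothetical sketch of that suggested alternative is shown below. The function signature, parameter names, and threshold value are illustrative assumptions, not vLLM's actual API.

```cpp
// Hypothetical sketch of the alternative discussed above: an explicit
// force_no_swap_ab argument on the mm-data entry point, so callers such as the
// fp8 blockwise path (which has no blockscale_offsets) can also opt out of
// SWAP_AB. All names and the signature here are illustrative, not vLLM's API.
#include <cstdint>
#include <optional>
#include <torch/torch.h>

constexpr int64_t SWAP_AB_THRESHOLD = 64;  // placeholder value

void get_cutlass_moe_mm_data_sketch(
    const torch::Tensor& topk_ids,
    const std::optional<torch::Tensor>& blockscale_offsets,
    bool force_no_swap_ab) {  // new argument: callers that cannot swap pass true
  const bool may_swap_ab = !force_no_swap_ab &&
                           !blockscale_offsets.has_value() &&
                           (topk_ids.numel() <= SWAP_AB_THRESHOLD);
  // ...the expert-offset kernels would then receive may_swap_ab explicitly,
  // just as in the merged fix.
  (void)may_swap_ab;
}
```

Either approach keeps the swap decision on the host side; an explicit flag would simply make the intent visible at every call site.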

@mgoin (Member) commented on Jul 23, 2025:

Can you check if this fails with modelopt fp4 as well since it should use the same kernel?

@yewentao256 (Contributor, Author) replied:

> Can you check if this fails with modelopt fp4 as well since it should use the same kernel?

nvidia/DeepSeek-R1-0528-FP4 does not hit the same issue.

@mgoin added the bug (Something isn't working) and ready (ONLY add when PR is ready to merge/full CI is needed) labels on Jul 23, 2025.
@mgoin enabled auto-merge (squash) on Jul 24, 2025, 14:22.
@vllm-bot merged commit e8cb0d0 into vllm-project:main on Jul 24, 2025; 109 of 111 checks passed.
@yewentao256 deleted the wye-fix-cutlass_fp4_group_mm-illegal-memory-access branch on Jul 24, 2025, 15:43.
@mgoin mentioned this pull request on Jul 24, 2025.
Labels: bug, ready
Projects: None yet
Development: Successfully merging this pull request may close the issue "[Bug]: Compressed Tensor NVFP4 cutlass_fp4_group_mm illegal memory access".
4 participants