[Bug] Fix Compressed Tensor NVFP4 cutlass_fp4_group_mm illegal memory access #21465


Conversation

@yewentao256 (Contributor) commented on Jul 23, 2025:

Purpose

Fixes #21399

Test

lm_eval --model vllm --model_args "pretrained=nm-testing/Qwen3-30B-A3B-NVFP4,max_model_len=32768,enable_expert_parallel=True,enforce_eager=True" --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto

Before the fix:

  File "/home/wentao/vllm-source/vllm/model_executor/layers/fused_moe/layer.py", line 1489, in forward_impl
    final_hidden_states = self.quant_method.apply(
                          ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/wentao/vllm-source/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py", line 340, in apply
    return cutlass_moe_fp4(
           ^^^^^^^^^^^^^^^^
  File "/home/wentao/vllm-source/vllm/model_executor/layers/fused_moe/cutlass_moe.py", line 719, in cutlass_moe_fp4
    return fn(
           ^^^
  File "/home/wentao/.wentao_env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/wentao/.wentao_env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/wentao/vllm-source/vllm/model_executor/layers/fused_moe/modular_kernel.py", line 770, in forward
    fused_out = self._maybe_chunk_fused_experts(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/wentao/vllm-source/vllm/model_executor/layers/fused_moe/modular_kernel.py", line 545, in _maybe_chunk_fused_experts
    return self._do_fused_experts(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/wentao/vllm-source/vllm/model_executor/layers/fused_moe/modular_kernel.py", line 492, in _do_fused_experts
    self.fused_experts.apply(
  File "/home/wentao/vllm-source/vllm/model_executor/layers/fused_moe/cutlass_moe.py", line 640, in apply
    run_cutlass_moe_fp4(
  File "/home/wentao/vllm-source/vllm/model_executor/layers/fused_moe/cutlass_moe.py", line 527, in run_cutlass_moe_fp4
    ops.cutlass_fp4_moe_mm(c1, rep_a_fp4, w1_fp4, rep_a_blockscale,
  File "/home/wentao/vllm-source/vllm/_custom_ops.py", line 980, in cutlass_fp4_moe_mm
    return torch.ops._C.cutlass_fp4_group_mm(out_tensors, a_tensors, b_tensors,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/wentao/.wentao_env/lib/python3.12/site-packages/torch/_ops.py", line 1158, in __call__
    return self._op(*args, **(kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered

After the fix:

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|-----|------:|----------------|-----:|-----------|----:|------:|
|gsm8k|3|flexible-extract|5|exact_match|0.887|±0.0087|
|gsm8k|3|strict-match|5|exact_match|0.884|±0.0088|

Signed-off-by: yewentao256 <zhyanwentao@126.com>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist bot left a comment:


Code Review

This pull request addresses a critical "illegal memory access" bug in the NVFP4 MoE kernel. The root cause appears to be an optimization (SWAP_AB) that swaps matrix dimensions for performance, which is not supported on the FP4 path.

The fix is implemented in csrc/quantization/cutlass_w8a8/moe/moe_data.cu and involves:

  1. Introducing a may_swap_ab flag that explicitly disables the SWAP_AB optimization when using the FP4 path (identified by the presence of blockscale_offsets).
  2. Refactoring the CUDA kernels (compute_expert_offsets, compute_expert_blockscale_offsets) to accept this explicit boolean flag instead of inferring the logic from topk_length.

The changes are logical, well-contained, and directly address the reported crash. The refactoring also improves code clarity by making the dependency on the SWAP_AB optimization explicit. The fix appears correct and robust.
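
To make the gating logic concrete, here is a minimal, self-contained sketch of the decision described above. The constant value, function shape, and parameter names are illustrative only; the actual code lives in csrc/quantization/cutlass_w8a8/moe/moe_data.cu and operates on torch tensors rather than plain integers.

```cpp
// Minimal sketch of the SWAP_AB gating described in the review above.
// Assumptions: the SWAP_AB_THRESHOLD value and this function's signature are
// illustrative, not the real moe_data.cu code.
#include <cstdint>
#include <iostream>

constexpr int64_t SWAP_AB_THRESHOLD = 64;  // placeholder value

// The FP4 path is identified by the presence of blockscale offsets; it must
// never use the swapped (SWAP_AB) problem layout. Other paths may still swap
// when the token*topk count is small enough to benefit.
bool may_swap_ab(bool has_blockscale_offsets, int64_t topk_ids_numel) {
  return !has_blockscale_offsets && (topk_ids_numel <= SWAP_AB_THRESHOLD);
}

int main() {
  std::cout << may_swap_ab(/*has_blockscale_offsets=*/false, 32) << "\n";  // 1: no blockscales, small batch
  std::cout << may_swap_ab(/*has_blockscale_offsets=*/true, 32) << "\n";   // 0: FP4 path never swaps
}
```

Per the review, the kernels compute_expert_offsets and compute_expert_blockscale_offsets then receive this boolean directly instead of re-deriving the swap decision from topk_length, which is what removes the illegal memory access on the FP4 path.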

Comment on lines +121 to +122:

    bool may_swap_ab = (!blockscale_offsets.has_value()) &&
                       (topk_ids.numel() <= SWAP_AB_THRESHOLD);
@ElizaWszola (Contributor) commented:

Would it make sense to instead add a boolean argument to get_cutlass_moe_mm_data() that forces no swap? It looks like disabling the swap will also be needed for the fp8 blockwise CUTLASS path, and that path doesn't pass blockscale_offsets to this function.

@yewentao256 (Contributor, Author) replied:

Which path are you talking about: run_cutlass_moe_fp8 or run_cutlass_block_scaled_fused_experts?
I don't have enough context here, so I'm thinking we can do that in a follow-up PR.

@ElizaWszola (Contributor) commented on Jul 24, 2025:

I mean a get_cutlass_moe_mm_data() call in run_cutlass_block_scaled_fused_experts() :) But I can add that change to a separate PR
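
For readers following this thread, a hypothetical sketch of that suggested alternative is shown below. The function signature, parameter names, and threshold value are illustrative assumptions, not vLLM's actual API.

```cpp
// Hypothetical sketch of the alternative discussed above: an explicit
// force_no_swap_ab argument on the mm-data entry point, so callers such as the
// fp8 blockwise path (which has no blockscale_offsets) can also opt out of
// SWAP_AB. All names and the signature here are illustrative, not vLLM's API.
#include <cstdint>
#include <optional>
#include <torch/torch.h>

constexpr int64_t SWAP_AB_THRESHOLD = 64;  // placeholder value

void get_cutlass_moe_mm_data_sketch(
    const torch::Tensor& topk_ids,
    const std::optional<torch::Tensor>& blockscale_offsets,
    bool force_no_swap_ab) {  // new argument: callers that cannot swap pass true
  const bool may_swap_ab = !force_no_swap_ab &&
                           !blockscale_offsets.has_value() &&
                           (topk_ids.numel() <= SWAP_AB_THRESHOLD);
  // ...the expert-offset kernels would then receive may_swap_ab explicitly,
  // just as in the merged fix.
  (void)may_swap_ab;
}
```

Either approach keeps the swap decision on the host side; an explicit flag would simply make the intent visible at every call site.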

@mgoin (Member) commented on Jul 23, 2025:

Can you check if this fails with modelopt fp4 as well since it should use the same kernel?

@yewentao256 (Contributor, Author) replied:

> Can you check if this fails with modelopt fp4 as well since it should use the same kernel?

nvidia/DeepSeek-R1-0528-FP4 does not hit the same issue.

@mgoin added the bug (Something isn't working) and ready (ONLY add when PR is ready to merge/full CI is needed) labels on Jul 23, 2025.
@mgoin enabled auto-merge (squash) on Jul 24, 2025, 14:22.
@vllm-bot merged commit e8cb0d0 into vllm-project:main on Jul 24, 2025; 109 of 111 checks passed.
@yewentao256 deleted the wye-fix-cutlass_fp4_group_mm-illegal-memory-access branch on Jul 24, 2025, 15:43.
@mgoin mentioned this pull request on Jul 24, 2025.
Labels: bug, ready
Projects: None yet
Development: Successfully merging this pull request may close the issue "[Bug]: Compressed Tensor NVFP4 cutlass_fp4_group_mm illegal memory access".
4 participants