
DeepGEMM is not enabled on B200 when loading DeepSeek R1 #21472


Open
wants to merge 1 commit into base: main

Conversation

@smarterclayton (Author) commented Jul 23, 2025

Purpose

While testing HEAD against DeepSeek-R1 on 2 B200 nodes (DP=16), I was receiving the following despite setting VLLM_USE_DEEP_GEMM=1:

[fp8.py:520] DeepGemm not supported on the current platform.

Debugging showed that is_blackwell_deep_gemm_used can return False if no other deep_gemm.py utility method has been invoked first, because it does not explicitly call _lazy_init() to populate _per_block_cast_impl.

Update the function to always invoke _lazy_init(), consistent with other utility methods.
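For reference, the intended ordering looks roughly like the sketch below. This is illustrative only, not the exact diff in this PR: envs.VLLM_USE_DEEP_GEMM, has_deep_gemm(), _lazy_init() and _per_block_cast_impl are the names discussed above, and the small stubs (including the per_block_cast_to_fp8 attribute name) exist only so the snippet runs standalone.

```python
import importlib.util


class envs:  # stand-in for vllm.envs
    VLLM_USE_DEEP_GEMM = True


_per_block_cast_impl = None  # populated lazily, as in vllm/utils/deep_gemm.py


def has_deep_gemm() -> bool:
    # True when the deep_gemm package is importable.
    return importlib.util.find_spec("deep_gemm") is not None


def _lazy_init() -> None:
    # In vLLM this binds the lazily imported deep_gemm entry points; the
    # attribute name below is illustrative, hence the getattr default.
    global _per_block_cast_impl
    if _per_block_cast_impl is None and has_deep_gemm():
        import deep_gemm
        _per_block_cast_impl = getattr(deep_gemm, "per_block_cast_to_fp8", None)


def is_blackwell_deep_gemm_used() -> bool:
    # Check prerequisites first, then always run _lazy_init() so the result
    # does not depend on another utility method having been called earlier.
    if not (envs.VLLM_USE_DEEP_GEMM and has_deep_gemm()):
        return False
    _lazy_init()
    if _per_block_cast_impl is None:
        return False
    # The real function additionally verifies the GPU is SM100 (Blackwell).
    return True
```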

Test Plan

Verify startup on B200 in the environment.

Test Result

After this change was deployed, I see:

Using DeepGemm SM100 kernels for Fp8MoEMethod.

(Optional) Documentation Update


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of CI tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify bot added the deepseek (Related to DeepSeek models) label on Jul 23, 2025
@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request aims to fix an issue where DeepGEMM is not correctly detected on B200 GPUs due to a missing call to _lazy_init. The proposed change attempts to fix this but introduces new logical flaws and doesn't fully resolve the original issue. I've provided a suggestion to correct the logic to ensure _lazy_init is called under the right conditions, making the feature detection robust.

Comment on lines 24 to 27
if not (envs.VLLM_USE_DEEP_GEMM and has_deep_gemm()):
    _lazy_init()
    if _per_block_cast_impl is None:
        return False

critical

Thanks for fixing this initialization issue! However, the current logic seems to have a couple of flaws:

  1. If envs.VLLM_USE_DEEP_GEMM is False, the if condition on line 24 is met. The code then calls _lazy_init(), which may populate _per_block_cast_impl. If it does, the function will not return False here and will proceed to check the GPU properties, which is incorrect since the feature is disabled.
  2. If both envs.VLLM_USE_DEEP_GEMM and has_deep_gemm() are True, the if condition is False, so _lazy_init() is never called. This means _per_block_cast_impl will remain None, which is the original bug this PR aims to fix.

To correctly fix the initialization order dependency, we should first check the prerequisites, then call _lazy_init(), and finally check the GPU properties. This ensures _lazy_init() is always called when the feature is enabled and available, and not called otherwise.

Suggested change

Before:
    if not (envs.VLLM_USE_DEEP_GEMM and has_deep_gemm()):
        _lazy_init()
        if _per_block_cast_impl is None:
            return False

After:
    if not (envs.VLLM_USE_DEEP_GEMM and has_deep_gemm()):
        return False
    _lazy_init()
    if _per_block_cast_impl is None:
        return False

Member

I think Gemini is right here.

Contributor

+1

@smarterclayton (Author)

@yewentao256 as I mentioned in Slack

@smarterclayton (Author) commented Jul 23, 2025

However, after applying this fix, I get:

(EngineCore_3 pid=1043)   File "/app/vllm/vllm/model_executor/layers/fused_moe/batched_deep_gemm_moe.py", line 289, in apply
(EngineCore_3 pid=1043)     fp8_m_grouped_gemm_nt_masked((a1q, a1q_scale), (w1, w1_scale),
(EngineCore_3 pid=1043)   File "/app/vllm/vllm/utils/deep_gemm.py", line 106, in fp8_m_grouped_gemm_nt_masked
(EngineCore_3 pid=1043)     return _grouped_masked_impl(*args, **kwargs)
(EngineCore_3 pid=1043)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_3 pid=1043) RuntimeError: Failed: Assertion error csrc/jit_kernels/impls/smxx_layout.hpp:136 'mn % 4 == 0 and num_groups == 1'

My understanding was that DeepGEMM + DeepSeek-R1 on B200 should be working (and this PR assumes that is the case), but this failure suggests something is missing (I'm on HEAD of DeepGEMM and HEAD of vLLM).

@yewentao256 (Contributor) left a comment

Thanks for the fix! Just update it per Gemini's suggestion and we are ready to merge.

Comment on lines 24 to 27
if not (envs.VLLM_USE_DEEP_GEMM and has_deep_gemm()):
    _lazy_init()
    if _per_block_cast_impl is None:
        return False
Contributor

+1

@yewentao256 (Contributor)

However, after applying this fix, I get:

(EngineCore_3 pid=1043)   File "/app/vllm/vllm/model_executor/layers/fused_moe/batched_deep_gemm_moe.py", line 289, in apply
(EngineCore_3 pid=1043)     fp8_m_grouped_gemm_nt_masked((a1q, a1q_scale), (w1, w1_scale),
(EngineCore_3 pid=1043)   File "/app/vllm/vllm/utils/deep_gemm.py", line 106, in fp8_m_grouped_gemm_nt_masked
(EngineCore_3 pid=1043)     return _grouped_masked_impl(*args, **kwargs)
(EngineCore_3 pid=1043)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_3 pid=1043) RuntimeError: Failed: Assertion error csrc/jit_kernels/impls/smxx_layout.hpp:136 'mn % 4 == 0 and num_groups == 1'

My understanding was that DeepGEMM+DeepSeek-R1 on B200 should be working (and this PR assumes that was the case), but this failure suggests there is something missing (i'm on HEAD of DeepGEMM and HEAD of vLLM).

Thanks for letting me know, let me take a look

@smarterclayton force-pushed the lazy_init_blackwell_check branch 3 times, most recently from 4753086 to e59b90e on July 24, 2025 at 16:22
@smarterclayton (Author) commented Jul 24, 2025

Further debugging showed that the original proposed fix resulted in a CUDA initialization error when calling cuda_get_device_properties. We are still debugging why, possibly due to the fork/spawn. In the meantime, I've updated this to a version that uses current_platform.is_cuda() and current_platform.cuda.is_device_capability(100) to avoid having a separate code path, and it initializes without errors (although I'm still hitting the smxx_layout assertion failure).
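For context, the platform-based variant described above is roughly the sketch below. It assumes vLLM's current_platform.is_cuda() and current_platform.is_device_capability() helpers; the _is_blackwell wrapper name is hypothetical, not necessarily the name used in the diff.

```python
from vllm.platforms import current_platform


def _is_blackwell() -> bool:
    # Ask the platform layer instead of querying CUDA device properties
    # directly, which sidesteps the CUDA-init-in-forked-process problem above.
    # Compute capability 10.0 corresponds to SM100 (Blackwell).
    return current_platform.is_cuda() and current_platform.is_device_capability(100)
```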

Current CLI that triggers the failure on these GCP A4 (B200) nodes with Ubuntu 24 and the latest NVIDIA driver (IBGDA + nvidia_peermem + the GCP-recommended NCCL):

vllm serve Qwen/Qwen3-30B-A3B-FP8 --port 8000 --enforce-eager --disable-log-requests --enable-expert-parallel --tensor-parallel-size 1 --data-parallel-size 2  --trust-remote-code

@mgoin added the ready (ONLY add when PR is ready to merge / full CI is needed) and bug (Something isn't working) labels on Jul 24, 2025
@yewentao256 (Contributor) left a comment

Thanks for the work! Looks good to me

DeepGEMM on Blackwell was being ignored because _lazy_init had
not been invoked yet. Update the function to always invoke
_lazy_init, regardless of the caller.

As a consequence of this change, CUDA init was failing, likely
due to the fork/spawn early during initialization. Since
this is generic code, switch from invoking
cuda_get_device_properties to a platform check.

Signed-off-by: Clayton Coleman <smarterclayton@gmail.com>
@smarterclayton force-pushed the lazy_init_blackwell_check branch from e59b90e to 1c5feb9 on July 24, 2025 at 19:52
Labels: bug (Something isn't working), deepseek (Related to DeepSeek models), ready (ONLY add when PR is ready to merge / full CI is needed)