[BUG] NVFP4 Grouped GEMM fails on SM120a GeForce with cudaFuncSetAttribute error #2692

@voipmonitor

Description

Which component has the problem?

CuTe DSL

Bug Report

Environment

  • GPU: NVIDIA RTX PRO 6000 Blackwell Server Edition
  • Compute Capability: 12.0 (SM120a GeForce variant)
  • CUDA Version: 13.0.88
  • Driver Version: 580.82.09
  • CUTLASS Version: v4.2.1
  • OS: Linux 6.14.0-33-generic

GPU Properties

Shared Memory per Block (default): 48 KB (49152 bytes)
Shared Memory per Block (opt-in): 99 KB (101376 bytes)
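
These values can be queried directly via the CUDA runtime; a minimal sketch (device 0 is assumed):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
      int smem_default = 0, smem_optin = 0;
      // Default per-block limit, and the opt-in maximum a kernel may request
      // through cudaFuncAttributeMaxDynamicSharedMemorySize.
      cudaDeviceGetAttribute(&smem_default, cudaDevAttrMaxSharedMemoryPerBlock, 0);
      cudaDeviceGetAttribute(&smem_optin, cudaDevAttrMaxSharedMemoryPerBlockOptin, 0);
      printf("default: %d bytes, opt-in: %d bytes\n", smem_default, smem_optin);
      return 0;
    }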

Problem Description

The CUTLASS example 79d_blackwell_geforce_nvfp4_grouped_gemm fails on an SM120a GeForce GPU with Status::kErrorInternal during initialize(), even though the kernel's shared memory requirement (85 KB) is well below the GPU's opt-in maximum (99 KB).

Reproduction Steps

  1. Compile the example:
    cd examples/79_blackwell_geforce_gemm
    nvcc -std=c++17 -arch=sm_121 \
      -I../../include \
      -I../../tools/util/include \
      -I../common \
      79d_blackwell_geforce_nvfp4_grouped_gemm.cu -o 79d_test

  2. Run with test dimensions:
    ./79d_test --m=1024 --n=768 --k=5120 --iterations=1

Observed Behavior

DEBUG: Kernel shared memory info:
CollectiveMainloop SharedStorage size: 74752 bytes
CollectiveEpilogue SharedStorage size: 11264 bytes
Total kernel SharedStorage size: 87040 bytes (85 KB)

DEBUG: GPU max shared memory: 101376 bytes (99 KB)

DEBUG: can_implement() returned: Success (0)
DEBUG: maximum_active_blocks() returned: -1
DEBUG: initialize() returned: Error Internal (7)
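
For reference, the SharedStorage numbers above come from compile-time sizeof on the kernel types; a sketch of the debug printout, assuming Gemm is the GemmUniversalAdapter type instantiated in example 79d (the nested type names follow the CUTLASS 3.x convention):

    // Inside host code, after the Gemm type from the example is defined.
    using Kernel = Gemm::GemmKernel;
    printf("CollectiveMainloop SharedStorage size: %zu bytes\n",
           sizeof(Kernel::CollectiveMainloop::SharedStorage));
    printf("CollectiveEpilogue SharedStorage size: %zu bytes\n",
           sizeof(Kernel::CollectiveEpilogue::SharedStorage));
    printf("Total kernel SharedStorage size: %d bytes\n", Kernel::SharedStorageSize);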

Root Cause

The failure occurs in gemm_universal_adapter.h at lines 337-344:

cudaError_t result = cudaFuncSetAttribute(
    device_kernel<GemmKernel>,
    cudaFuncAttributeMaxDynamicSharedMemorySize,
    smem_size);  // 87040 bytes
if (cudaSuccess != result) {
  return Status::kErrorInternal;
}

The cudaFuncSetAttribute call fails even though:

  • Kernel requires: 85 KB (87040 bytes)
  • GPU supports: 99 KB (101376 bytes)
  • 85 KB < 99 KB ✅
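
To separate CUTLASS from the driver, the same opt-in request can be attempted on a trivial kernel. A minimal standalone check (illustrative, not taken from the example):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Trivial kernel that exists only so its shared memory attribute can be set.
    __global__ void dummy_kernel(char* out) {
      extern __shared__ char smem[];
      if (threadIdx.x == 0) out[0] = smem[0];
    }

    int main() {
      // Request the same 87040 bytes the CUTLASS kernel needs.
      cudaError_t err = cudaFuncSetAttribute(
          dummy_kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, 87040);
      printf("cudaFuncSetAttribute(87040): %s\n", cudaGetErrorString(err));
      return err == cudaSuccess ? 0 : 1;
    }

If this succeeds, the 87040-byte request is within hardware limits and the failure is specific to the CUTLASS kernel function itself (the per-function opt-in ceiling is the device opt-in maximum minus the kernel's static shared memory usage, so a kernel with static shared memory has less dynamic headroom).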

Analysis

  1. can_implement() succeeds - CUTLASS believes the kernel is supported
  2. maximum_active_blocks() returns -1 - cudaFuncSetAttribute fails internally
  3. cudaGetLastError() returns cudaSuccess - the error bit is cleared inside CUTLASS before the caller can observe it (see the probe sketched after this list)
  4. The kernel uses 128×128×128 tile size with StageCountAutoCarveout
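
Because the adapter clears the error bit before returning kErrorInternal, the underlying cudaError_t can be recovered by repeating the attribute call directly on the kernel entry point. A hedged probe, assuming the cutlass::device_kernel<GemmKernel> entry point used by GemmUniversalAdapter:

    #include <cutlass/device_kernel.h>

    // Inside host code, after the Gemm type from the example is defined.
    using Kernel = Gemm::GemmKernel;
    int smem_size = int(sizeof(Kernel::SharedStorage));
    cudaError_t err = cudaFuncSetAttribute(
        cutlass::device_kernel<Kernel>,
        cudaFuncAttributeMaxDynamicSharedMemorySize,
        smem_size);
    printf("cudaFuncSetAttribute on device_kernel: %s (%s)\n",
           cudaGetErrorName(err), cudaGetErrorString(err));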

Expected Behavior

The kernel should initialize successfully and execute on SM120a GeForce, as the shared memory requirement is within hardware limits.

Questions

  1. Is there a hidden limitation on SM120a GeForce (e.g., register usage, TMA descriptor limits) that prevents this kernel configuration?
  2. Should can_implement() detect this issue and return kErrorNotSupported?
  3. Is NVFP4 blockscaled grouped GEMM only supported on SM100 datacenter (B100/B200) and not on SM120a GeForce variants?

Additional Notes

  • The same issue occurs with both Cooperative and Pingpong kernel schedules
  • Compiling with -arch=sm_120 gives "Arch conditional MMA instruction" errors, suggesting SM121 is required
  • With -arch=sm_121, compilation succeeds but runtime initialization fails

Is this a known limitation of SM120a GeForce, or is there a configuration change needed to support this kernel on GeForce hardware?
