Description
Which component has the problem?
CuTe DSL
Bug Report
Environment
- GPU: NVIDIA RTX PRO 6000 Blackwell Server Edition
- Compute Capability: 12.0 (SM120a GeForce variant)
- CUDA Version: 13.0.88
- Driver Version: 580.82.09
- CUTLASS Version: v4.2.1
- OS: Linux 6.14.0-33-generic
GPU Properties
- Shared Memory per Block (default): 48 KB (49152 bytes)
- Shared Memory per Block (opt-in): 99 KB (101376 bytes)
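The limits above can be confirmed at runtime. A minimal standalone sketch (not part of the CUTLASS example) querying the relevant device attributes:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int dev = 0;
  int smem_default = 0, smem_optin = 0;
  // Default per-block limit (48 KB here) and the opt-in maximum (99 KB here).
  cudaDeviceGetAttribute(&smem_default, cudaDevAttrMaxSharedMemoryPerBlock, dev);
  cudaDeviceGetAttribute(&smem_optin, cudaDevAttrMaxSharedMemoryPerBlockOptin, dev);
  printf("default: %d bytes, opt-in: %d bytes\n", smem_default, smem_optin);
  return 0;
}
```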
Problem Description
The CUTLASS example 79d_blackwell_geforce_nvfp4_grouped_gemm fails on SM120a GeForce GPU with Status::kErrorInternal during initialize(), even though the kernel's shared memory
requirement (85 KB) is well below the GPU's maximum (99 KB).
Reproduction Steps
- Compile the example:

  cd examples/79_blackwell_geforce_gemm
  nvcc -std=c++17 -arch=sm_121 \
    -I../../include \
    -I../../tools/util/include \
    -I../common \
    79d_blackwell_geforce_nvfp4_grouped_gemm.cu -o 79d_test

- Run with test dimensions:
./79d_test --m=1024 --n=768 --k=5120 --iterations=1
Observed Behavior
DEBUG: Kernel shared memory info:
CollectiveMainloop SharedStorage size: 74752 bytes
CollectiveEpilogue SharedStorage size: 11264 bytes
Total kernel SharedStorage size: 87040 bytes (85 KB)
DEBUG: GPU max shared memory: 101376 bytes (99 KB)
DEBUG: can_implement() returned: Success (0)
DEBUG: maximum_active_blocks() returned: -1
DEBUG: initialize() returned: Error Internal (7)
Root Cause
The failure occurs in gemm_universal_adapter.h at lines 337-344:
cudaError_t result = cudaFuncSetAttribute(
device_kernel,
cudaFuncAttributeMaxDynamicSharedMemorySize,
smem_size); // 87040 bytes
if (cudaSuccess != result) {
return Status::kErrorInternal;
}
The cudaFuncSetAttribute call fails even though:
- Kernel requires: 85 KB (87040 bytes)
- GPU supports: 99 KB (101376 bytes)
- 85 KB < 99 KB ✅
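To isolate whether the failure is CUTLASS-specific or a driver/hardware limit, a standalone sketch (using a hypothetical `dummy_kernel`, not from the example) can request the same 87040 bytes on a trivial kernel:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel that uses dynamic shared memory.
__global__ void dummy_kernel(char* out) {
  extern __shared__ char smem[];
  out[threadIdx.x] = smem[threadIdx.x];
}

int main() {
  // Request the same dynamic shared memory size the CUTLASS kernel needs.
  cudaError_t err = cudaFuncSetAttribute(
      dummy_kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, 87040);
  printf("cudaFuncSetAttribute(87040): %s\n", cudaGetErrorString(err));
  return 0;
}
```

If this succeeds for a trivial kernel, the 87040-byte request itself is fine and the failure is specific to the instantiated CUTLASS kernel (e.g. per-kernel resource limits), not the raw shared-memory budget.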
Analysis
- can_implement() succeeds - CUTLASS believes the kernel is supported
- maximum_active_blocks() returns -1 - cudaFuncSetAttribute fails internally
- cudaGetLastError() returns cudaSuccess - the error is cleared by CUTLASS before it can be observed
- The kernel uses 128×128×128 tile size with StageCountAutoCarveout
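Because CUTLASS clears the error before it can be observed, the occupancy query it performs internally can be reproduced directly against the kernel symbol to capture the error string. A sketch, where `GemmKernel` stands in for the kernel type instantiated in the example:

```cuda
// Sketch: reproduce the occupancy query CUTLASS performs internally.
// `GemmKernel` is a placeholder for the instantiated kernel type.
int smem_size = 87040;  // total SharedStorage size reported above
cudaError_t err = cudaFuncSetAttribute(
    cutlass::device_kernel<GemmKernel>,
    cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size);
if (err != cudaSuccess) {
  // Capture the error string before anything clears the error state.
  printf("cudaFuncSetAttribute failed: %s\n", cudaGetErrorString(err));
} else {
  int max_blocks = 0;
  // CUTLASS's maximum_active_blocks() wraps this runtime call.
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(
      &max_blocks, cutlass::device_kernel<GemmKernel>,
      GemmKernel::MaxThreadsPerBlock, smem_size);
  printf("max active blocks/SM: %d\n", max_blocks);
}
```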
Expected Behavior
The kernel should initialize successfully and execute on SM120a GeForce, as the shared memory requirement is within hardware limits.
Questions
- Is there a hidden limitation on SM120a GeForce (e.g., register usage, TMA descriptor limits) that prevents this kernel configuration?
- Should can_implement() detect this issue and return kErrorNotSupported?
- Is NVFP4 blockscaled grouped GEMM only supported on SM100 datacenter (B100/B200) and not on SM120a GeForce variants?
Additional Notes
- The same issue occurs with both Cooperative and Pingpong kernel schedules
- Compiling with -arch=sm_120 gives "Arch conditional MMA instruction" errors, suggesting SM121 is required
- With -arch=sm_121, compilation succeeds but runtime initialization fails
Is this a known limitation of SM120a GeForce, or is there a configuration change needed to support this kernel on GeForce hardware?