Description
Which component has the problem?
CuTe DSL
Bug Report
Environment
- GPU: NVIDIA RTX PRO 6000 Blackwell Server Edition
- Compute Capability: 12.0 (SM120a GeForce variant)
- CUDA Version: 13.0.88
- Driver Version: 580.82.09
- CUTLASS Version: v4.2.1
- OS: Linux 6.14.0-33-generic
GPU Properties
- Shared Memory per Block (default): 48 KB (49152 bytes)
- Shared Memory per Block (opt-in): 99 KB (101376 bytes)
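The limits above can be confirmed at runtime. A minimal standalone sketch (not part of the CUTLASS example) querying the relevant device attributes:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int dev = 0;
  int smem_default = 0, smem_optin = 0;
  // Default per-block limit (48 KB here) and the opt-in maximum (99 KB here).
  cudaDeviceGetAttribute(&smem_default, cudaDevAttrMaxSharedMemoryPerBlock, dev);
  cudaDeviceGetAttribute(&smem_optin, cudaDevAttrMaxSharedMemoryPerBlockOptin, dev);
  printf("default: %d bytes, opt-in: %d bytes\n", smem_default, smem_optin);
  return 0;
}
```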
Problem Description
The CUTLASS example 79d_blackwell_geforce_nvfp4_grouped_gemm fails on SM120a GeForce GPU with Status::kErrorInternal during initialize(), even though the kernel's shared memory
requirement (85 KB) is well below the GPU's maximum (99 KB).
Reproduction Steps
- Compile the example:

  cd examples/79_blackwell_geforce_gemm
  nvcc -std=c++17 -arch=sm_121 \
    -I../../include \
    -I../../tools/util/include \
    -I../common \
    79d_blackwell_geforce_nvfp4_grouped_gemm.cu -o 79d_test

- Run with test dimensions:
./79d_test --m=1024 --n=768 --k=5120 --iterations=1
Observed Behavior
DEBUG: Kernel shared memory info:
CollectiveMainloop SharedStorage size: 74752 bytes
CollectiveEpilogue SharedStorage size: 11264 bytes
Total kernel SharedStorage size: 87040 bytes (85 KB)
DEBUG: GPU max shared memory: 101376 bytes (99 KB)
DEBUG: can_implement() returned: Success (0)
DEBUG: maximum_active_blocks() returned: -1
DEBUG: initialize() returned: Error Internal (7)
Root Cause
The failure occurs in gemm_universal_adapter.h at lines 337-344:
cudaError_t result = cudaFuncSetAttribute(
device_kernel,
cudaFuncAttributeMaxDynamicSharedMemorySize,
smem_size); // 87040 bytes
if (cudaSuccess != result) {
return Status::kErrorInternal;
}
The cudaFuncSetAttribute call fails even though:
- Kernel requires: 85 KB (87040 bytes)
- GPU supports: 99 KB (101376 bytes)
- 85 KB < 99 KB ✅
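To isolate whether the failure is CUTLASS-specific or a driver/hardware limit, a standalone sketch (using a hypothetical `dummy_kernel`, not from the example) can request the same 87040 bytes on a trivial kernel:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel that uses dynamic shared memory.
__global__ void dummy_kernel(char* out) {
  extern __shared__ char smem[];
  out[threadIdx.x] = smem[threadIdx.x];
}

int main() {
  // Request the same dynamic shared memory size the CUTLASS kernel needs.
  cudaError_t err = cudaFuncSetAttribute(
      dummy_kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, 87040);
  printf("cudaFuncSetAttribute(87040): %s\n", cudaGetErrorString(err));
  return 0;
}
```

If this succeeds for a trivial kernel, the 87040-byte request itself is fine and the failure is specific to the instantiated CUTLASS kernel (e.g. per-kernel resource limits), not the raw shared-memory budget.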
Analysis
- can_implement() succeeds - CUTLASS believes the kernel is supported
- maximum_active_blocks() returns -1 - cudaFuncSetAttribute fails internally
- cudaGetLastError() returns cudaSuccess - the error is cleared by CUTLASS before it can be observed
- The kernel uses 128×128×128 tile size with StageCountAutoCarveout
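Because CUTLASS clears the error before it can be observed, the occupancy query it performs internally can be reproduced directly against the kernel symbol to capture the error string. A sketch, where `GemmKernel` stands in for the kernel type instantiated in the example:

```cuda
// Sketch: reproduce the occupancy query CUTLASS performs internally.
// `GemmKernel` is a placeholder for the instantiated kernel type.
int smem_size = 87040;  // total SharedStorage size reported above
cudaError_t err = cudaFuncSetAttribute(
    cutlass::device_kernel<GemmKernel>,
    cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size);
if (err != cudaSuccess) {
  // Capture the error string before anything clears the error state.
  printf("cudaFuncSetAttribute failed: %s\n", cudaGetErrorString(err));
} else {
  int max_blocks = 0;
  // CUTLASS's maximum_active_blocks() wraps this runtime call.
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(
      &max_blocks, cutlass::device_kernel<GemmKernel>,
      GemmKernel::MaxThreadsPerBlock, smem_size);
  printf("max active blocks/SM: %d\n", max_blocks);
}
```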
Expected Behavior
The kernel should initialize successfully and execute on SM120a GeForce, as the shared memory requirement is within hardware limits.
Questions
- Is there a hidden limitation on SM120a GeForce (e.g., register usage, TMA descriptor limits) that prevents this kernel configuration?
- Should can_implement() detect this issue and return kErrorNotSupported?
- Is NVFP4 blockscaled grouped GEMM only supported on SM100 datacenter (B100/B200) and not on SM120a GeForce variants?
Additional Notes
- The same issue occurs with both Cooperative and Pingpong kernel schedules
- Compiling with -arch=sm_120 gives "Arch conditional MMA instruction" errors, suggesting SM121 is required
- With -arch=sm_121, compilation succeeds but runtime initialization fails
Is this a known limitation of SM120a GeForce, or is there a configuration change needed to support this kernel on GeForce hardware?