Hi, cutlass team!
I'm trying example 77_blackwell_fmha, which works fine on B200 GPUs but fails on RTX 5090 GPUs, which are also Blackwell architecture. My compile command is as follows:
nvcc -arch=sm_120a -I./include -I./tools/util/include -I./examples/77_blackwell_fmha -std=c++17 --expt-relaxed-constexpr \
    ./examples/77_blackwell_fmha/77_blackwell_fmha_bwd.cu -o blackwell_fmha_bwd
The error is reported as follows:
###### B 16 H 16 Q 1024 K 1024 D 128 Backward Full #SM 170
[ ERROR: CUDA Runtime ] ./include/cutlass/cluster_launch.hpp:248: invalid argument
Failed to launch the CUTLASS kernel. Last CUDA error is: no error
[FAIL] tma : 0 TFLOPS/s
I would like to know why this error is reported. Is 77_blackwell_fmha theoretically able to run on the RTX 5090, or is it optimized specifically for the B200?
Thank you very much!