[QST] Why does the 77_blackwell_fmha example not work on Blackwell architecture RTX5090 GPUs? #2513

@luhan-wang

Hi, cutlass team!

I'm trying example 77_blackwell_fmha. It works fine on B200 GPUs but fails on RTX5090 GPUs, which are also Blackwell architecture. My compile command is as follows:

nvcc -arch=sm_120a -I./include -I./tools/util/include -I./examples/77_blackwell_fmha -std=c++17 --expt-relaxed-constexpr \
    ./examples/77_blackwell_fmha/77_blackwell_fmha_bwd.cu -o blackwell_fmha_bwd

The error is reported as follows:

###### B 16 H 16 Q 1024 K 1024 D 128 Backward Full #SM 170
[ ERROR: CUDA Runtime ] ./include/cutlass/cluster_launch.hpp:248: invalid argument
Failed to launch the CUTLASS kernel. Last CUDA error is: no error
[FAIL] tma : 0 TFLOPS/s

Why is the above error reported? Is 77_blackwell_fmha theoretically able to run on the RTX5090, or is it only optimized for the B200?

Thank you very much!
