Describe the bug
The CuTe tutorial examples (e.g. https://github.com/NVIDIA/cutlass/blob/main/examples/cute/tutorial/blackwell/01_mma_sm100.cu#L226) define a TMEM-backed tensor that stores the GEMM output, but never actually allocate TMEM space using `tcgen05.alloc` and related instructions. I believe this creates a possible race condition: multiple CTAs occupying the same SM could end up writing to the same TMEM region.
(I don't have access to Blackwell, so I can't confirm this myself; for what it's worth, a coworker with Blackwell access ran that kernel on large problem sizes and saw no validation errors.)
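For reference, here is a hedged sketch of the allocation pattern the issue says is missing, based on the `tcgen05` instructions in the PTX ISA. The column count, warp choice, and synchronization are illustrative assumptions, not the tutorial's actual fix; consult the PTX ISA and the CUTLASS SM100 code for the exact contract.

```cuda
// Sketch only (assumes sm_100a): one warp allocates TMEM columns with
// tcgen05.alloc, the CTA uses them, and the same warp frees them so that
// other CTAs resident on the SM can reuse the space.
#include <cstdint>

__global__ void tmem_alloc_sketch() {
  // tcgen05.alloc writes the base address of the allocated TMEM columns
  // into shared memory.
  __shared__ uint32_t tmem_base;
  constexpr uint32_t num_cols = 128;  // illustrative column count

  if (threadIdx.x < 32) {  // a single warp performs the allocation
    asm volatile(
        "tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%0], %1;\n"
        :: "r"(static_cast<uint32_t>(__cvta_generic_to_shared(&tmem_base))),
           "n"(num_cols));
  }
  __syncthreads();

  uint32_t tmem_addr = tmem_base;  // base for the TMEM-backed accumulator
  (void)tmem_addr;
  // ... MMA accumulates into the allocated TMEM columns here ...
  __syncthreads();

  if (threadIdx.x < 32) {
    // Release the columns; without this (and the alloc above), CTAs could
    // implicitly share the same TMEM region.
    asm volatile("tcgen05.dealloc.cta_group::1.sync.aligned.b32 %0, %1;\n"
                 :: "r"(tmem_addr), "n"(num_cols));
  }
}
```

The key point for the race concern: `tcgen05.alloc` is what gives each CTA its own disjoint set of TMEM columns; addressing TMEM without it leaves co-resident CTAs with no guarantee of separation.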
Are there any CUDA tools that can help discover potential issues like this with TMEM races? Or is this entirely the programmer's responsibility, with no tooling to certify that a Blackwell kernel is race-free?