Hi,
I am trying to initialize a memory barrier on an H100 in CuTe Python.
I have this quite short reproducer:
import torch
import cutlass
import cutlass.cute as cute


@cute.kernel
def barrier_kernel():
    tidx, _, _ = cute.arch.thread_idx()
    smem = cutlass.utils.SmemAllocator()
    mbar = smem.allocate_array(cutlass.Uint64, 1)
    if tidx == 0:
        cute.arch.mbarrier_init_tx_bytes(mbar, 1)
        cute.arch.mbarrier_init_fence()
        cute.printf(mbar)
    cute.arch.sync_threads()


@cute.jit
def launch():
    barrier_kernel().launch(grid=(1, 1, 1), block=(128, 1, 1), smem=1024)


cutlass.cuda.initialize_cuda_context()
cute.compile(launch)()
torch.cuda.synchronize()
This crashes with "unspecified launch failure". If I use mbarrier_init_arrive_cnt instead, it seems to work fine. However, with mbarrier_init_tx_bytes (which I plan to use for TMA), it crashes. It's very possible that I am doing something wrong, but sadly the error messages are not helping me debug this. Any help would be really appreciated :)