[BUG] Unspecified error when initializing mbar on H100 w/ CuTe Python DSL #2364

@axelfeldmann

Hi,

I am trying to initialize a memory barrier on an H100 in CuTe Python.

I have this quite short reproducer:

import torch
import cutlass
import cutlass.cute as cute

@cute.kernel
def barrier_kernel():
    tidx, _, _ = cute.arch.thread_idx()

    smem = cutlass.utils.SmemAllocator()
    mbar = smem.allocate_array(cutlass.Uint64, 1)

    if tidx == 0:
        cute.arch.mbarrier_init_tx_bytes(mbar, 1)
        cute.arch.mbarrier_init_fence()
        cute.printf(mbar)

    cute.arch.sync_threads()

@cute.jit
def launch():
    barrier_kernel().launch(grid=(1, 1, 1), block=(128, 1, 1), smem=1024)

cutlass.cuda.initialize_cuda_context()
cute.compile(launch)()
torch.cuda.synchronize()

This crashes with "unspecified launch failure". If I call mbarrier_init_arrive_cnt instead, it seems to work fine. With mbarrier_init_tx_bytes (which I plan to use for TMA), however, it crashes. It's very possible that I am doing something wrong, but unfortunately the error message does not help me debug this. Any help would be really appreciated :)

Labels: CuTe DSL, bug