Post-training suffering from CUDA error #10

@youthHan

Thanks for releasing the training code and pipeline. While trying to reproduce the libero-long results, I ran into CUDA errors and had to wrap the following call in a with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION): block:

attn_output = F.scaled_dot_product_attention(

However, this workaround greatly slows down training: at the current speed, 80000 steps would take 21 days to finish.

I tried different Docker images (including the one in this repo) with cu124+torch2.6 and cu126+torch2.7. All of these attempts resulted in CUDA errors. Could anyone who has successfully started training share their libraries and versions?
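To make it easier to compare setups, a small script like the one below could be used to report installed versions. This is only a sketch: the package list is my guess at what might matter here and is not taken from the repo's requirements, and `collect_versions` is a hypothetical helper name.

```python
import platform
from importlib import metadata


def collect_versions(packages):
    """Return a dict mapping each package name to its installed version."""
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


# Assumed list of relevant packages; adjust to the repo's actual dependencies.
PACKAGES = ["torch", "torchvision", "transformers", "flash-attn", "xformers"]

versions = collect_versions(PACKAGES)
for name, version in versions.items():
    print(f"{name}: {version}")
```

PyTorch also ships `python -m torch.utils.collect_env`, which additionally prints the CUDA, driver, and GPU details.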
