Post-training suffering from CUDA error

Thanks for releasing the training code and pipeline. While trying to reproduce the libero-long results, I ran into CUDA errors and had to add a `with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):` wrapper at https://github.com/allenai/molmoact/blob/8895cbadecd7ae0453d46a50bf6a73c9c8760b33/olmo/nn/image_vit.py#L215

However, this slows training down considerably: 80,000 steps are currently projected to take 21 days.
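For reference, the per-step cost implied by those numbers can be worked out directly. The snippet below is a small arithmetic sketch; the 3-day target is a hypothetical figure used only to show the conversion in the other direction.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def seconds_per_step(total_steps: int, days: float) -> float:
    """Average wall-clock seconds per step for a run of the given length."""
    return days * SECONDS_PER_DAY / total_steps


def days_to_finish(total_steps: int, sec_per_step: float) -> float:
    """Wall-clock days needed at a given average step time."""
    return total_steps * sec_per_step / SECONDS_PER_DAY


if __name__ == "__main__":
    # Figures from the post: 80,000 steps projected to take 21 days.
    current = seconds_per_step(80_000, 21)
    print(f"current: {current:.2f} s/step")

    # Hypothetical target: finishing the same run in 3 days.
    target = seconds_per_step(80_000, 3)
    print(f"needed for a 3-day run: {target:.2f} s/step")
```

At 21 days for 80,000 steps the run averages about 22.7 seconds per step.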

I tried different Docker images (including the one in this repo), cu124+torch2.6, and cu126+torch2.7. All of these attempts resulted in CUDA errors. Could anyone who has successfully started training share their library versions?
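When comparing environments, a short script that prints the relevant versions makes replies easier to line up. This is a minimal sketch: the package list is an assumption about what might matter here, not something the repo specifies, and the PyTorch section is skipped if `torch` is not importable.

```python
import platform
from importlib import metadata

# Assumed list of packages worth reporting; adjust to the actual environment.
PACKAGES = ("torch", "torchvision", "transformers", "flash-attn", "xformers")


def collect_versions(packages=PACKAGES):
    """Return a dict of interpreter, platform, and installed package versions."""
    report = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


def collect_torch_details():
    """Return CUDA/cuDNN/GPU details from PyTorch, or an empty dict without torch."""
    try:
        import torch
    except ImportError:
        return {}
    details = {
        "torch.version.cuda": torch.version.cuda,
        "cudnn": torch.backends.cudnn.version(),
        "cuda_available": torch.cuda.is_available(),
    }
    if torch.cuda.is_available():
        details["gpu"] = torch.cuda.get_device_name(0)
    return details


if __name__ == "__main__":
    for key, value in {**collect_versions(), **collect_torch_details()}.items():
        print(f"{key}: {value}")
```

The NVIDIA driver version is not covered here; `nvidia-smi` reports it separately.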

Post-training suffering from CUDA error #10

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Post-training suffering from CUDA error #10

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions