Skip to content

[BUG] NCCL hang during rollout_weight_sync (ProcessGroup watchdog timeout) #283

@Bat-Reality

Description

@Bat-Reality

Bug Description

Please provide a detailed description of the issue you encountered.

Environment Information

  • Python Version: 3.12.4
  • GPU: NVIDIA L20-40G * 8
  • CUDA Version: 12.4
  • Installation Method: git clone
  • Trinity-RFT Version: 0.3.0.dev0

Steps to Reproduce

Please provide a minimal, self-contained, and reproducible example.

  1. trinity run --config examples/XXX/XXX.yaml

Expected Behavior

No interruptions.

Actual Behavior

During multi-GPU training, the process occasionally crashes with a NCCL watchdog hang. The error happens at rollout_weight_sync and terminates the job with SIGABRT.

Log Information

ProcessGroupNCCL.cpp:1554 [PG ID 6 PG GUID rollout_weight_sync Rank 3] 
ProcessGroup watchdog hang due to timeout
*** SIGABRT received ...
Fatal Python error: Aborted
Image

Question

Is this a known NCCL communication hang issue?
Any recommended configuration or workaround to prevent the crash?

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions