Open
Labels: bug (Something isn't working)
Bug Description
Please provide a detailed description of the issue you encountered.
Environment Information
- Python Version: 3.12.4
- GPU: NVIDIA L20-40G * 8
- CUDA Version: 12.4
- Installation Method: git clone
- Trinity-RFT Version: 0.3.0.dev0
Steps to Reproduce
Please provide a minimal, self-contained, and reproducible example.
- trinity run --config examples/XXX/XXX.yaml
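If more detail is needed, the same command could be rerun with verbose collective logging switched on. The following is only a sketch: `nccl_debug_env` is a hypothetical helper name, and it assumes the job inherits its environment from the launching process. `NCCL_DEBUG` and `TORCH_DISTRIBUTED_DEBUG` are the standard NCCL and PyTorch variables; nothing here is specific to Trinity-RFT.

```python
import os


def nccl_debug_env(base=None):
    """Return a copy of the environment with verbose NCCL/PyTorch logging enabled.

    `base` defaults to the current process environment; the input is not mutated.
    """
    env = dict(os.environ if base is None else base)
    # NCCL prints initialization and communicator details at INFO level.
    env["NCCL_DEBUG"] = "INFO"
    # PyTorch adds extra consistency checks and logging for collectives.
    env["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
    return env


# Usage (not executed here; the config path is the placeholder from this report):
#   import subprocess
#   subprocess.run(
#       ["trinity", "run", "--config", "examples/XXX/XXX.yaml"],
#       env=nccl_debug_env(),
#   )
```

The extra logging is verbose and may slow the run, so it is probably best limited to reproduction attempts.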
Expected Behavior
No interruptions.
Actual Behavior
During multi-GPU training, the process occasionally crashes when the NCCL watchdog reports a hang due to timeout. The error occurs in the rollout_weight_sync process group, and the job is terminated with SIGABRT.
Log Information
ProcessGroupNCCL.cpp:1554 [PG ID 6 PG GUID rollout_weight_sync Rank 3]
ProcessGroup watchdog hang due to timeout
*** SIGABRT received ...
Fatal Python error: Aborted
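When several ranks write to the same log, it can help to see which ranks reported the watchdog line for a given process group. A minimal sketch, assuming every such line carries the same bracketed header as the one above (`[PG ID ... PG GUID ... Rank ...]`); `parse_pg_header` and `ranks_reporting` are hypothetical helper names, not part of Trinity-RFT or PyTorch.

```python
import re

# Matches the bracketed header seen in the log above, e.g.
# "[PG ID 6 PG GUID rollout_weight_sync Rank 3]".
PG_HEADER = re.compile(
    r"\[PG ID (?P<pg_id>\d+) PG GUID (?P<guid>\S+) Rank (?P<rank>\d+)\]"
)


def parse_pg_header(line):
    """Return pg_id, guid and rank from a log line, or None if no header is found."""
    match = PG_HEADER.search(line)
    if match is None:
        return None
    return {
        "pg_id": int(match["pg_id"]),
        "guid": match["guid"],
        "rank": int(match["rank"]),
    }


def ranks_reporting(lines, guid):
    """Return the sorted ranks whose log lines mention the given process group GUID."""
    ranks = set()
    for line in lines:
        header = parse_pg_header(line)
        if header is not None and header["guid"] == guid:
            ranks.add(header["rank"])
    return sorted(ranks)
```

For the excerpt above, `ranks_reporting(log_lines, "rollout_weight_sync")` would list only rank 3, since that is the only header line included in this report.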

Question
Is this a known NCCL communication hang issue?
Any recommended configuration or workaround to prevent the crash?