While trying to reproduce the results in the RLHFlow paper, I ran into an error when running get_rewards.py on 8 A100 GPUs:
[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=3, Timeout(ms)=1800000) ran for 1800005 milliseconds before timing out.
At first I suspected a mismatched CUDA or NVCC version, but fixing the CUDA / NVCC version did not help.
In the end, adding the following two lines to the shell script solved the problem:
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
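For anyone hitting the same timeout, here is a minimal sketch of how these settings might sit at the top of a launch script. The two NCCL variables are the ones from the fix above; `NCCL_DEBUG=INFO` is an optional addition of mine that makes NCCL log which transport it selects, and the commented-out launch line is a hypothetical placeholder, since the exact command depends on how you start get_rewards.py.

```shell
#!/usr/bin/env bash
# Workaround for the ALLGATHER watchdog timeout reported above.
# Disable NCCL's direct GPU-to-GPU peer-to-peer transport.
export NCCL_P2P_DISABLE=1
# Disable NCCL's InfiniBand transport.
export NCCL_IB_DISABLE=1

# Optional (my assumption, not part of the original fix):
# print NCCL initialization and transport details to help diagnose hangs.
export NCCL_DEBUG=INFO

# Hypothetical launch line; replace with the command you actually use.
# accelerate launch get_rewards.py
```

Note that disabling these transports makes NCCL fall back to slower communication paths, so this is best treated as a workaround rather than a tuned configuration.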
Suggestion: could we add a document that records common errors like this one and how to fix them?