I'm using detectron2 on an HPC machine with 4 GPUs per node. Training on a single node works fine, but when I try to launch on multiple nodes, for example 2 nodes with 8 GPUs in total (4 GPUs per node), I get:
RuntimeError: NCCL error in: ***/spack-src/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
From what I found in the PyTorch source code, that line raises the following error:
"Tensor list mustn't be larger than the number of available GPUs"
The following is the NCCL debug output:
```
lrdn1629:236813:236813 [1] NCCL INFO cudaDriverVersion 11080
lrdn1629:236813:236813 [1] NCCL INFO Bootstrap : Using ib0:10.128.31.149<0>
lrdn1629:236813:236813 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
lrdn1629:236813:236852 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [RO]; OOB ib0:10.128.31.149<0>
lrdn1629:236813:236852 [1] NCCL INFO Using network IB
lrdn1629:236813:236852 [1] NCCL INFO Call to connect returned Connection refused, retrying
lrdn1629:236813:236852 [1] NCCL INFO Call to connect returned Connection refused, retrying
lrdn1629:236813:236852 [1] NCCL INFO Call to connect returned Connection refused, retrying
....
lrdn1629:236813:236852 [1] NCCL INFO Call to connect returned Connection refused, retrying
lrdn1629:236813:236852 [1] misc/socket.cc:456 NCCL WARN Net : Connect to 10.128.31.153<46829> failed : Connection refused
lrdn1629:236813:236852 [1] NCCL INFO bootstrap.cc:256 -> 6
lrdn1629:236813:236852 [1] NCCL INFO init.cc:516 -> 6
lrdn1629:236813:236852 [1] NCCL INFO init.cc:1089 -> 6
lrdn1629:236813:236852 [1] NCCL INFO group.cc:64 -> 6 [Async thread]
lrdn1629:236813:236813 [1] NCCL INFO group.cc:421 -> 3
lrdn1629:236813:236813 [1] NCCL INFO group.cc:106 -> 3
lrdn1629:236813:236813 [1] NCCL INFO comm 0x3843eb10 rank 1 nranks 8 cudaDev 1 busId 56000 - Abort COMPLETE
```
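For completeness, this is the NCCL environment I'm assuming when the log above was captured; `NCCL_DEBUG=INFO` is the standard way to get those lines, and the commented-out interface pinning is only something I'd consider trying, not something already verified to help.

```python
# Assumed NCCL debug settings; these must be set before the NCCL communicator
# is created (i.e. before init_process_group / detectron2's launch()).
import os

os.environ.setdefault("NCCL_DEBUG", "INFO")             # produces the "NCCL INFO ..." lines above
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,NET")  # extra detail on bootstrap and network setup
# Something I would consider trying, since bootstrap auto-selected ib0:
# os.environ.setdefault("NCCL_SOCKET_IFNAME", "ib0")
```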
When I run with 2 nodes and 2 GPUs per node, the job seems stuck: no error, but also no progress. The same code works fine on other HPC machines, also across multiple nodes.
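To rule out detectron2 itself, this is the kind of minimal PyTorch NCCL check I would run across the two nodes (a sketch; the launcher command, endpoint and file name are placeholders).

```python
# Minimal multi-node NCCL check, independent of detectron2. Run one launcher
# per node, e.g. (placeholder endpoint and file name):
#   torchrun --nnodes=2 --nproc_per_node=2 --rdzv_backend=c10d \
#            --rdzv_endpoint=<master-node>:29500 nccl_check.py
import os

import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")              # reads RANK/WORLD_SIZE/MASTER_* from the env
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))  # set by torchrun
    torch.cuda.set_device(local_rank)

    # Each rank contributes its global rank; after all_reduce every rank
    # should hold sum(range(world_size)).
    x = torch.tensor([float(dist.get_rank())], device="cuda")
    dist.all_reduce(x)
    print(f"rank {dist.get_rank()}/{dist.get_world_size()} -> {x.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```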
I have already run nccl-tests and they pass without problems. Any suggestions? Thanks.