I'm using detectron2 on an HPC machine with 4 GPUs per node. Training on a single node works fine, but when I try to launch on multiple nodes, for example 2 nodes with 8 GPUs in total (4 GPUs per node), I get:
RuntimeError: NCCL error in: ***/spack-src/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
From what I found in the PyTorch source code, that line raises the following error:
"Tensor list mustn't be larger than the number of available GPUs"
The following is the NCCL debug output:
```
lrdn1629:236813:236813 [1] NCCL INFO cudaDriverVersion 11080
lrdn1629:236813:236813 [1] NCCL INFO Bootstrap : Using ib0:10.128.31.149<0>
lrdn1629:236813:236813 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
lrdn1629:236813:236852 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [RO]; OOB ib0:10.128.31.149<0>
lrdn1629:236813:236852 [1] NCCL INFO Using network IB
lrdn1629:236813:236852 [1] NCCL INFO Call to connect returned Connection refused, retrying
lrdn1629:236813:236852 [1] NCCL INFO Call to connect returned Connection refused, retrying
lrdn1629:236813:236852 [1] NCCL INFO Call to connect returned Connection refused, retrying
....
lrdn1629:236813:236852 [1] NCCL INFO Call to connect returned Connection refused, retrying
lrdn1629:236813:236852 [1] misc/socket.cc:456 NCCL WARN Net : Connect to 10.128.31.153<46829> failed : Connection refused
lrdn1629:236813:236852 [1] NCCL INFO bootstrap.cc:256 -> 6
lrdn1629:236813:236852 [1] NCCL INFO init.cc:516 -> 6
lrdn1629:236813:236852 [1] NCCL INFO init.cc:1089 -> 6
lrdn1629:236813:236852 [1] NCCL INFO group.cc:64 -> 6 [Async thread]
lrdn1629:236813:236813 [1] NCCL INFO group.cc:421 -> 3
lrdn1629:236813:236813 [1] NCCL INFO group.cc:106 -> 3
lrdn1629:236813:236813 [1] NCCL INFO comm 0x3843eb10 rank 1 nranks 8 cudaDev 1 busId 56000 - Abort COMPLETE
```
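For completeness, this is the NCCL environment I'm assuming when the log above was captured; `NCCL_DEBUG=INFO` is the standard way to get those lines, and the commented-out interface pinning is only something I'd consider trying, not something already verified to help.

```python
# Assumed NCCL debug settings; these must be set before the NCCL communicator
# is created (i.e. before init_process_group / detectron2's launch()).
import os

os.environ.setdefault("NCCL_DEBUG", "INFO")             # produces the "NCCL INFO ..." lines above
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,NET")  # extra detail on bootstrap and network setup
# Something I would consider trying, since bootstrap auto-selected ib0:
# os.environ.setdefault("NCCL_SOCKET_IFNAME", "ib0")
```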
When I run with 2 nodes and 2 GPUs per node, the job seems stuck: no error, but also no progress. The same code works fine on other HPC machines, also across multiple nodes.
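To rule out detectron2 itself, this is the kind of minimal PyTorch NCCL check I would run across the two nodes (a sketch; the launcher command, endpoint and file name are placeholders).

```python
# Minimal multi-node NCCL check, independent of detectron2. Run one launcher
# per node, e.g. (placeholder endpoint and file name):
#   torchrun --nnodes=2 --nproc_per_node=2 --rdzv_backend=c10d \
#            --rdzv_endpoint=<master-node>:29500 nccl_check.py
import os

import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")              # reads RANK/WORLD_SIZE/MASTER_* from the env
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))  # set by torchrun
    torch.cuda.set_device(local_rank)

    # Each rank contributes its global rank; after all_reduce every rank
    # should hold sum(range(world_size)).
    x = torch.tensor([float(dist.get_rank())], device="cuda")
    dist.all_reduce(x)
    print(f"rank {dist.get_rank()}/{dist.get_world_size()} -> {x.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```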
I have already run nccl-tests and they pass without problems. Any suggestions? Thanks.