[QST] Training model on custom data gets stuck near the end #726

@Satwato

❓ Questions & Help

I am trying to train transformers4rec on my own data; training gets stuck near the end and then times out. I am running on 4 Tesla T4 GPUs. The code is essentially the same as in the examples, only the data has changed.
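
For reference, the training code follows the library's session-based examples roughly as in the sketch below, launched with torchrun across the 4 GPUs. This is only an approximation of my script; the data paths, model sizes and output directory are placeholders:

```python
# Launched with: torchrun --nproc_per_node=4 all_feat_training_multi_row_part.py
# Rough sketch only -- paths, sizes and the output directory are placeholders.
import transformers4rec.torch as tr
from transformers4rec.config.trainer import T4RecTrainingArguments
from merlin.io import Dataset

train = Dataset("data/train/*.parquet")  # placeholder path to the custom data
schema = train.schema

# Sequence input block and transformer body, as in the session-based examples
inputs = tr.TabularSequenceFeatures.from_schema(
    schema,
    max_sequence_length=20,
    masking="mlm",
    d_output=64,
)
model = tr.XLNetConfig.build(
    d_model=64, n_head=4, n_layer=2, total_seq_length=20
).to_torch_model(inputs, tr.NextItemPredictionTask(weight_tying=True))

training_args = T4RecTrainingArguments(
    output_dir="./checkpoints",          # placeholder
    num_train_epochs=3,
    per_device_train_batch_size=256,
    max_sequence_length=20,
)
trainer = tr.Trainer(
    model=model,
    args=training_args,
    schema=schema,
    compute_metrics=True,
)
trainer.train_dataset_or_path = train
trainer.train()
```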

Details

I am facing the following issue:

[E ProcessGroupNCCL.cpp:828] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1803, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1807780 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1803, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1807788 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1803, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1807825 milliseconds before timing out.
finished
ip-:462258:462571 [0] NCCL INFO comm 0x6e95d6c0 rank 2 nranks 4 cudaDev 2 busId 1d0 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
ip-:462256:462574 [0] NCCL INFO comm 0x6eb6e2d0 rank 0 nranks 4 cudaDev 0 busId 1b0 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
ip-:462259:462577 [0] NCCL INFO comm 0x6f74aa40 rank 3 nranks 4 cudaDev 3 busId 1e0 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
ip-:462257:462568 [0] NCCL INFO comm 0x6c1f0e50 rank 1 nranks 4 cudaDev 1 busId 1c0 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 462256) of binary: /home/ubuntu/miniconda3/envs/merlin_env_2/bin/python
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/merlin_env_2/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=========================================================
all_feat_training_multi_row_part.py FAILED
---------------------------------------------------------
Failures:
[1]:
  time      : 2023-06-26_13:50:18
  host      : ip.ap-south-1.compute.internal
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 462257)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 462257
[2]:
  time      : 2023-06-26_13:50:18
  host      : ip.ap-south-1.compute.internal
  rank      : 2 (local_rank: 2)
  exitcode  : -6 (pid: 462258)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 462258
[3]:
  time      : 2023-06-26_13:50:18
  host      : ip-ap-south-1.compute.internal
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 462259)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 462259
---------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-06-26_13:50:18
  host      : ip-.ap-south-1.compute.internal
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 462256)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 462256
=========================================================
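
If I read the log correctly, Timeout(ms)=1800000 is just PyTorch's default 30-minute NCCL collective timeout, and at SeqNum 1803 rank 2 is waiting in an ALLGATHER while ranks 0 and 3 are in an ALLREDUCE, so the ranks no longer seem to be executing the same collective sequence. One thing I can try is raising the timeout through the training arguments; a rough sketch, assuming the installed transformers version exposes `ddp_timeout` (values below are placeholders):

```python
# Hedged sketch: raise the collective timeout above the 30-minute default.
# `ddp_timeout` (seconds) is a transformers.TrainingArguments option in recent
# releases, inherited by T4RecTrainingArguments; it is forwarded to
# torch.distributed.init_process_group. The 2-hour value is arbitrary.
from transformers4rec.config.trainer import T4RecTrainingArguments

training_args = T4RecTrainingArguments(
    output_dir="./checkpoints",  # placeholder; other args as in the sketch above
    max_sequence_length=20,
    ddp_timeout=7200,            # default is 1800 s == the 1800000 ms in the log
)
```

A longer timeout would only delay the abort if one rank has genuinely stopped reaching the collectives, so I see it more as a diagnostic knob than a fix.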


Labels: P1, bug
