When launching a multi-process training run with Accelerate and the MeshAutoencoderTrainer, either via MeshAutoencoderTrainer.forward or @MarcusLoppe's MeshAutoencoderTrainer.train method, training runs into a deadlock that ends in an NCCL timeout.
It happens with both LFQ and regular residual VQ. Poking around a bit, I narrowed the halt down to the LFQ quantization step: some processes get stuck there while the others wait following the backward pass. However, this could be a symptom of something else and not necessarily something going wrong in vector-quantize-pytorch.
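To see which rank stalls, one thing that helped was hooking the quantizer module and logging per-rank entry/exit around its forward pass. This is only a debugging sketch: the attribute path to the LFQ/RVQ module (written here as autoencoder.quantizer) is an assumption and may be named differently inside meshgpt-pytorch.

```python
import torch.distributed as dist

def tag(msg):
    # Prefix log lines with the distributed rank so a hang is attributable to a rank.
    rank = dist.get_rank() if dist.is_initialized() else 0
    print(f"[rank {rank}] {msg}", flush=True)

def add_probe(quantizer_module):
    # Log when each rank enters and leaves the quantization forward pass.
    quantizer_module.register_forward_pre_hook(
        lambda module, args: tag("entering quantizer forward")
    )
    quantizer_module.register_forward_hook(
        lambda module, args, output: tag("left quantizer forward")
    )

# Usage (attribute name is a guess -- point it at wherever the LFQ/RVQ module lives):
# add_probe(autoencoder.quantizer)
```

With this in place, the rank that never prints "left quantizer forward" is the one the others are waiting on.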
Timeout Trace
Usually the stack trace goes something like the following:
[rank0]:[E130 08:46:45.877106780 ProcessGroupNCCL.cpp:1484] [PG ID 0 PG GUID 0(default_pg) Rank 0] ProcessGroupNCCL's watchdog got stuck for 480 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API (e.g., CudaEventDestroy) hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api (for example, CudaEventDestroy), or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang.
[rank3]:[E130 08:46:45.883267972 ProcessGroupNCCL.cpp:1484] [PG ID 0 PG GUID 0(default_pg) Rank 3] ProcessGroupNCCL's watchdog got stuck for 480 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API (e.g., CudaEventDestroy) hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api (for example, CudaEventDestroy), or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang.
[rank1]:[E130 08:46:45.923293230 ProcessGroupNCCL.cpp:1484] [PG ID 0 PG GUID 0(default_pg) Rank 1] ProcessGroupNCCL's watchdog got stuck for 480 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API (e.g., CudaEventDestroy) hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api (for example, CudaEventDestroy), or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang.
[rank2]:[E130 08:46:45.981240046 ProcessGroupNCCL.cpp:1484] [PG ID 0 PG GUID 0(default_pg) Rank 2] ProcessGroupNCCL's watchdog got stuck for 480 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API (e.g., CudaEventDestroy) hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api (for example, CudaEventDestroy), or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang.
[rank0]:[F130 08:54:45.884409136 ProcessGroupNCCL.cpp:1306] [PG ID 0 PG GUID 0(default_pg) Rank 0] [PG ID 0 PG GUID 0(default_pg) Rank 0] Terminating the process after attempting to dump debug info, due to ProcessGroupNCCL watchdog hang.
[rank3]:[F130 08:54:45.896245575 ProcessGroupNCCL.cpp:1306] [PG ID 0 PG GUID 0(default_pg) Rank 3] [PG ID 0 PG GUID 0(default_pg) Rank 3] Terminating the process after attempting to dump debug info, due to ProcessGroupNCCL watchdog hang.
[rank1]:[F130 08:54:45.935828500 ProcessGroupNCCL.cpp:1306] [PG ID 0 PG GUID 0(default_pg) Rank 1] [PG ID 0 PG GUID 0(default_pg) Rank 1] Terminating the process after attempting to dump debug info, due to ProcessGroupNCCL watchdog hang.
[rank2]:[F130 08:54:45.988196955 ProcessGroupNCCL.cpp:1306] [PG ID 0 PG GUID 0(default_pg) Rank 2] [PG ID 0 PG GUID 0(default_pg) Rank 2] Terminating the process after attempting to dump debug info, due to ProcessGroupNCCL watchdog hang.
W0130 08:54:45.493000 133885 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 133941 closing signal SIGTERM
W0130 08:54:45.494000 133885 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 133942 closing signal SIGTERM
W0130 08:54:45.495000 133885 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 133943 closing signal SIGTERM
E0130 08:54:45.874000 133885 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 3 (pid: 133944) of binary: .../.venv/bin/python3
Traceback (most recent call last):
File ".../.venv/bin/accelerate", line 10, in <module>
sys.exit(main())
^^^^^^
File ".../.venv/lib/python3.12/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File ".../.venv/lib/python3.12/site-packages/accelerate/commands/launch.py", line 1163, in launch_command
multi_gpu_launcher(args)
File ".../.venv/lib/python3.12/site-packages/accelerate/commands/launch.py", line 792, in multi_gpu_launcher
distrib_run.run(args)
File ".../.venv/lib/python3.12/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File ".../.venv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../.venv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : ...
host : ...
rank : 3 (local_rank: 3)
exitcode : -6 (pid: 133944)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 133944
============================================================
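As the watchdog message itself suggests, the timeout can be raised or the heartbeat monitor disabled while debugging. That does not fix the deadlock, it just buys time to attach a debugger or grab per-rank stack dumps before the processes are killed. A sketch, assuming these are set before the Accelerator / process group is initialized; the values are placeholders, and NCCL_DEBUG=INFO is only added for extra diagnostics:

```python
import os

# Placeholder values -- raise the watchdog heartbeat timeout and turn on NCCL logging
# while investigating; must be set before Accelerator() / init_process_group runs.
os.environ["TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC"] = "1800"  # log above shows the default 480s firing
os.environ["NCCL_DEBUG"] = "INFO"
# Alternatively, disable the heartbeat monitor entirely while debugging:
# os.environ["TORCH_NCCL_ENABLE_MONITORING"] = "0"
```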