Multi GPU training hangs #448
Comments
Can you attach the log file, please?
` ` This is the log file. Please note, this log was generated running without docker, but the problem is the same with docker. It just gets stuck there. I can't even kill the process without restarting the PC.
Maybe there is a mismatch between the CUDA version/driver and the TF container.
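For reference, a quick way to check for such a mismatch from inside the container (my own sketch, not from this thread) is to ask TensorFlow whether it was built with CUDA and which GPUs it can actually initialize:

```python
# Sanity check for CUDA/driver vs. TF container mismatch (illustrative sketch).
import tensorflow as tf
from tensorflow.python.client import device_lib

print("TF version:", tf.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())

# Lists the CPU and every GPU TF can initialize; a driver/CUDA mismatch
# usually shows up here as missing GPUs or an initialization error.
for dev in device_lib.list_local_devices():
    print(dev.name, dev.physical_device_desc)
```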
I also tried without using the docker container. Anyway, I'll try the tensorflow:19.05-py3 image.
Thanks, this looks like a bug. I will check with our TF team for a possible reason and solution.
Can you check whether you can successfully run these NCCL tests on that machine? https://github.com/nvidia/nccl-tests
I tried to run nccl-tests, but the test hangs the same way OpenSeq2Seq hangs: all GPUs show 100% usage constantly, but nothing progresses. I'll post the result. Thanks.
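For anyone trying to narrow this down further: a minimal Horovod all-reduce script (a sketch of mine, not part of OpenSeq2Seq; the file name and launch command are illustrative) exercises the same NCCL path without the rest of the training pipeline, so if it also hangs the problem is below TensorFlow:

```python
# Minimal NCCL sanity check via Horovod + TF 1.x (illustrative sketch).
# Launch one process per GPU, e.g.: mpirun -np 3 python nccl_check.py
import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()

# Pin each process to a single GPU, as Horovod expects.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# One tiny all-reduce; on a multi-GPU machine this goes through NCCL.
summed = hvd.allreduce(tf.constant(1.0), average=False)

with tf.Session(config=config) as sess:
    print("rank %d: allreduce result = %s" % (hvd.rank(), sess.run(summed)))
```

With three processes, each rank should print 3.0; a hang here points at NCCL or the GPU topology rather than OpenSeq2Seq.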
I am trying to run tacotron-gst on a single GPU, but it hangs at the same spot; it does not get past the line "Successfully opened dynamic library libcublas.so.10.0". Was this issue resolved? I am running it on Colaboratory.
Since this is not related to multi-GPU, can you open a new issue, "Tacotron hangs on single GPU", please? Please attach the following:
Was this problem ever resolved? I am facing the same issue as @lorinczb |
I have the same issue. Any new ideas?
Facing a similar issue with tacotron-GST. Any idea how to resolve it?
When I try to train DeepSpeech2 using the example configs on 3 GPUs, training hangs indefinitely, but single-GPU training works fine with the same config file. I also tried using Horovod; same problem.
I'm using the nvcr.io/nvidia/tensorflow:18.12-py3 docker image.
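For context, the only knob that differs between the working and hanging runs is the GPU count in the config. Below is a sketch of the relevant fragment of an OpenSeq2Seq-style config; the key names follow the shipped example configs, and the exact values are placeholders:

```python
# Sketch of the relevant part of an OpenSeq2Seq-style config; only num_gpus
# differs between the single-GPU run (works) and the 3-GPU run (hangs).
base_params = {
    "use_horovod": False,     # set True when launching via mpirun/mpiexec
    "num_gpus": 3,            # 1 works per this report; 3 hangs
    "batch_size_per_gpu": 16,
    # ... remaining model/optimizer params unchanged from the example config
}
```

If I read the example configs correctly, when use_horovod is True the number of processes passed to mpirun determines the GPU count instead.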