Multi GPU training hangs #448
Comments
Can you attach the log file, please?
` ` This is the log file. Please note, this log was generated running without docker, but the problem is the same with docker. It just gets stuck there. I can't even kill the process without restarting the PC.
Maybe there is a mismatch between the CUDA version/driver and the TF container.
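For reference, a quick way to check for such a mismatch from inside the container (my own sketch, not from this thread) is to ask TensorFlow whether it was built with CUDA and which GPUs it can actually initialize:

```python
# Sanity check for CUDA/driver vs. TF container mismatch (illustrative sketch).
import tensorflow as tf
from tensorflow.python.client import device_lib

print("TF version:", tf.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())

# Lists the CPU and every GPU TF can initialize; a driver/CUDA mismatch
# usually shows up here as missing GPUs or an initialization error.
for dev in device_lib.list_local_devices():
    print(dev.name, dev.physical_device_desc)
```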
I also tried without using the docker container. Anyway, I'll try the tensorflow:19.05-py3 image.
Thanks, this looks like a bug. I will check with our TF team for a possible reason and solution.
Can you check whether you can successfully run these NCCL tests on that machine? https://github.com/nvidia/nccl-tests
I tried to run nccl-tests, but the test hangs the same way OpenSeq2Seq hangs: all GPUs show 100% usage constantly, but nothing progresses. I'll post the result. Thanks.
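For anyone trying to narrow this down further: a minimal Horovod all-reduce script (a sketch of mine, not part of OpenSeq2Seq; the file name and launch command are illustrative) exercises the same NCCL path without the rest of the training pipeline, so if it also hangs the problem is below TensorFlow:

```python
# Minimal NCCL sanity check via Horovod + TF 1.x (illustrative sketch).
# Launch one process per GPU, e.g.: mpirun -np 3 python nccl_check.py
import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()

# Pin each process to a single GPU, as Horovod expects.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# One tiny all-reduce; on a multi-GPU machine this goes through NCCL.
summed = hvd.allreduce(tf.constant(1.0), average=False)

with tf.Session(config=config) as sess:
    print("rank %d: allreduce result = %s" % (hvd.rank(), sess.run(summed)))
```

With three processes, each rank should print 3.0; a hang here points at NCCL or the GPU topology rather than OpenSeq2Seq.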
I am trying to run tacotron-gst on a single GPU, but it hangs at the same spot; it does not get past the line "Successfully opened dynamic library libcublas.so.10.0". Was this issue resolved? I am running it on Colaboratory.
Since this is not related to multi-GPU, can you open a new issue, "Tacotron hangs on single GPU", please? Please attach the following:
Was this problem ever resolved? I am facing the same issue as @lorinczb |
I have the same issue. Any new ideas?
Facing a similar issue with tacotron-GST. Any idea how to resolve it?
When I try to train DeepSpeech2 using the example configs on 3 GPUs, training hangs indefinitely, but single-GPU training works fine with the same config file. I also tried using Horovod; same problem.
I'm using the nvcr.io/nvidia/tensorflow:18.12-py3 docker image.
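For context, the only knob that differs between the working and hanging runs is the GPU count in the config. Below is a sketch of the relevant fragment of an OpenSeq2Seq-style config; the key names follow the shipped example configs, and the exact values are placeholders:

```python
# Sketch of the relevant part of an OpenSeq2Seq-style config; only num_gpus
# differs between the single-GPU run (works) and the 3-GPU run (hangs).
base_params = {
    "use_horovod": False,     # set True when launching via mpirun/mpiexec
    "num_gpus": 3,            # 1 works per this report; 3 hangs
    "batch_size_per_gpu": 16,
    # ... remaining model/optimizer params unchanged from the example config
}
```

If I read the example configs correctly, when use_horovod is True the number of processes passed to mpirun determines the GPU count instead.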