Optimizing Citrinet model fine-tuning in a multi-GPU environment #3078
Unanswered
francescodaq asked this question in Q&A
Replies: 1 comment
-
The rank 0 GPU requiring slightly more memory is expected under DDP, since it has some overhead for communicating with the other ranks over the shared hardware/network connection. However, 1.3 GB is a bit too much; it's usually around 100-200 MB for us. Make sure that other processes, such as the graphics driver, are not using the rank 0 GPU. This has some info, but I don't think there's something like a best-practices guide: https://pytorch-lightning.readthedocs.io/en/stable/advanced/multi_gpu.html
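For example, a minimal sketch (assuming nvidia-smi is available on the PATH) that lists per-GPU memory usage and the compute processes currently holding GPU memory, so you can see whether something other than the training job sits on GPU 0:

import subprocess

def show_gpu_usage():
    # Per-GPU memory summary (index, name, used, total).
    print(subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,name,memory.used,memory.total",
         "--format=csv,noheader"], text=True))
    # Compute processes currently holding GPU memory; another training job
    # would show up here. A display server appears in the full nvidia-smi output.
    print(subprocess.check_output(
        ["nvidia-smi", "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv,noheader"], text=True))

if __name__ == "__main__":
    show_gpu_usage()

Running this before launching training makes it easy to spot memory that is already allocated on the rank 0 GPU by unrelated processes.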
-
Hello support,
we are attempting to move from a single-GPU to a multi-GPU training environment.
The subject of training is the fine-tuning of a Citrinet-1024 model for speech recognition. We executed a first fine-tuning session on a single-GPU machine (a single V100 with 16 GB of memory); now we are moving to a new machine with 4 GPUs (4 T4 with 16 GB of memory each).
The first training session featured a batch_size of 16 and a learning rate of 0.025.
The script we prepared for multi-GPU fine-tuning performs the following tasks (a rough sketch of such a setup is shown after the list):
- adjusts the learning rate (due to the increment of the GPU number)
- configures the Trainer object
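A minimal sketch of that kind of script is below; the model name, manifest path, and max_epochs are placeholders, the linear learning-rate scaling is one common heuristic rather than a fixed rule, and the Trainer argument names depend on the PyTorch Lightning version (older releases use gpus=4, accelerator="ddp" instead of devices/strategy):

import copy
import pytorch_lightning as pl
import nemo.collections.asr as nemo_asr

NUM_GPUS = 4
BASE_LR = 0.025          # learning rate used in the single-GPU run
PER_GPU_BATCH_SIZE = 16  # effective batch size = 16 * 4 = 64

# Load the pretrained Citrinet-1024 checkpoint published on NGC.
model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained("stt_en_citrinet_1024")

# One process per GPU via DDP.
trainer = pl.Trainer(devices=NUM_GPUS, accelerator="gpu", strategy="ddp", max_epochs=50)
model.set_trainer(trainer)

# Point the model at the fine-tuning data; "train_manifest.json" is a placeholder path.
train_cfg = copy.deepcopy(model.cfg.train_ds)
train_cfg.manifest_filepath = "train_manifest.json"
train_cfg.batch_size = PER_GPU_BATCH_SIZE
model.setup_training_data(train_cfg)

# Scale the learning rate linearly with the number of GPUs.
optim_cfg = copy.deepcopy(model.cfg.optim)
optim_cfg.lr = BASE_LR * NUM_GPUS
model.setup_optimization(optim_cfg)

trainer.fit(model)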
In order to benefit from the increased hardware capacity we intended to keep the per-GPU batch size at 16, thus obtaining an effective batch size of 64, but we get an OOM error. We attempted decreasing the per-GPU batch size; the greatest value avoiding the OOM error is 12.
Observing the output of the nvidia-smi command while training is running, we see that GPU0 has more memory allocated than the other 3, so maybe it is bottlenecking the others, causing the OOM.
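To check where the extra memory on GPU0 comes from, one option is to log PyTorch's own allocation counters from each rank inside the training script; a minimal sketch (the call site, e.g. at the end of a validation epoch, is up to you):

import torch
import torch.distributed as dist

def log_rank_memory():
    # If rank 0 reports much more than the other ranks, the imbalance comes
    # from the training process itself; otherwise another process on GPU 0
    # is consuming the memory that nvidia-smi shows.
    if not torch.cuda.is_available():
        return
    rank = dist.get_rank() if dist.is_initialized() else 0
    device = torch.cuda.current_device()
    allocated = torch.cuda.memory_allocated(device) / 2**20   # MiB
    peak = torch.cuda.max_memory_allocated(device) / 2**20    # MiB
    print(f"rank {rank} (cuda:{device}): {allocated:.0f} MiB allocated, "
          f"{peak:.0f} MiB peak")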
Is that correct?
Are we doing something wrong? Is there a way to distribute the load equally across all GPUs in order to maximize the benefit?
Do you have a tutorial/notebook or an article focusing on best practices for multi-GPU training?
Thank you!
Francesco