-
Hi, I've tried to train a Conformer model from scratch and to fine-tune using the pretrained models from NGC, and I seem to be running into NaN losses. Especially for fine-tuning, the loss suddenly becomes NaN within the first 2-20 iterations. Using the same data, training a medium Conformer has worked for me, but not on the first try; I initially encountered NaN losses there as well. I tried looking at the intermediate outputs, and the input embeddings all look normal, but at some point the encoder output on one GPU starts to become NaN. So from my understanding it doesn't look like faulty input, because in the case of training from scratch it happens after a few epochs. It doesn't look like an exploding gradient either: the loss doesn't gradually diverge, it fails suddenly. I wonder if it has to do with the learning rate or the learning rate warmup, because when it fails at the very beginning the LR is still really small. BTW, I'm training on 4 A100s with a private dataset. Has anyone encountered this before? Can anyone help?
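For context, this is roughly how I checked the intermediate outputs. A minimal sketch in plain PyTorch, not NeMo-specific; `register_nan_hooks` is just an illustrative helper name:

```python
import torch

def register_nan_hooks(model: torch.nn.Module):
    # Attach a forward hook to every submodule so we can see which
    # module is the first to produce NaN/inf in its output.
    def make_hook(name):
        def hook(module, inputs, output):
            outs = output if isinstance(output, (tuple, list)) else (output,)
            for t in outs:
                if torch.is_tensor(t) and not torch.isfinite(t).all():
                    print(f"non-finite output in {name} ({type(module).__name__})")
        return hook

    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))
```

With the hooks attached, a single forward pass on a failing batch is enough to print the first submodule (e.g. a self-attention or conv block inside the encoder) whose output goes non-finite.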
-
Is it Conformer-CTC or Conformer-Transducer?
We have not fully tested Conformer with mixed precision training. Sometimes the loss explodes with mixed precision, especially with low weight decay. Does it also happen with fp32?
What weight decay and warmup are you using?
What is your lr when that happens?
Have you tried gradient clipping?
By "2-20 iterations" do you mean epochs?