-
RNNT always uses quite a bit of memory; our large models take around 32 GB to train on V100s at BS 8 with fp32. FP16 lets you increase the batch size slightly. I've never hit a CUDA kernel timeout, so I don't know how to fix that; it may be a one-off thing. As for the actual question, the first thing to ask is: are you using cuDNN autotuning? If so, disable it. It uses more memory for ASR models and can cause inconsistency between runs if a different kernel gets picked.
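If it's not obvious where that lives, here is a minimal sketch in plain PyTorch; exactly where you set it depends on your training script, so treat the placement as an assumption:

```python
import torch

# cuDNN autotuning ("benchmark" mode) profiles several kernels per input shape
# and caches the fastest one. With the variable sequence lengths typical of ASR
# batches this costs extra memory and can pick different kernels across runs.
torch.backends.cudnn.benchmark = False

# Optionally also force deterministic cuDNN algorithms (slower, but reproducible).
torch.backends.cudnn.deterministic = True
```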
-
@VahidooX if you have some ideas, can you suggest something?
-
Thanks for the info. Makes sense that RNNT would be expensive. As for cuDNN autotuning, is torch.backends.cudnn.benchmark the setting you're referring to? If so, this:
[code snippet]
gives me:
[output]
I'm also getting NaN loss in these runs after a while, but the config documentation has some things to try tweaking for that. I think I also read that FP16 (which I'm running) can cause this?
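In case it helps, a minimal sketch of checking the flag and watching for the NaNs (plain PyTorch, nothing NeMo-specific; loss_is_finite is just an illustrative helper):

```python
import torch

# Print the flag after the trainer/model is built, since some frameworks or
# scripts may toggle it themselves.
print("cudnn.benchmark =", torch.backends.cudnn.benchmark)

# Debug-only option for tracking down where NaNs first appear (very slow):
# torch.autograd.set_detect_anomaly(True)

# Generic guard: check that the loss is finite before logging/continuing a step.
def loss_is_finite(loss: torch.Tensor) -> bool:
    return bool(torch.isfinite(loss).all())
```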
-
I was trying out training with conformer_transducer_bpe_streaming.yaml on LibriSpeech. I'm running the default settings except with batch size 8 (setting accumulate_grad_batches to 2 to compensate) and fused_batch_size 4 to keep it from OOMing (it seems odd that that would be necessary on a 24 GB card, but that's beside the point).
My first run failed after about 6k steps with a CUDA kernel timeout error (which is probably another bug, but that's also beside the point). It was hitting about 2 it/s (1 it/s in terms of the effective batch). My second run has been consistently around 1.24 it/s (0.62 it/s effective). I changed nothing the second time around. Does anyone know why I'd see a sudden drop in performance between runs with no changes to the config, no other programs running, nothing?
The loss curves look the same between the two runs (ignore the exploding loss in the second run; that's a separate issue, I guess), but the training time is distinctly different.
The hyperparameters and the load on the machine really are exactly the same between the two runs.
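For what it's worth, this is the kind of per-step timing I could add to compare the runs more precisely than the progress-bar it/s (a rough, framework-agnostic sketch; step_fn is just a stand-in for wherever the forward/backward/optimizer call lives):

```python
import time
import torch

# Time one training step with CUDA synchronization so async kernel launches
# don't hide the real cost; compare the resulting per-step times between runs.
def timed_step(step_fn):
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = step_fn()
    torch.cuda.synchronize()
    return out, time.perf_counter() - start
```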