Hello guys,

My experiment: in the first run I trained the model for 10 epochs, let it finish, then used exp_manager's resume to continue it for a further 10 epochs. In the second run I made the exact same experiment but left the model to train for 20 epochs without any pausing/resuming. For the first 10 epochs the loss dropped from 458 to 95 (pretty much the same as in the first experiment). Over the second 10 epochs the loss dropped from 95 to 82, and this is where the two runs differ. I tried re-setting the datasets and changing the optimizer parameters (see the commented-out lines in the code below), but nothing fixed the problem.
My code:

import nemo.collections.asr as nemo_asr
import pytorch_lightning as pl
from pytorch_lightning.strategies import *
from omegaconf import OmegaConf
from nemo.utils import exp_manager
from os.path import join
import copy
##########################################################
NUM_EPOCHS = 20
MAIN_DIR = "xxxxx/xxxxx/xxxxxx"
##########################################################
config = OmegaConf.load(join(MAIN_DIR, "conf/citrinet_512.yaml"))
config.model.tokenizer.dir = join(MAIN_DIR, "tokenizers/tokenizer_spe_bpe_v512")
config.model.tokenizer.type = "bpe"
# AIC + Train_Data
config.model.train_ds.manifest_filepath = join(MAIN_DIR, "manifests/ready_manifests/train-16secs.json")
config.model.validation_ds.manifest_filepath = join(MAIN_DIR, "manifests/ready_manifests/validation.json")
config.model.test_ds.manifest_filepath = join(MAIN_DIR, "manifests/ready_manifests/test.json")
trainer = pl.Trainer(
    devices=4,  # number of GPUs
    max_epochs=NUM_EPOCHS,
    max_steps=-1,  # computed at runtime if not set
    num_nodes=1,
    accelerator='gpu',
    strategy='ddp',
    accumulate_grad_batches=1,
    enable_checkpointing=False,  # provided by exp_manager
    logger=False,  # provided by exp_manager
    log_every_n_steps=100,  # interval of logging
    val_check_interval=1.0,  # set to 0.25 to check 4 times per epoch, or an int for number of iterations
    check_val_every_n_epoch=1,
    precision=32,
    sync_batchnorm=False,
    benchmark=False,  # needs to be False for models with variable-length speech input, as it slows down training
)
# model = nemo_asr.models.EncDecCTCModelBPE.restore_from("models/citrinet512_bpe512_ep10.nemo")
model = nemo_asr.models.EncDecCTCModelBPE(cfg=config.model)
# checkpoint_path = "/home/m.hossam/NeMo_projects/asr_tests_by7oda/wael_data_citri512_exp_3_node12/Citrinet-512-8x-Stride/checkpoints/Citrinet-512-8x-Stride--val_wer=0.7471-epoch=30-last.ckpt"
# model = nemo_asr.models.EncDecCTCModelBPE.load_from_checkpoint(checkpoint_path=checkpoint_path)
print(model.summarize())
# tried re-setting the datasets
# print("Setting train data...")
# model.setup_training_data(config.model.train_ds)
# print("Setting validation data...")
# model.setup_validation_data(config.model.validation_ds)
# print("Setting test data...")
# model.setup_test_data(config.model.test_ds)
# tried to change the optim parameters
# print("Setting optimization params...")
# new_optim = copy.deepcopy(config.model.optim)
# new_optim.lr = 0.09
# model.setup_optimization(new_optim)
# print("Set trainer")
# model.set_trainer(trainer)
print("Initialize Experiment Manager...")
config.exp_manager.exp_dir = "citrinet512bpe512_2"
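# resume_if_exists tells exp_manager to look for the last checkpoint under
# exp_dir and restore the full trainer state from it (optimizer and LR
# scheduler included), not just the model weights.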
config.exp_manager.resume_if_exists = True
config.exp_manager.resume_ignore_no_checkpoint = True
experiment_manager = exp_manager.exp_manager(trainer=trainer, cfg=config.exp_manager)
print("Training...")
trainer.fit(model=model)
# Save model
print("Saving model...")
model.save_to("models/citrinet512_bpe512_2_ep" + str(NUM_EPOCHS) + ".nemo")
Attached log: NeMo-exp-1-10ep-1.log
I.e. you set your model to train for 10 epochs, let that run finish, and then used exp_manager with it for a further 10 epochs? This will not work. Take a look at the LR of the second run: it should be close to 0 (or whatever min_lr you set). The resume functionality is meant to be used from the beginning of the training run and is only useful if you stop in the middle. It cannot be used to continue an already finished run unless you were super careful with the parameter and optimization settings (we don't encourage that kind of training). The second run worked because that's how exp_manager is supposed to be used: a fixed, large number of steps set at the beginning, and you run multiple chained runs until you finish it.
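For reference, here is a minimal sketch of that chained-run pattern, assuming the same Citrinet config and directory layout as in the question (the paths, experiment directory, and total epoch count below are placeholders, not the poster's actual values). The idea is to fix the full training length up front, keep exp_dir identical across runs, and simply re-run the same script until training finishes:

import nemo.collections.asr as nemo_asr
import pytorch_lightning as pl
from omegaconf import OmegaConf
from nemo.utils import exp_manager

TOTAL_EPOCHS = 20  # the full planned training length, fixed before the first run

config = OmegaConf.load("conf/citrinet_512.yaml")  # placeholder path
# tokenizer and manifest paths are assumed to be set as in the question's script

trainer = pl.Trainer(
    devices=4,
    accelerator='gpu',
    strategy='ddp',
    max_epochs=TOTAL_EPOCHS,     # the LR scheduler anneals over all 20 epochs
    enable_checkpointing=False,  # provided by exp_manager
    logger=False,                # provided by exp_manager
)

# Keep exp_dir (and the experiment name in the config) identical across runs so
# exp_manager can find the last checkpoint and resume mid-schedule.
config.exp_manager.exp_dir = "citrinet512bpe512"        # placeholder
config.exp_manager.resume_if_exists = True
config.exp_manager.resume_ignore_no_checkpoint = True   # first run has no checkpoint yet
exp_manager.exp_manager(trainer=trainer, cfg=config.exp_manager)

model = nemo_asr.models.EncDecCTCModelBPE(cfg=config.model, trainer=trainer)

# Run this same script once per job allocation; each invocation resumes from the
# last checkpoint until TOTAL_EPOCHS is reached.
trainer.fit(model)

With this setup the scheduler is defined over the whole 20 epochs, so a run that resumes after epoch 10 continues mid-schedule instead of restarting from a checkpoint whose LR has already annealed down to min_lr.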