Hello guys,

My experiment: in the first run I trained the model for 10 epochs, let it finish, then used exp_manager's resume to continue it for a further 10 epochs. In the second run I made the exact same experiment but left the model to train for 20 epochs without any pausing/resuming. For the first 10 epochs the loss dropped from 458 to 95 (pretty much the same as in the first experiment). Over the second 10 epochs the loss dropped from 95 to 82, and this is where the two runs differ. I tried re-setting the datasets and changing the optimizer parameters (see the commented-out lines in the code below), but nothing fixed the problem.
My code:

import nemo.collections.asr as nemo_asr
import pytorch_lightning as pl
from pytorch_lightning.strategies import *
from omegaconf import OmegaConf
from nemo.utils import exp_manager
from os.path import join
import copy
##########################################################
NUM_EPOCHS = 20
MAIN_DIR = "xxxxx/xxxxx/xxxxxx"
##########################################################
config = OmegaConf.load(join(MAIN_DIR, "conf/citrinet_512.yaml"))
config.model.tokenizer.dir = join(MAIN_DIR, "tokenizers/tokenizer_spe_bpe_v512")
config.model.tokenizer.type = "bpe"
# AIC + Train_Data
config.model.train_ds.manifest_filepath = join(MAIN_DIR, "manifests/ready_manifests/train-16secs.json")
config.model.validation_ds.manifest_filepath = join(MAIN_DIR, "manifests/ready_manifests/validation.json")
config.model.test_ds.manifest_filepath = join(MAIN_DIR, "manifests/ready_manifests/test.json")
trainer = pl.Trainer(
    devices=4,  # number of GPUs
    max_epochs=NUM_EPOCHS,
    max_steps=-1,  # computed at runtime if not set
    num_nodes=1,
    accelerator='gpu',
    strategy='ddp',
    accumulate_grad_batches=1,
    enable_checkpointing=False,  # provided by exp_manager
    logger=False,  # provided by exp_manager
    log_every_n_steps=100,  # interval of logging
    val_check_interval=1.0,  # set to 0.25 to check 4 times per epoch, or an int for number of iterations
    check_val_every_n_epoch=1,
    precision=32,
    sync_batchnorm=False,
    benchmark=False,  # needs to be False for models with variable-length speech input, as it slows down training
)
# model = nemo_asr.models.EncDecCTCModelBPE.restore_from("models/citrinet512_bpe512_ep10.nemo")
model = nemo_asr.models.EncDecCTCModelBPE(cfg=config.model)
# checkpoint_path = "/home/m.hossam/NeMo_projects/asr_tests_by7oda/wael_data_citri512_exp_3_node12/Citrinet-512-8x-Stride/checkpoints/Citrinet-512-8x-Stride--val_wer=0.7471-epoch=30-last.ckpt"
# model = nemo_asr.models.EncDecCTCModelBPE.load_from_checkpoint(checkpoint_path=checkpoint_path)
print(model.summarize())
# tried re-setting the datasets
# print("Setting train data...")
# model.setup_training_data(config.model.train_ds)
# print("Setting validation data...")
# model.setup_validation_data(config.model.validation_ds)
# print("Setting test data...")
# model.setup_test_data(config.model.test_ds)
# tried to change the optim parameters
# print("Setting optimization params...")
# new_optim = copy.deepcopy(config.model.optim)
# new_optim.lr = 0.09
# model.setup_optimization(new_optim)
# print("Set trainer")
# model.set_trainer(trainer)
print("Initialize Experiment Manager...")
config.exp_manager.exp_dir = "citrinet512bpe512_2"
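# resume_if_exists tells exp_manager to look for the last checkpoint under
# exp_dir and restore the full trainer state from it (optimizer and LR
# scheduler included), not just the model weights.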
config.exp_manager.resume_if_exists = True
config.exp_manager.resume_ignore_no_checkpoint = True
experiment_manager = exp_manager.exp_manager(trainer=trainer, cfg=config.exp_manager)
print("Training...")
trainer.fit(model=model)
# Save model
print("Saving model...")
model.save_to("models/citrinet512_bpe512_2_ep" + str(NUM_EPOCHS) + ".nemo")
Attached log: NeMo-exp-1-10ep-1.log
I.e. you set your model to train for 10 epochs, let that run finish, and then used exp_manager with it for a further 10 epochs? This will not work. Take a look at the LR of the second run: it should be close to 0 (or whatever min_lr you set). The resume functionality is meant to be used from the beginning of the training run and is only useful if you stop in the middle. It cannot be used to continue an already finished run unless you were super careful with the parameter and optimization settings (we don't encourage that kind of training). The second run worked because that's how exp_manager is supposed to be used: a fixed, large number of steps set at the beginning, and you run multiple chained runs until you finish it.
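For reference, here is a minimal sketch of that chained-run pattern, assuming the same Citrinet config and directory layout as in the question (the paths, experiment directory, and total epoch count below are placeholders, not the poster's actual values). The idea is to fix the full training length up front, keep exp_dir identical across runs, and simply re-run the same script until training finishes:

import nemo.collections.asr as nemo_asr
import pytorch_lightning as pl
from omegaconf import OmegaConf
from nemo.utils import exp_manager

TOTAL_EPOCHS = 20  # the full planned training length, fixed before the first run

config = OmegaConf.load("conf/citrinet_512.yaml")  # placeholder path
# tokenizer and manifest paths are assumed to be set as in the question's script

trainer = pl.Trainer(
    devices=4,
    accelerator='gpu',
    strategy='ddp',
    max_epochs=TOTAL_EPOCHS,     # the LR scheduler anneals over all 20 epochs
    enable_checkpointing=False,  # provided by exp_manager
    logger=False,                # provided by exp_manager
)

# Keep exp_dir (and the experiment name in the config) identical across runs so
# exp_manager can find the last checkpoint and resume mid-schedule.
config.exp_manager.exp_dir = "citrinet512bpe512"        # placeholder
config.exp_manager.resume_if_exists = True
config.exp_manager.resume_ignore_no_checkpoint = True   # first run has no checkpoint yet
exp_manager.exp_manager(trainer=trainer, cfg=config.exp_manager)

model = nemo_asr.models.EncDecCTCModelBPE(cfg=config.model, trainer=trainer)

# Run this same script once per job allocation; each invocation resumes from the
# last checkpoint until TOTAL_EPOCHS is reached.
trainer.fit(model)

With this setup the scheduler is defined over the whole 20 epochs, so a run that resumes after epoch 10 continues mid-schedule instead of restarting from a checkpoint whose LR has already annealed down to min_lr.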