
[Bug]  #127

@mushahid-intesum

Describe the bug

Resuming model training from a checkpoint in Coqui TTS fails with an error.
When a checkpoint is saved for later training, the last training and eval losses are stored as a dict. When training from scratch, the last training loss is stored as a float. As a result, after restarting from a checkpoint the best-model comparison receives a float and a dict, and training fails with a TypeError.
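
The type mismatch can be reproduced without the trainer at all. The following is a minimal sketch, assuming the restored value is a dict with the hypothetical keys `train_loss` and `eval_loss` (the exact keys stored in the checkpoint are not shown in this report):

```python
# Value produced during a fresh run: a plain float.
current_loss = 0.42

# Value restored from a checkpoint: a dict of losses.
# The key names are assumptions for illustration only.
best_loss = {"train_loss": 0.50, "eval_loss": 0.55}

error_message = None
try:
    # Same kind of comparison as the one in the traceback.
    is_better = current_loss < best_loss
except TypeError as err:
    error_message = str(err)
    print(error_message)
```

This prints the same message as the last line of the traceback in the logs.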

To Reproduce

  1. Train a model in Coqui TTS using the trainer
  2. Once a checkpoint for the best model is saved, stop the training
  3. Set the checkpoint folder as the continue path in the Trainer class
  4. Restart training from the checkpoint

https://colab.research.google.com/drive/1OwemROn306_JIYASjx39d52eXFHS1O_u

Expected behavior

Training should resume from the checkpoint without errors.

Logs

Traceback (most recent call last):
  File "/mnt/Work/anaconda3/envs/tts-env/lib/python3.10/site-packages/trainer/trainer.py", line 1808, in fit
    self._fit()
  File "/mnt/Work/anaconda3/envs/tts-env/lib/python3.10/site-packages/trainer/trainer.py", line 1771, in _fit
    self.save_best_model()
  File "/mnt/Work/anaconda3/envs/tts-env/lib/python3.10/site-packages/trainer/utils/distributed.py", line 35, in wrapped_fn
    return fn(*args, **kwargs)
  File "/mnt/Work/anaconda3/envs/tts-env/lib/python3.10/site-packages/trainer/trainer.py", line 1893, in save_best_model
    self.best_loss = save_best_model(
  File "/mnt/Work/anaconda3/envs/tts-env/lib/python3.10/site-packages/trainer/io.py", line 183, in save_best_model
    if current_loss < best_loss:
TypeError: '<' not supported between instances of 'float' and 'dict'
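
One possible workaround, until the library handles both forms, is to normalize each loss to a float before comparing. This is only a sketch and not the trainer's own code: `loss_as_float` and `is_improvement` are hypothetical helpers, and the `eval_loss` / `train_loss` keys are assumptions about the dict layout.

```python
from typing import Mapping, Union

Loss = Union[float, Mapping[str, float]]


def loss_as_float(loss: Loss, key: str = "eval_loss") -> float:
    """Return a comparable float from either a float or a dict of losses."""
    if isinstance(loss, Mapping):
        # Assumed keys: prefer the eval loss, fall back to the train loss.
        value = loss.get(key)
        if value is None:
            value = loss["train_loss"]
        return float(value)
    return float(loss)


def is_improvement(current_loss: Loss, best_loss: Loss) -> bool:
    """Compare two losses regardless of whether each is a float or a dict."""
    return loss_as_float(current_loss) < loss_as_float(best_loss)
```

With this, `is_improvement(0.4, {"train_loss": 0.5, "eval_loss": 0.55})` compares `0.4` against `0.55` instead of raising.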

Environment

- torch: 2.1.0
- trainer: 0.0.31
- python: 3.10
- OS: EndeavourOS
- cuda: cuda_12.2.r12.2
- GPU: NVIDIA RTX 3060
- pytorch installation: pip

Additional context

No response
