Training/Fine-tuning of a Swedish transducer stt model #5418

Oscaarjs · 2022-11-15T08:52:04Z

I'm in the process of using the NeMo framework to fine-tune a pretrained English model for Swedish.
I've created a dataset that has the following characteristics:

~400 hours of labeled data
max dur=20.0 sec, min dur=0.25 sec
Text "normalized" to contain only lower-case a-z and the Swedish characters "å", "ä", and "ö".
train split with ~314 000 files
dev split with ~3200 files
~89 000 unique words in the dataset
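
For anyone wanting to re-derive figures like these from a manifest, here is a minimal sketch. It assumes a NeMo-style manifest with one JSON object per line carrying `duration` and `text` keys; the helper name `summarize_manifest` and the three sample entries are made up for illustration and are not taken from my dataset.

```python
import json

# Lower-case a-z, the Swedish letters, and space, as described above.
ALLOWED_CHARS = set("abcdefghijklmnopqrstuvwxyzåäö ")


def summarize_manifest(lines, min_dur=0.25, max_dur=20.0):
    """Summarize JSON-lines manifest entries (hypothetical helper)."""
    total_sec = 0.0
    kept = 0
    words = set()
    bad_chars = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        dur = float(entry["duration"])
        if not (min_dur <= dur <= max_dur):
            continue  # outside the duration window, so not counted
        kept += 1
        total_sec += dur
        text = entry["text"]
        words.update(text.split())
        # Collect anything the normalization should have removed.
        bad_chars.update(ch for ch in text if ch not in ALLOWED_CHARS)
    return {
        "files": kept,
        "hours": total_sec / 3600.0,
        "unique_words": len(words),
        "unexpected_chars": sorted(bad_chars),
    }


# Illustrative entries only.
sample = [
    '{"audio_filepath": "a.wav", "duration": 3.0, "text": "hej på dig"}',
    '{"audio_filepath": "b.wav", "duration": 25.0, "text": "för lång"}',
    '{"audio_filepath": "c.wav", "duration": 1.5, "text": "Hej då!"}',
]
stats = summarize_manifest(sample)
print(stats)
```

A non-empty `unexpected_chars` list points at transcripts that slipped past the normalization step.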

The train and dev splits are tarred with the following options for convert_to_tarred_audio_dataset.py:

num_shards 1024 (train), 32 (dev)
max_duration 20.0
min_duration 0.25
shuffle
sort_in_shards

The tokenizer I currently use is an "SPE" tokenizer made with process_asr_text_tokenizer.py, using:

vocab_size=1024
tokenizer="spe"
spe_type="unigram"
spe_character_coverage=1.0

The idea is to use the pretrained STT En Conformer-Transducer Large model and fine-tune it for Swedish. The way I've set up the training so far is:

import pytorch_lightning as ptl

trainer = ptl.Trainer(devices=1,
                      accelerator='gpu',
                      max_epochs=300,
                      accumulate_grad_batches=4,
                      enable_checkpointing=False,
                      logger=False,
                      log_every_n_steps=250,
                      check_val_every_n_epoch=1)

I've taken the base config conformer_transducer_bpe.yaml from https://raw.githubusercontent.com/NVIDIA/NeMo/stable/examples/asr/conf/conformer/conformer_transducer_bpe.yaml


import copy
from omegaconf import OmegaConf, open_dict

base_cfg = OmegaConf.load('conformer_transducer_bpe.yaml')
cfg = copy.deepcopy(base_cfg.model)


with open_dict(cfg):
    
    cfg.train_ds.is_tarred=True
    cfg.train_ds.tarred_audio_filepaths=TRAIN_FILEPATHS
    cfg.train_ds.manifest_filepath=TRAIN_MANIFEST
    cfg.train_ds.labels = TOKENIZER
    cfg.train_ds.normalize_transcripts = False
    cfg.train_ds.batch_size = 8
    cfg.train_ds.num_workers = 2
    cfg.train_ds.pin_memory = True
    cfg.train_ds.trim_silence = True
    cfg.train_ds.sample_rate = 16000


    cfg.tokenizer.dir = TOKENIZER
    cfg.tokenizer.type = TOKENIZER_TYPE 
    
    cfg.validation_ds.is_tarred=True
    cfg.validation_ds.tarred_audio_filepaths=VAL_FILEPATHS
    cfg.validation_ds.manifest_filepath = VAL_MANIFEST
    cfg.validation_ds.labels = TOKENIZER
    cfg.validation_ds.normalize_transcripts = False
    cfg.validation_ds.batch_size = 8
    cfg.validation_ds.num_workers = 2
    cfg.validation_ds.pin_memory = True
    cfg.validation_ds.trim_silence = True
    cfg.validation_ds.sample_rate = 16000

    cfg.optim.name="adamw" 
    cfg.optim.sched.d_model=512
    
    cfg.spec_augment.freq_masks = 2 
    cfg.spec_augment.time_masks = 10 
    cfg.spec_augment.freq_width = 27
    cfg.spec_augment.time_width = 0.05

The set-up is further done as:

import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained('stt_en_conformer_transducer_large', override_config_path=cfg, trainer=trainer, map_location='cpu')
model.change_vocabulary(new_tokenizer_dir=TOKENIZER, new_tokenizer_type=TOKENIZER_TYPE)
model.setup_training_data(cfg.train_ds)
model.setup_multiple_validation_data(cfg.validation_ds)
model.setup_optimization(optim_config=cfg.optim)
model.cfg = model._cfg
trainer.fit(model)

Now to the actual questions:
As I'm quite limited on computational resources I'd like to make sure that I start the training in an "as reasonable as possible" configuration. Thus I'm wondering whether the parameters I have are reasonable to use.

For instance:

1. Should the optimizer config be changed from the base config?
2. Should the spec_augment be modified?
3. Is this config for the tokenizer reasonable?
4. Is the effective batch size of 32 (4 * 8 with gradient accumulation) reasonable?
5. Is everything set up in the correct order?
6. Any rough estimate of the number of epochs generally needed?
7. Any other things that I might have missed or done incorrectly?

I'd highly appreciate any help & pointers.

Br


Oscaarjs · 2022-11-15T09:00:42Z

Oscaarjs
Nov 15, 2022
Author

I'm tagging you @titu1994 as I've seen you give great answers on a lot of other similar questions, thanks in advance!

0 replies

titu1994 · 2022-11-15T09:45:46Z

titu1994
Nov 15, 2022
Maintainer

Hi, thanks for the incredibly detailed discussion! First off, let's start with some important considerations:

1. You have a tokenizer and are referring to the BPE config, but you are manually inserting parts of the config from the character-based models. Please refer to the BPE / subword config only when using tokenizers. Mixing the two will not work and may silently cause major issues.
2. The validation set cannot be used as a tarred dataset. It is also quite wasteful to have over 10 hours of validation data, because you'll mostly cover the whole vocabulary with that much. Also, NeMo does not support tarred datasets for validation and test data loaders, because they drop samples, which would make results incomparable to academia.
3. Data split looks great! The number of tar files makes sense given the size of the dataset.
4. Tokenizer looks good. What you need to make sure of is that the vocab size matches that of the pretrained model. You might want to try loading the base model and checking its vocab size. Some Conformers use a smaller vocab size (128 or 256), and you'll get the best results by matching the vocab size so that all of the model's weights can be loaded.
5. Make sure you're using exp manager with the two resume flags set (they are false by default in the configs). That way, if training halts for some reason, NeMo can resume training until completion, as long as your experiment directory remains the same and has a checkpoint in it.
6. You should not need to provide "labels" in train_ds, validation_ds and test_ds for subword-based models.
7. Trim silence should be disabled. We find it sometimes negatively impacts model inference.
8. validation_ds can't have tarred flags. Please refer to the config shown on GitHub and use only the flags that are supported there. Also, don't use "normalize_transcripts" etc.
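
The dataset-related points above can be sketched as a small cleanup pass. This is only an illustration on plain nested dicts, not on the OmegaConf objects from the question; the helper name `apply_checklist` and the sample values are hypothetical, and the two resume keys are the exp manager flags mentioned above.

```python
import copy


def apply_checklist(model_cfg):
    """Return a cleaned copy of a model config given as nested dicts."""
    cfg = copy.deepcopy(model_cfg)
    for section in ("train_ds", "validation_ds", "test_ds"):
        ds = cfg.get(section)
        if ds is None:
            continue
        # Subword models take their vocabulary from the tokenizer.
        ds.pop("labels", None)
        ds.pop("normalize_transcripts", None)
        # Silence trimming is advised against above.
        ds["trim_silence"] = False
    for section in ("validation_ds", "test_ds"):
        ds = cfg.get(section)
        if ds is None:
            continue
        # Validation and test data stay untarred.
        ds.pop("is_tarred", None)
        ds.pop("tarred_audio_filepaths", None)
    return cfg


# Resume flags for the exp manager section of the config.
EXP_MANAGER_RESUME = {
    "resume_if_exists": True,
    "resume_ignore_no_checkpoint": True,
}

# Illustrative input mirroring the settings in the question.
raw = {
    "train_ds": {"is_tarred": True, "labels": "tok_dir",
                 "normalize_transcripts": False, "trim_silence": True},
    "validation_ds": {"is_tarred": True, "tarred_audio_filepaths": "dev.tar",
                      "labels": "tok_dir", "trim_silence": True},
}
clean = apply_checklist(raw)
print(clean)
```

The training set keeps its tarred flags; only the validation and test sections lose them.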

Now on to the setup code:

You are changing the tokenizer but not loading the weights of the original model! If you match the tokenizer size, this will speed up convergence phenomenally and lead to better WER overall, even if the base language is completely unrelated to the new one! Refer to the steps here - https://colab.research.google.com/gist/titu1994/080c5387c4c02b41ce79dd4405d87104

It's not a full tutorial, but it has most of the important steps.

1. The "default" number of warmup steps in the optimizer is 15000. Your training script has 300 epochs over nearly 300,000 files, so at batch size 8 with 4 gradient accumulation steps that's roughly 300 * 300000 / 8 / 4 ~ 2.8 million steps.

   It is simply too vast for 1 GPU. I would first do short experiments - maybe 5-10 epochs ~ 50,000-100,000 steps. That should be a reasonable baseline for training. As you'll see in the Hindi notebook, even 5000 steps can be the start of training - WER goes from 100% to 40%!

2. Spec augment is fine. Transducers do overfit fast, and at worst you can always shut it off if your 5-10 epoch results indicate that results are OK given the amount of data.
3. The tokenizer follows the recommendations. The only thing is to check the tokenizer vocab size of the base model (you can print the length of model.decoder.vocabulary) and match your tokenizer to that, so that all the params of the model are loaded.
4. The effective batch size is a little small. We generally aim for a minimum of 256 for any realistic training, but that's when we train from scratch. During fine-tuning, as long as you load all the params of the model after changing the tokenizer, you can use an effective batch size of 32 reasonably well. (The Hindi one used BS = 8 and still trains.)
5. Yup, everything seems to be set up in the correct order. Good job setting the trainer in the restore call. For reference, you can see the training flow here - https://github.com/NVIDIA/NeMo/tree/main/examples/asr/asr_transducer#model-execution-overview
6. Start off small - you need some baseline to be able to say whether more epochs helped or not. 5-10 is a reasonable start given your compute availability. If you see that going from 5 epochs to 10 improved WER by a significant amount (at least 1-2% absolute WER), you should take that pretrained checkpoint and fine-tune it further for longer (use a smaller peak LR now - so it should be model.optim.lr=0.5 instead of 5 in the first run).
7. Reach out to us if you face any problems. Given some time zone differences there will be a delay, but we'll do our best to help you out.
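
For reference, the step counts quoted above can be reproduced with a few lines of arithmetic. This is a rough sketch that assumes every file is seen once per epoch and that one optimizer step happens per `accumulate_grad_batches` batches; the function name `optimizer_steps` is made up for illustration.

```python
import math


def optimizer_steps(num_files, epochs, batch_size, accumulate_grad_batches, devices=1):
    """Approximate number of optimizer updates for a training run."""
    batches_per_epoch = math.ceil(num_files / (batch_size * devices))
    steps_per_epoch = math.ceil(batches_per_epoch / accumulate_grad_batches)
    return epochs * steps_per_epoch


for epochs in (5, 10, 300):
    steps = optimizer_steps(300_000, epochs, batch_size=8, accumulate_grad_batches=4)
    print(f"{epochs} epochs -> {steps:,} optimizer steps")
```

With 300,000 files this gives 46,875 steps for 5 epochs, 93,750 for 10, and 2,812,500 for 300, which is where the 50,000-100,000 range for a short experiment comes from.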

8 replies

titu1994 Nov 18, 2022
Maintainer

Your LR peak seems to be too high. I think it's best to use 10-15k warmup steps, or to reduce the LR to 1 or 2, so the peak doesn't go much above 0.002.
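
To see where the peak lands for a given setting, here is a minimal sketch of the standard Noam schedule (linear warmup followed by inverse square-root decay). It assumes NeMo's NoamAnnealing follows this textbook formula and that d_model is 512 as set in the question; the function names are illustrative.

```python
def noam_lr(step, scale, d_model, warmup_steps):
    """Textbook Noam schedule: linear warmup, then inverse square-root decay."""
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first step
    return scale * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


def noam_peak(scale, d_model, warmup_steps):
    """The schedule reaches its maximum at the last warmup step."""
    return noam_lr(warmup_steps, scale, d_model, warmup_steps)


for scale, warmup in [(5.0, 10_000), (5.0, 15_000), (2.0, 10_000), (1.0, 10_000)]:
    peak = noam_peak(scale, d_model=512, warmup_steps=warmup)
    print(f"lr={scale}, warmup={warmup}: peak={peak:.5f}")
```

Under these assumptions, lr=5 with 10,000 warmup steps peaks at about 0.0022, and stretching the warmup to 15,000 steps brings the peak down to about 0.0018.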

Overall your model seems to be learning well. It is a transducer though, so it should do a lot better. Are you using spec augment? You could try disabling it (set both freq_masks and time_masks to 0) and see if you can overfit your dataset. As far as I can tell you will need a longer training run, but that should reduce your WER to around 10-15% at the end of training.

How noisy is your test set? We have a lot of compute, so we usually train for 100-200 epochs and clean up the dev / test set before we do eval on it. That might be why you're seeing 25% WER. You could use the Speech Data Explorer in NeMo to work out what mistakes are being made, or whether it's the ground truth itself that is noisy.
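
As a quick first pass before opening the explorer, utterances can be ranked by per-utterance WER so the worst ones get inspected first. This is a plain-Python sketch and not the Speech Data Explorer's own implementation; the function names and the three example pairs are made up for illustration.

```python
def word_errors(reference, hypothesis):
    """Word-level Levenshtein distance between two transcripts."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def worst_utterances(pairs, top_k=3):
    """Rank (reference, hypothesis) pairs by per-utterance WER, highest first."""
    scored = []
    for ref, hyp in pairs:
        n_ref = max(len(ref.split()), 1)  # guard against empty references
        scored.append((word_errors(ref, hyp) / n_ref, ref, hyp))
    return sorted(scored, key=lambda item: item[0], reverse=True)[:top_k]


# Illustrative pairs only.
pairs = [
    ("jag heter oscar", "jag heter oscar"),
    ("det regnar i dag", "det regnar idag"),
    ("god morgon", "hej"),
]
for wer, ref, hyp in worst_utterances(pairs):
    print(f"{wer:.2f}  ref={ref!r}  hyp={hyp!r}")
```

Cases like "i dag" versus "idag" are worth checking by hand, since they may reflect inconsistent ground-truth spelling and not a model error.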

Oscaarjs Nov 18, 2022
Author

Thanks a lot again for your quick replies!

I see. So far I've indeed used spec augment; I will try turning it off. By the way, would there be any reason to also turn off dropout in the encoder/decoder, or would this likely just result in overfitting too fast?

Further, would you recommend retraining it from "scratch" (i.e. not from the 10th epoch with the new LR) or continuing from the 10th epoch?

As another sanity check, just so I understand things correctly: below is the order in which I set up the model before training. Is it correct in terms of preserving the weights of the enc/dec/joint from the pretrained English model?

model = nemo_asr.models.ASRModel.from_pretrained('stt_en_conformer_transducer_large', override_config_path=cfg, trainer=trainer, map_location='cpu')
pretrained_decoder = model.decoder.state_dict()
pretrained_joint = model.joint.state_dict()
model.change_vocabulary(new_tokenizer_dir=TOKENIZER, new_tokenizer_type=TOKENIZER_TYPE)
model.encoder.unfreeze()
logging.info("Model encoder has been un-frozen")
# Insert preserved model weights
model.decoder.load_state_dict(pretrained_decoder)
logging.info("Decoder shapes matched - restored weights from pre-trained model")
model.joint.load_state_dict(pretrained_joint)
logging.info("Joint shapes matched - restored weights from pre-trained model")
model.setup_training_data(cfg.train_ds)
model.setup_multiple_validation_data(cfg.validation_ds)
model.setup_multiple_test_data(cfg.test_ds)
model.setup_optimization(optim_config=cfg.optim)
model.spec_augmentation = model.from_config_dict(model.cfg.spec_augment)

I would say that the dataset in general is fairly clean but will try out the data explorer, thanks for the tip!

Best

titu1994 Nov 18, 2022
Maintainer

Dropout in the encoder is important; Conformer will overfit too rapidly without it. You could retrain from scratch if preferred, since the 10-epoch run has not reached the best possible WER.

When we train for 100-200 epochs, we use that checkpoint for the next generation as initialization.

Order of operations looks good

Oscaarjs Nov 25, 2022
Author

Thanks again!

I've now retrained it without augmentation and with more (15 000) warm-up steps. The training results are as follows:

It manages to reach a lower WER (lowest val_wer of 11.26%). Do you have any other configuration changes in mind that may improve the results further, or is it likely that there's not enough data?

Best

titu1994 Nov 25, 2022
Maintainer

Possibly noise in the data + insufficient data to train a model to full convergence. Note that your train loss has reached near saturation and has almost flatlined. You can try some tricks, such as training it for more steps with this model as initialization and reducing the peak LR to 0.0005 (you'll need to figure out the LR scaler for Noam).
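
One way to work out that scaler is to invert the peak of the standard Noam schedule, peak = scale / sqrt(d_model * warmup_steps). This is a sketch under the assumptions that NeMo's NoamAnnealing follows that formula and that the next run keeps d_model=512 and 15,000 warmup steps; the function name is illustrative.

```python
import math


def noam_scale_for_peak(target_peak, d_model, warmup_steps):
    """Solve peak = scale / sqrt(d_model * warmup_steps) for the scale."""
    return target_peak * math.sqrt(d_model * warmup_steps)


scale = noam_scale_for_peak(0.0005, d_model=512, warmup_steps=15_000)
print(f"optim.lr for a 0.0005 peak: {scale:.3f}")
```

Under these assumptions the value comes out at roughly 1.39, so an optim.lr of about 1.4 would put the peak near 0.0005; a different warmup length changes the answer.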

Another option is to fine-tune this model with light augmentation - maybe 2x freq and 2/5x time masks. Though given how low your model has reached while the train loss is not at 0 (its slope has flatlined well above a 0.xx value, close to the 10s mark), there is probably some irreducible labeling error. So augmentation might help a little bit, but it won't fix the full issue.

Note that RNNT is very prone to overfitting with perfect data for large models, and often reaches a train loss close to 0 without specaug. So if your train loss were approaching 0 I'd strongly suggest augmentation, but in this case it might not help that much.
