-
If the transcripts only use the lowercase 26-character alphabet, space, and apostrophe, then there is no need to change the Tokenizer. However, if there are additional tokens in your transcripts and they cannot be normalized away, then yes, you must construct a new Tokenizer. I'd still encourage the new Tokenizer to have the same vocab size as the original model's.
Tokenizers are usually built from a text corpus, or at minimum from the text in the ASR transcripts. When you do create a new Tokenizer, make sure its vocab size matches the model's original vocab size. Transfer learning is good across languages, across accents, across basically every scenario. We always recommend doing transfer learning by loading both the encoder and the decoder weights.
I don't know much about the TAO toolkit, but it may not support this. It's not a requirement, btw, just useful when you are severely constrained in compute or dataset size.
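For what it's worth, here is a rough sketch (in NeMo, not TAO) of what "loading both the encoder and decoder weights" looks like, and where a new tokenizer of the same vocab size would be swapped in if the transcripts really can't be normalized. The model name and tokenizer directory below are placeholders, not values from this thread:

```python
import nemo.collections.asr as nemo_asr

# Restoring a pretrained BPE CTC model loads the encoder and decoder weights together.
model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained("stt_en_conformer_ctc_large")

# Only if the new transcripts contain characters that cannot be normalized away:
# build a new tokenizer offline with the SAME vocab size as the original model,
# then point the model at it (hypothetical directory name below).
# model.change_vocabulary(new_tokenizer_dir="tokenizers/my_domain_1024",
#                         new_tokenizer_type="bpe")
```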
Adapters are NeMo-only for the moment. Btw, if you don't need to change the tokenizer, we recently added the ability to use attention-based adapters to the main branch, which might be very useful for noisy speech adaptation. The example script and its config have some info, but I can add additional info if needed.
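As a minimal sketch of attaching an adapter to a pretrained model (this uses the simple linear adapter config; the attention-based adapters mentioned above are configured via the example script and its config on main, and the adapter name below is just illustrative):

```python
import nemo.collections.asr as nemo_asr
from nemo.collections.common.parts.adapter_modules import LinearAdapterConfig

model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained("stt_en_conformer_ctc_large")

# Small bottleneck adapter whose input size matches the encoder hidden size.
adapter_cfg = LinearAdapterConfig(
    in_features=model.cfg.encoder.d_model,
    dim=64,
)
model.add_adapter(name="noisy_speech", cfg=adapter_cfg)
model.set_enabled_adapters("noisy_speech", enabled=True)
```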
You can take a compatible .nemo file and pass it through nemo2riva to get a Riva-compatible file. I believe the Riva documentation has info regarding its usage.
You can think of an adapter as dynamic model surgery, such that your original model is not harmed. Like any and all DL models, adapters can be made to work with any kind of audio. The only hard constraint is that, for maximum benefit, you should not change the decoder / Tokenizer. We have done adapter studies with ASR on domain shift, 30 mins of specific vocabulary, 15 mins of music, etc. They are very versatile, but when you have sufficient data (hundreds of hours of speech), you'll get more value by fine-tuning the model.
I would suggest trying it, since it usually takes just a few minutes of training if you have the dataset ready and the Tokenizer is supported. There are SSL papers that adapt the decoder + encoder adapters to an entirely new language for ASR - treat adapters as a small model that learns to modify a large model to do something useful in a new domain.
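To illustrate the "few minutes of training" point, the usual pattern (sketched here with placeholder trainer settings, and assuming the training data loader has already been set up on the model) is to freeze the original weights and train only the adapter for a small number of steps:

```python
import pytorch_lightning as pl

# Keep encoder/decoder (and therefore the tokenizer) untouched; train only the adapter.
model.freeze()
model.unfreeze_enabled_adapters()

trainer = pl.Trainer(devices=1, accelerator="gpu", max_steps=1000)
trainer.fit(model)  # assumes model.setup_training_data(...) was called with your manifest
```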
-
@titu1994 Great answer, thank you. It sounds like we won't need a new tokenizer if our text does not have new special characters. I will likely try to see if I can get Adapters working. I put a basic dataset together for the TAO example, with wavs, train_manifest, etc. Some new questions arose from fine-tuning:
Here are some follow-up questions on the answers above:
-
Number of characters is not usually the limiter; your audio must be too long. You'll have to pre-segment the clips; we recommend at most 20-second clips. Epochs are not a good metric - fine-tune for roughly 1000 to 5000 steps depending on your dataset, and if it doesn't overfit, use more steps.

You can use unlabeled data (which must also be pre-segmented) by doing SSL, but it's only significantly useful if the amount of unlabeled data is large. You can use the CTC segmentation toolkit in NeMo to attempt to segment long audio. Labeled data for ASR is simply an audio file and the corresponding text; it does not usually require timestamps.

If you are using adapters, then you must not change the decoder / Tokenizer. I would not suggest freezing the encoder - your task is primarily adaptation to difficult speech, so your encoder should be updated. The model's accuracy in the end will depend on the quality of the audio and the preciseness of the text transcripts used for training. ASR is highly dependent on good audio-text pairs and will not miraculously get good results if either of those two is not provided.
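As a concrete example of the pre-segmentation / 20-second point (filenames below are assumptions), you can drop over-long entries from a NeMo-style manifest before training:

```python
import json

MAX_DURATION = 20.0  # seconds; longer clips should be pre-segmented instead

# Each manifest line is a JSON object with at least "audio_filepath", "duration", "text".
with open("train_manifest.json") as src, open("train_manifest_filtered.json", "w") as dst:
    for line in src:
        entry = json.loads(line)
        if entry["duration"] <= MAX_DURATION:
            dst.write(json.dumps(entry) + "\n")
```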
-
I am dealing with a domain-specific dataset that may have challenges with a high out-of-the-box WER, so I want to experiment with the few hours of audio-text data we have and see if that can make a positive impact. We are not quite there yet to worry about issues like overfitting; we just want to see whether the Conformer-CTC can adapt to our audio samples, which are somewhat noisy and include very fast speech patterns.
I am trying the riva + tao examples, like the speechtotext_conformer_notebook_vv1.0 notebook. Then I learned more about fine-tuning from existing papers and posts like #4183. So I have some additional questions, since this is my first time dealing with fine-tuning ASR models.
From reading more about adapters, it seems like they may work best when there are new samples of audio without changes in text structure. That sounds great for speaker adaptation, but not necessarily a replacement for fine-tuning if both the audio and the text structure are too different from the training data. Is that a correct understanding of adapters? In our case, we have a very specialized domain to fine-tune on; it differs from the standard text in the LibriSpeech dataset, so it sounds like it won't be a good idea to try adapters first - we'd still have to fine-tune the model with a sufficient amount of audio+text data first.