-
If the transcripts only use the lowercase 26-character alphabet, space, and apostrophe, then there is no need to change the Tokenizer. However, if there are additional tokens in your transcripts and they cannot be normalized away, then yes, you must construct a new Tokenizer. I'd still encourage the new Tokenizer to have the same vocab size as the original model's.
Tokenizers are usually built from a text corpus, or at minimum from the text in the ASR transcripts. When you do create a new Tokenizer, make sure its vocab size matches the model's original vocab size. Transfer learning is good across languages, across accents, across basically every scenario. We always recommend doing transfer learning by loading both the encoder and the decoder weights.
I don't know much about the TAO toolkit, but it may not support this. It's not a requirement, btw, just useful when you are severely constrained in compute or dataset size.
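For what it's worth, here is a rough sketch (in NeMo, not TAO) of what "loading both the encoder and decoder weights" looks like, and where a new tokenizer of the same vocab size would be swapped in if the transcripts really can't be normalized. The model name and tokenizer directory below are placeholders, not values from this thread:

```python
import nemo.collections.asr as nemo_asr

# Restoring a pretrained BPE CTC model loads the encoder and decoder weights together.
model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained("stt_en_conformer_ctc_large")

# Only if the new transcripts contain characters that cannot be normalized away:
# build a new tokenizer offline with the SAME vocab size as the original model,
# then point the model at it (hypothetical directory name below).
# model.change_vocabulary(new_tokenizer_dir="tokenizers/my_domain_1024",
#                         new_tokenizer_type="bpe")
```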
Adapters are NeMo-only for the moment. Btw, if you don't need to change the tokenizer, we recently added the ability to use attention-based adapters to the main branch, which might be very useful for noisy speech adaptation. The example script and its config have some info, but I can add additional info if needed.
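As a minimal sketch of attaching an adapter to a pretrained model (this uses the simple linear adapter config; the attention-based adapters mentioned above are configured via the example script and its config on main, and the adapter name below is just illustrative):

```python
import nemo.collections.asr as nemo_asr
from nemo.collections.common.parts.adapter_modules import LinearAdapterConfig

model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained("stt_en_conformer_ctc_large")

# Small bottleneck adapter whose input size matches the encoder hidden size.
adapter_cfg = LinearAdapterConfig(
    in_features=model.cfg.encoder.d_model,
    dim=64,
)
model.add_adapter(name="noisy_speech", cfg=adapter_cfg)
model.set_enabled_adapters("noisy_speech", enabled=True)
```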
You can take a compatible .nemo file and pass it through nemo2riva to get a Riva-compatible file. I believe the Riva documentation has info regarding its usage.
You can think of an adapter as dynamic model surgery, such that your original model is not harmed. Like any and all DL models, adapters can be made to work with any kind of audio. The only hard constraint is that, for maximum benefit, you should not change the decoder / Tokenizer. We have done adapter studies with ASR on domain shift, 30 mins of specific vocabulary, 15 mins of music, etc. They are very versatile, but when you have sufficient data (hundreds of hours of speech), you'll get more value by fine-tuning the model.
I would suggest trying it, since it usually takes just a few minutes of training if you have the dataset ready and the Tokenizer is supported. There are SSL papers that adapt the decoder + encoder adapters to an entirely new language for ASR - treat adapters as a small model that learns to modify a large model to do something useful in a new domain.
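To illustrate the "few minutes of training" point, the usual pattern (sketched here with placeholder trainer settings, and assuming the training data loader has already been set up on the model) is to freeze the original weights and train only the adapter for a small number of steps:

```python
import pytorch_lightning as pl

# Keep encoder/decoder (and therefore the tokenizer) untouched; train only the adapter.
model.freeze()
model.unfreeze_enabled_adapters()

trainer = pl.Trainer(devices=1, accelerator="gpu", max_steps=1000)
trainer.fit(model)  # assumes model.setup_training_data(...) was called with your manifest
```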
-
@titu1994 Great answer, thank you. It sounds like we won't need a new tokenizer if our text does not have new special characters. I will likely try to see if I can get Adapters working. I put a basic dataset together for the TAO example, with wavs, train_manifest, etc. Some new questions arose from fine-tuning:
Here are some follow-up questions on the answers above:
-
Number of characters is not usually the limiter; your audio must be too long. You'll have to pre-segment the clips; we recommend at most 20-second clips. Epochs are not a good metric - fine-tune for roughly 1000 to 5000 steps depending on your dataset, and if it doesn't overfit, use more steps.

You can use unlabeled data (which must also be pre-segmented) by doing SSL, but it's only significantly useful if the amount of unlabeled data is large. You can use the CTC segmentation toolkit in NeMo to attempt to segment long audio. Labeled data for ASR is simply an audio file and the corresponding text; it does not usually require timestamps.

If you are using adapters, then you must not change the decoder / Tokenizer. I would not suggest freezing the encoder - your task is primarily adaptation to difficult speech, so your encoder should be updated. The model's accuracy in the end will depend on the quality of the audio and the preciseness of the text transcripts used for training. ASR is highly dependent on good audio-text pairs and will not miraculously get good results if either of those two is not provided.
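As a concrete example of the pre-segmentation / 20-second point (filenames below are assumptions), you can drop over-long entries from a NeMo-style manifest before training:

```python
import json

MAX_DURATION = 20.0  # seconds; longer clips should be pre-segmented instead

# Each manifest line is a JSON object with at least "audio_filepath", "duration", "text".
with open("train_manifest.json") as src, open("train_manifest_filtered.json", "w") as dst:
    for line in src:
        entry = json.loads(line)
        if entry["duration"] <= MAX_DURATION:
            dst.write(json.dumps(entry) + "\n")
```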
-
I am dealing with a domain-specific dataset that may have challenges with a high out-of-the-box WER, so I want to experiment with the few hours of audio-text data we have and see if that can make a positive impact. We are not quite there yet to worry about issues like overfitting; we just want to see whether the Conformer-CTC can adapt to our audio samples, which are somewhat noisy and include very fast speech patterns.
I am trying the riva + tao examples, like the speechtotext_conformer_notebook_vv1.0 notebook. Then I learned more about fine-tuning from existing papers and posts like #4183. So I have some additional questions, since this is my first time dealing with fine-tuning ASR models.
From reading more about adapters, it seems like they may work best when there are new samples of audio without changes in text structure. That sounds great for speaker adaptation, but not necessarily a replacement for fine-tuning if both the audio and the text structure are too different from the training data. Is that a correct understanding of adapters? In our case, we have a very specialized domain to fine-tune on; it differs from the standard text in the LibriSpeech dataset, so it sounds like it won't be a good idea to try adapters first - we'd still have to fine-tune the model with a sufficient amount of audio+text data first.