TTS - Mixing datasets for FastPitch + HiFiGAN #3678
-
Hi all. At the end of the FastPitch_Finetuning.ipynb tutorial (https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/FastPitch_Finetuning.ipynb) it suggests two different ways to improve performance, one of which is to mix new speaker data with old speaker data, but it isn't 100% clear how to do that. I also found another FastPitch_Finetuning.ipynb here: which is different from the first one. Is that one still valid, or do I need something more that is not there? Why was it abandoned? Thanks!
-
@Oktai15 any input?
-
@redoctopus can you please comment on this?
-
I've updated the FastPitch fine-tuning notebook in #3954; hopefully the procedure for mixing datasets is now clearer. As mentioned in the tutorial, we don't have an explicit code section in the notebook where you can run this mixing, due to resource limitations on Colab. But if you have a local machine to use, you should be able to generate a new manifest and re-run the fine-tuning commands with the few command-line changes specified, without too much trouble (I hope!).
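For reference, NeMo training manifests are JSON-lines files where each line is one utterance record. A minimal sketch of generating the combined manifest locally by concatenating the old- and new-speaker manifests (the filenames here are placeholders, not names from the tutorial):

```python
import json

# Hypothetical paths -- substitute your own manifest locations.
old_manifest = "old_speaker_manifest.json"
new_manifest = "new_speaker_manifest.json"
mixed_manifest = "mixed_manifest.json"

with open(mixed_manifest, "w", encoding="utf-8") as out:
    for path in (old_manifest, new_manifest):
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue
                # Each line is a JSON object, typically of the form:
                # {"audio_filepath": "...", "text": "...", "duration": 3.2}
                # Round-tripping through json validates each entry.
                entry = json.loads(line)
                out.write(json.dumps(entry) + "\n")
```

You would then point the fine-tuning command's train manifest argument at `mixed_manifest.json` instead of the new-speaker manifest alone.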
-
Hello! |
-
Yes, please see the finetuning tutorial mentioned in the original question, specifically the last section "Adding more data": https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/FastPitch_Finetuning.ipynb
We usually do a 1:1 ratio of old to new entries in the manifest, repeating the new entries as necessary to keep the dataset balanced, until you have the total number of samples you want to train with. (See the tutorial for an example.) I don't think we've done any formal experiments on the effect of data mixing as the dataset scales, but mixing will definitely benefit smaller new-speaker datasets more dramatically than larger ones. With roughly 15 minutes to an hour of new speaker audio, mixing helps a lot; beyond that it becomes less noticeable. And of course it also depends on whether you care about preserving the quality of the original speaker's generated speech.
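A minimal sketch of that 1:1 balancing, assuming JSON-lines manifests and that the new-speaker set is the smaller one; the filenames are placeholders:

```python
import json
import random

def read_manifest(path):
    """Read a NeMo JSON-lines manifest into a list of dicts."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

old_entries = read_manifest("old_speaker_manifest.json")  # large original-speaker set
new_entries = read_manifest("new_speaker_manifest.json")  # small new-speaker set

# Repeat the new-speaker entries until their count matches the old-speaker
# count, giving roughly a 1:1 old:new ratio in the mixed manifest.
repeats, remainder = divmod(len(old_entries), len(new_entries))
balanced_new = new_entries * repeats + new_entries[:remainder]

mixed = old_entries + balanced_new
random.shuffle(mixed)  # interleave speakers rather than leaving them in blocks

with open("mixed_manifest.json", "w", encoding="utf-8") as out:
    for entry in mixed:
        out.write(json.dumps(entry) + "\n")
```

Repeating entries rather than subsampling the old speaker keeps the total dataset size up, which matters most when the new-speaker set is only minutes long.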
-
@gedefet could you tell me please about your experience: did you have success fine-tuning FastPitch + HiFi-GAN with a small new dataset of 15-30 minutes? I tried fine-tuning one FastPitch variation, but the quality is not good enough, and I suspect that FastPitch only gives good quality when trained from scratch with large datasets. I would be grateful for any information about your experience.