TTS - Mixing datasets for FastPitch + HiFiGAN #3678
-
Hi all. At the end of the FastPitch_Finetuning.ipynb tutorial (https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/FastPitch_Finetuning.ipynb) it suggests two different ways to improve performance, one of which is to mix new speaker data with old speaker data, but it isn't 100% clear how to do that. I also found another FastPitch_Finetuning.ipynb here: which is different from the first one. Is that one still valid, or do I need something more that is not there? Why was it abandoned? Thanks!
-
@Oktai15 any input?
-
@redoctopus can you please comment on this?
-
I've updated the FastPitch fine-tuning notebook in #3954; hopefully the procedure for mixing datasets is now clearer. As mentioned in the tutorial, we don't have an explicit code section in the notebook where you can run this mixing, due to resource limitations on Colab. But if you have a local machine to use, you should be able to generate a new manifest and re-run the fine-tuning commands with the few command-line changes specified, without too much trouble (I hope!).
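For reference, NeMo training manifests are JSON-lines files where each line is one utterance record. A minimal sketch of generating the combined manifest locally by concatenating the old- and new-speaker manifests (the filenames here are placeholders, not names from the tutorial):

```python
import json

# Hypothetical paths -- substitute your own manifest locations.
old_manifest = "old_speaker_manifest.json"
new_manifest = "new_speaker_manifest.json"
mixed_manifest = "mixed_manifest.json"

with open(mixed_manifest, "w", encoding="utf-8") as out:
    for path in (old_manifest, new_manifest):
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue
                # Each line is a JSON object, typically of the form:
                # {"audio_filepath": "...", "text": "...", "duration": 3.2}
                # Round-tripping through json validates each entry.
                entry = json.loads(line)
                out.write(json.dumps(entry) + "\n")
```

You would then point the fine-tuning command's train manifest argument at `mixed_manifest.json` instead of the new-speaker manifest alone.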
-
Hello! |
-
Yes, please see the finetuning tutorial mentioned in the original question, specifically the last section "Adding more data": https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/FastPitch_Finetuning.ipynb
We usually do a 1:1 ratio of old to new entries in the manifest, repeating the new entries as necessary to keep the dataset balanced, until you have the total number of samples you want to train with. (See the tutorial for an example.) I don't think we've done any formal experiments on the effect of data mixing as the dataset scales, but mixing will definitely benefit smaller new-speaker datasets more dramatically than larger ones. With roughly 15 minutes to an hour of new speaker audio, mixing helps a lot; beyond that it becomes less noticeable. And of course it also depends on whether you care about preserving the quality of the original speaker's generated speech.
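A minimal sketch of that 1:1 balancing, assuming JSON-lines manifests and that the new-speaker set is the smaller one; the filenames are placeholders:

```python
import json
import random

def read_manifest(path):
    """Read a NeMo JSON-lines manifest into a list of dicts."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

old_entries = read_manifest("old_speaker_manifest.json")  # large original-speaker set
new_entries = read_manifest("new_speaker_manifest.json")  # small new-speaker set

# Repeat the new-speaker entries until their count matches the old-speaker
# count, giving roughly a 1:1 old:new ratio in the mixed manifest.
repeats, remainder = divmod(len(old_entries), len(new_entries))
balanced_new = new_entries * repeats + new_entries[:remainder]

mixed = old_entries + balanced_new
random.shuffle(mixed)  # interleave speakers rather than leaving them in blocks

with open("mixed_manifest.json", "w", encoding="utf-8") as out:
    for entry in mixed:
        out.write(json.dumps(entry) + "\n")
```

Repeating entries rather than subsampling the old speaker keeps the total dataset size up, which matters most when the new-speaker set is only minutes long.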
-
@gedefet could you tell me please about your experience: did you have success fine-tuning FastPitch + HiFi-GAN with a small new dataset of 15-30 minutes? I tried fine-tuning one FastPitch variation, but the quality is not good enough, and I suspect that FastPitch only gives good quality when trained from scratch with large datasets. I would be grateful for any information about your experience.