Is it possible to adjust the speaking rate in speech synthesis? #2502

godspirit00 · 2021-07-17T12:22:39Z

Hello,

I see in the DeepLearningExamples repo that FastPitch allows adjusting the speaking rate by passing a pace flag to inference.py.
But I don't see a similar flag for adjusting the speaking rate in NeMo's docs.
So is it possible to adjust the speaking rate when using the FastPitch+HiFiGAN E2E model in NeMo?

Thanks!

Answered by blisc

Aug 23, 2021

A notebook showing how to adjust the speaking rate and pitch for FastPitch has been added to the main branch: https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/2_Inference_DurationPitchControl.ipynb

Colab link: https://colab.research.google.com/github/NVIDIA/NeMo/blob/main/tutorials/tts/2_Inference_DurationPitchControl.ipynb

blisc · 2021-08-13T14:12:24Z

Maintainer

Yes, it is possible to adjust the speaking rate of duration predictor models (i.e. most of the Fast- models, such as FastPitch). It is unfortunately undocumented. We plan to add a notebook soon that explains how to do so.

Technically, all the duration predictor does is predict how many frames each character takes up. E.g. if the input chars are 'a', 'n', 'd' and the duration predictor predicts 2, 4, 4, the embeddings of those chars get sent to the decoder as 'aannnndddd'. To adjust the pace, you can manually scale the output of this duration predictor: longer durations give slower speech, shorter durations give faster speech.
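The expansion described above can be sketched in a few lines of plain Python. This is only an illustration, not NeMo code: the function name `regulate_length` and the `pace` parameter are hypothetical, and dividing durations by `pace` (so that values above 1.0 speed speech up) is an assumption borrowed from the FastPitch convention mentioned in the question. Real models repeat embedding vectors rather than characters.

```python
def regulate_length(tokens, durations, pace=1.0):
    """Repeat each token by its predicted duration, scaled by 1 / pace.

    tokens    -- input symbols (stand-ins for character embeddings)
    durations -- predicted number of frames per token
    pace      -- hypothetical knob: > 1.0 is faster, < 1.0 is slower
    """
    if pace <= 0:
        raise ValueError("pace must be positive")
    if len(tokens) != len(durations):
        raise ValueError("need exactly one duration per token")
    # Scale every duration, round to whole frames, never go below zero.
    scaled = [max(0, round(d / pace)) for d in durations]
    return "".join(tok * n for tok, n in zip(tokens, scaled))


chars = ["a", "n", "d"]
predicted = [2, 4, 4]

print(regulate_length(chars, predicted))             # aannnndddd
print(regulate_length(chars, predicted, pace=2.0))   # anndd
print(len(regulate_length(chars, predicted, pace=0.5)))  # 20 frames
```

Halving every duration halves the number of decoder frames, which is why the audio plays back faster without changing the input text.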

It is also possible to adjust the speaking rate of Tacotron2 by adjusting the prenet dropout rate, but changing it too much will degrade audio quality.

1 reply

godspirit00 Aug 14, 2021
Author

Thanks for the reply! Looking forward to the documentation on this.

blisc · 2021-08-23T14:29:42Z

Maintainer

A notebook showing how to adjust the speaking rate and pitch for FastPitch has been added to the main branch: https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/2_Inference_DurationPitchControl.ipynb

Colab link: https://colab.research.google.com/github/NVIDIA/NeMo/blob/main/tutorials/tts/2_Inference_DurationPitchControl.ipynb

3 replies

godspirit00 Aug 24, 2021
Author

Thank you for your time and work! I will try it right away.
You mentioned that it is also possible to adjust the speaking rate of Tacotron2. Could you please explain a little more about how I can do it?
Thanks again!

blisc Aug 24, 2021
Maintainer

Since Tacotron2 keeps 1 dropout layer active during inference, you can adjust the dropout rate here: https://github.com/NVIDIA/NeMo/blob/c7ce47e28fab36a89f134ffade770a7a70edf78d/nemo/collections/tts/modules/submodules.py#L169

The default is 50%. By adjusting it between, say, 45% and 52%, you can still produce intelligible speech at different speaking rates.
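To make "a dropout layer that stays active during inference" concrete, here is a toy sketch in plain Python. It does not reproduce NeMo's prenet: `always_on_dropout` is a hypothetical helper, and the rescaling by 1 / (1 - p) is the standard inverted-dropout convention, assumed here rather than taken from the linked source. In NeMo itself you would change the dropout probability at the line linked above.

```python
import random


def always_on_dropout(values, p=0.5, rng=None):
    """Zero each value with probability p and rescale survivors by 1 / (1 - p).

    Unlike ordinary dropout, nothing switches this off at inference time.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    rng = rng or random.Random()
    keep_scale = 1.0 / (1.0 - p)
    return [v * keep_scale if rng.random() >= p else 0.0 for v in values]


# Sweep the range suggested above (45% to 52%) on dummy activations.
rng = random.Random(0)
activations = [1.0] * 10_000
for p in (0.45, 0.50, 0.52):
    out = always_on_dropout(activations, p=p, rng=rng)
    dropped = sum(1 for v in out if v == 0.0) / len(out)
    print(f"p={p:.2f} -> fraction dropped = {dropped:.3f}")
```

The sketch only shows how the rate changes the fraction of zeroed activations; how that shifts the speaking rate depends on the trained model, so the useful range has to be found by listening.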

godspirit00 Aug 25, 2021
Author

Thanks a lot for your reply! I will have a try right away.
