Is it possible to adjust the speaking rate in speech synthesis? #2502
Hello, I see in the DeepLearningExamples repo that FastPitch allows adjusting the speaking rate by passing a flag. Thanks!
Replies: 2 comments 4 replies
Yes, it is possible to adjust the speaking rate of duration predictor models (i.e. most of the fast- models). It is unfortunately undocumented. We plan on adding a notebook to explain how to do so soon. Technically, all the duration predictor does is predict how many frames each character takes up. E.g. if the input chars are 'a', 'n', 'd' and the duration predictor predicts 2, 4, 4, the embeddings of those chars get sent to the decoder as 'aannnndddd'. To adjust the pace, you can manually modify the output of this duration predictor: larger durations slow the speech down, smaller ones speed it up. It is also possible to adjust the speaking rate of Tacotron2 by adjusting the prenet dropout rate, but too much will result in degraded audio quality.
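To make the 'a', 'n', 'd' example concrete, here is a rough sketch in plain Python. It is not NeMo code: `expand_by_duration` and `scale_durations` are hypothetical helper names, characters stand in for the embeddings, and dividing the durations by a `pace` factor (with rounding and a one-frame minimum) is only one assumed way of "playing with" the duration predictor output.

```python
def expand_by_duration(chars, durations):
    """Repeat each character by its predicted number of frames."""
    return "".join(ch * d for ch, d in zip(chars, durations))


def scale_durations(durations, pace=1.0):
    """Divide predicted durations by `pace` (assumed convention:
    pace > 1 is faster speech, pace < 1 is slower speech).

    Keeping at least one frame per character is a choice made for
    this sketch, not something the thread specifies.
    """
    if pace <= 0:
        raise ValueError("pace must be positive")
    return [max(1, round(d / pace)) for d in durations]


chars = ["a", "n", "d"]
predicted = [2, 4, 4]  # durations from the example in the reply above

print(expand_by_duration(chars, predicted))                        # aannnndddd
print(expand_by_duration(chars, scale_durations(predicted, 2.0)))  # anndd
print(expand_by_duration(chars, scale_durations(predicted, 0.5)))  # aaaannnnnnnndddddddd
```

In a real model the same idea would apply to the tensor of predicted durations before the embeddings are repeated and passed to the decoder.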
A notebook showing how to adjust speaking rate and pitch for FastPitch has been added to the main branch: https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/2_Inference_DurationPitchControl.ipynb
Colab link: https://colab.research.google.com/github/NVIDIA/NeMo/blob/main/tutorials/tts/2_Inference_DurationPitchControl.ipynb