StyleTTS2 Training from Scratch Notebooks #144
Replies: 4 comments 43 replies
-
Thank you for this! Can't wait to try it. How much of a corpus did you need to train from scratch (minutes/hours of audio), and how long did it take to get a production-quality result when training from scratch (number of epochs? hours? and on what hardware?)
-
I thought I'd shed some light on the progress so far. It's been a bit slow, especially since I was preparing all of the WAV files and the training/validation text by hand to minimize the error rate. I currently have around 10 hours of audio ready in 8534 WAV files, ranging from 1 to 22 seconds each. In the end, I decided to give it a try with what I have, since the weeks of cutting and syncing WAVs to text were getting tedious. I used VAST.ai for GPU rental and got the first stage trained up to 200 epochs in about 3.5 hours. The hardware and settings I used for this training were as follows:
The second stage is still in progress, currently at epoch 67. It's taking considerably longer because it can only run on a single GPU (due to bug #7). Therefore, I'm running it every day for about 12-18 hours, which has brought the budget up to almost $200 so far. I decided to run it only when I can supervise the process, since I didn't know how many epochs the A100's memory could handle. In the end, I used the following parameters for training:
Here is how inference sounds at epoch 65, compared to the original TTS voice: https://jmp.sh/aWMQe69G
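The manual pairing of WAV files with transcripts described above can be partly automated. Below is a minimal sketch (not the author's actual tooling) that pairs each `.wav` with a same-named `.txt` transcript and writes a pipe-separated `filename|text|speaker` manifest in the style of StyleTTS2's train/val list files, skipping clips outside the 1-22 second range mentioned above. The directory layout, the fixed speaker ID, and the duration bounds are assumptions; note that the real StyleTTS2 pipeline also expects the text to be phonemized.

```python
import wave
from pathlib import Path

def clip_duration(wav_path):
    """Return the duration in seconds of a PCM WAV file."""
    with wave.open(str(wav_path), "rb") as w:
        return w.getnframes() / w.getframerate()

def build_manifest(data_dir, out_path, speaker_id=0, min_s=1.0, max_s=22.0):
    """Pair each .wav with a same-named .txt transcript and write
    pipe-separated manifest lines, skipping clips whose duration
    falls outside the [min_s, max_s] range."""
    lines = []
    for wav in sorted(Path(data_dir).glob("*.wav")):
        txt = wav.with_suffix(".txt")
        if not txt.exists():
            continue  # no transcript prepared for this clip yet
        if not (min_s <= clip_duration(wav) <= max_s):
            continue  # too short or too long for training
        text = txt.read_text(encoding="utf-8").strip()
        lines.append(f"{wav.name}|{text}|{speaker_id}")
    Path(out_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)
```

A duration filter like this also makes it easy to report corpus statistics (clip count, total hours) before committing to a training run.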
-
I couldn't increase the batch size per A100 beyond 8 when training. How did you manage to train with such a large batch size?
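For what it's worth, a common workaround when per-GPU memory caps the micro-batch at 8 is gradient accumulation: run several micro-batches, average their gradients, and apply one optimizer step, which reproduces the arithmetic of a single large batch. A framework-free numeric sketch (in PyTorch this corresponds to calling `loss.backward()` per micro-batch and `optimizer.step()` every N micro-batches); the quadratic loss here is purely illustrative:

```python
def grad(w, batch):
    """Gradient of the mean of 0.5 * (w - x)^2 over a batch."""
    return sum(w - x for x in batch) / len(batch)

def step_large_batch(w, batch, lr):
    """One SGD step on the full batch at once."""
    return w - lr * grad(w, batch)

def step_accumulated(w, batch, lr, accum_steps):
    """Split the batch into micro-batches, accumulate scaled
    gradients, then update once -- same arithmetic as the full batch."""
    n = len(batch) // accum_steps
    g = 0.0
    for i in range(accum_steps):
        micro = batch[i * n:(i + 1) * n]
        g += grad(w, micro) / accum_steps  # scale so the sum is a mean
    return w - lr * g
```

The trade-off is wall-clock time, not memory: N accumulation steps cost roughly N forward/backward passes per optimizer step.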
-
Hello @martinambrus, nice to meet you. I have just seen your old posts here.
-
I'm currently learning how to train a custom StyleTTS2 model from scratch.
I'm very new to this, and thanks to this amazing project and its community, I've already gained a considerable amount of knowledge. Here, I'd like to share that knowledge with you.
To that end, I created 2 Jupyter Notebooks that I use for my own model training and audio sample preparation.
I'd like to stress that these are only my own methods and findings, and there is probably a better way to do things, especially in the audio preparation part. But since I tried, and failed, to find a reliable automated method there, some manual fine-tuning steps are still required to perfect the audio input.
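One automated approach of the kind alluded to here is silence-based segmentation. The sketch below (an illustration, not the notebooks' actual code) scans a 16-bit mono PCM WAV in fixed windows, flags windows whose RMS falls below a threshold, and proposes cut points in the middle of sufficiently long silent gaps. All three parameters are guesses that typically need exactly the sort of manual tuning described above.

```python
import struct
import wave

def find_silence_splits(wav_path, window_ms=20, silence_thresh=500,
                        min_silence_ms=300):
    """Return sample offsets of silent-gap midpoints in a 16-bit mono
    PCM WAV file, as rough candidate cut points between utterances."""
    with wave.open(wav_path, "rb") as w:
        rate = w.getframerate()
        frames = w.readframes(w.getnframes())
    samples = struct.unpack(f"<{len(frames) // 2}h", frames)

    win = max(1, rate * window_ms // 1000)        # samples per window
    min_win = max(1, min_silence_ms // window_ms)  # windows per silent gap

    # Flag each fixed-size window as quiet (RMS below threshold) or loud.
    quiet = []
    for i in range(0, len(samples) - win, win):
        chunk = samples[i:i + win]
        rms = (sum(s * s for s in chunk) / win) ** 0.5
        quiet.append(rms < silence_thresh)

    # Emit a cut point at the midpoint of every long-enough quiet run.
    cuts, run = [], 0
    for idx, q in enumerate(quiet):
        if q:
            run += 1
        else:
            if run >= min_win:
                cuts.append((idx - run // 2) * win)
            run = 0
    return cuts
```

In practice, breaths, soft consonants, and room noise defeat any single threshold, which is why the cut points still want a manual review pass.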
My 2 Notebooks:
These notebooks can be used on Google Colab but also outside of it, on a dedicated cloud machine running Jupyter.
Perhaps I should clarify that I'm still in the process of learning and, as such, have yet to create a production-ready model. See my last update below for more information on production-ready model training and results.
I previously used these Notebooks to create a low-quality model from ~150 WAV files of 1 to 2.5 seconds each, as a proof of concept. At present, I use them to finalize my production-ready model training.
To that end, I used 2 TensorDock Cloud GPU Machines:
for the 1st training phase, a beefy one (costing approx. $3.38/hr) with:
I've done 1000 epochs, of which only the first 400 were needed, at least judging by the validation loss (around 0.38 - 0.4). Those 400 epochs finished very quickly on this small data set (in less than 1 or 2 hours, if I recall correctly).
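Spotting the point where validation loss plateaus (here, around epoch 400 at roughly 0.38-0.4) can be automated with a simple patience rule over the logged losses. A minimal sketch, with hypothetical parameter values:

```python
def best_epoch(val_losses, patience=50, min_delta=1e-3):
    """Return (epoch, loss) of the last meaningful improvement; stop
    scanning once `patience` epochs pass without an improvement of at
    least `min_delta` -- a cheap way to spot where training plateaued."""
    best_i, best = 0, float("inf")
    for i, v in enumerate(val_losses):
        if v < best - min_delta:
            best_i, best = i, v          # genuine improvement
        elif i - best_i >= patience:
            break                        # plateau: stop scanning
    return best_i, best
```

The same rule, applied during training rather than after it, is the usual early-stopping criterion and would have saved the 600 extra epochs.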
for the 2nd training phase, I used a similar machine but with only a single GPU, since there is still a bug in the code that doesn't allow us to use DDP (accelerate) training in this phase. The cost went down to approx. $0.65/hr, and it took about 2-4 hours to finish 100 epochs on this small data set.
I'm currently working towards the creation of a large high-quality corpus, spanning approx. 45 hours of audio.
The source for this corpus is 4 audiobooks read by another high-quality (now decommissioned) TTS voice to which I had a commercial license.
My goal is to try and train a similarly-sounding voice model by utilizing StyleTTS2.
Here is an example of what the original TTS voice sounds like: https://jmp.sh/zOrGoel3
I should also mention that I already used the set of those ~150 WAV files to fine-tune StyleTTS2 to the new voice, and even with as little data as those files provide, I was able to achieve a very good voice transfer quality.
I hope these Notebooks will help someone to automate their training, too :)