StyleTTS2 Training from Scratch Notebooks #144
Replies: 4 comments 43 replies
-
Thank you for this! Can't wait to try it. How much of a corpus did you need to train from scratch (minutes/hours of audio), and how long did it take to get a production-quality result when training from scratch (number of epochs? hours? and on what hardware?)
-
I thought I'd shed some light on the progress so far. It's been a bit slow, especially since I was preparing all of the WAV files and the training/validation text by hand to minimize the error rate. I currently have around 10 hours of audio ready in 8534 WAV files, ranging from 1 to 22 seconds each. In the end, I decided to give it a try with what I have, since the weeks of cutting and syncing WAVs to text were getting tedious. I used VAST.ai for GPU rental and got the first stage trained up to 200 epochs in about 3.5 hours. The hardware and settings I used for this training were as follows:
The second stage is still in progress, currently at epoch 67. It's taking considerably longer because it can only run on a single GPU (due to bug #7). Therefore, I'm running it every day for about 12-18 hours, which has brought the budget up to almost $200 so far. I decided to run it only when I can supervise the process, since I didn't know how many epochs the A100's memory could handle. In the end, I used the following parameters for training:
Here is how inference sounds at epoch 65, compared to the original TTS voice: https://jmp.sh/aWMQe69G
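The manual pairing of WAV files with transcripts described above can be partly automated. Below is a minimal sketch (not the author's actual tooling) that pairs each `.wav` with a same-named `.txt` transcript and writes a pipe-separated `filename|text|speaker` manifest in the style of StyleTTS2's train/val list files, skipping clips outside the 1-22 second range mentioned above. The directory layout, the fixed speaker ID, and the duration bounds are assumptions; note that the real StyleTTS2 pipeline also expects the text to be phonemized.

```python
import wave
from pathlib import Path

def clip_duration(wav_path):
    """Return the duration in seconds of a PCM WAV file."""
    with wave.open(str(wav_path), "rb") as w:
        return w.getnframes() / w.getframerate()

def build_manifest(data_dir, out_path, speaker_id=0, min_s=1.0, max_s=22.0):
    """Pair each .wav with a same-named .txt transcript and write
    pipe-separated manifest lines, skipping clips whose duration
    falls outside the [min_s, max_s] range."""
    lines = []
    for wav in sorted(Path(data_dir).glob("*.wav")):
        txt = wav.with_suffix(".txt")
        if not txt.exists():
            continue  # no transcript prepared for this clip yet
        if not (min_s <= clip_duration(wav) <= max_s):
            continue  # too short or too long for training
        text = txt.read_text(encoding="utf-8").strip()
        lines.append(f"{wav.name}|{text}|{speaker_id}")
    Path(out_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)
```

A duration filter like this also makes it easy to report corpus statistics (clip count, total hours) before committing to a training run.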
-
I couldn't increase the batch size per A100 beyond 8 when training. How did you manage to train with such a large batch size?
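For what it's worth, a common workaround when per-GPU memory caps the micro-batch at 8 is gradient accumulation: run several micro-batches, average their gradients, and apply one optimizer step, which reproduces the arithmetic of a single large batch. A framework-free numeric sketch (in PyTorch this corresponds to calling `loss.backward()` per micro-batch and `optimizer.step()` every N micro-batches); the quadratic loss here is purely illustrative:

```python
def grad(w, batch):
    """Gradient of the mean of 0.5 * (w - x)^2 over a batch."""
    return sum(w - x for x in batch) / len(batch)

def step_large_batch(w, batch, lr):
    """One SGD step on the full batch at once."""
    return w - lr * grad(w, batch)

def step_accumulated(w, batch, lr, accum_steps):
    """Split the batch into micro-batches, accumulate scaled
    gradients, then update once -- same arithmetic as the full batch."""
    n = len(batch) // accum_steps
    g = 0.0
    for i in range(accum_steps):
        micro = batch[i * n:(i + 1) * n]
        g += grad(w, micro) / accum_steps  # scale so the sum is a mean
    return w - lr * g
```

The trade-off is wall-clock time, not memory: N accumulation steps cost roughly N forward/backward passes per optimizer step.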
-
Hello @martinambrus, nice to meet you. I have just seen your old posts here.
-
I'm currently learning how to train a custom StyleTTS2 model from scratch.
I'm very new to this, and thanks to this amazing project and its community, I've already gained a considerable amount of knowledge. Here, I'd like to share that knowledge with you.
To that end, I created 2 Jupyter Notebooks that I use for my own model training and audio sample preparation.
I'd like to stress that these are only my own methods and findings, and there is probably a better way to do things, especially in the audio preparation part. But since I tried, and failed, to find a reliable automated method there, some manual fine-tuning steps are still required to perfect the audio input.
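One automated approach of the kind alluded to here is silence-based segmentation. The sketch below (an illustration, not the notebooks' actual code) scans a 16-bit mono PCM WAV in fixed windows, flags windows whose RMS falls below a threshold, and proposes cut points in the middle of sufficiently long silent gaps. All three parameters are guesses that typically need exactly the sort of manual tuning described above.

```python
import struct
import wave

def find_silence_splits(wav_path, window_ms=20, silence_thresh=500,
                        min_silence_ms=300):
    """Return sample offsets of silent-gap midpoints in a 16-bit mono
    PCM WAV file, as rough candidate cut points between utterances."""
    with wave.open(wav_path, "rb") as w:
        rate = w.getframerate()
        frames = w.readframes(w.getnframes())
    samples = struct.unpack(f"<{len(frames) // 2}h", frames)

    win = max(1, rate * window_ms // 1000)        # samples per window
    min_win = max(1, min_silence_ms // window_ms)  # windows per silent gap

    # Flag each fixed-size window as quiet (RMS below threshold) or loud.
    quiet = []
    for i in range(0, len(samples) - win, win):
        chunk = samples[i:i + win]
        rms = (sum(s * s for s in chunk) / win) ** 0.5
        quiet.append(rms < silence_thresh)

    # Emit a cut point at the midpoint of every long-enough quiet run.
    cuts, run = [], 0
    for idx, q in enumerate(quiet):
        if q:
            run += 1
        else:
            if run >= min_win:
                cuts.append((idx - run // 2) * win)
            run = 0
    return cuts
```

In practice, breaths, soft consonants, and room noise defeat any single threshold, which is why the cut points still want a manual review pass.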
My 2 Notebooks:
These notebooks can be used on Google Colab but also outside of it, on a dedicated cloud machine running Jupyter.
Perhaps I should clarify that I'm still in the process of learning and, as such, have yet to create a production-ready model. See my last update below for more information on production-ready model training and results.
I previously used these Notebooks to create a low-quality model from ~150 WAV files of 1 to 2.5 seconds each, as a proof of concept. At present, I use them to finalize my production-ready model training.
To that end, I used 2 TensorDock Cloud GPU Machines:
for the 1st training phase, a beefy one (costing approx. $3.38/hr) with:
I've done 1000 epochs, of which only the first 400 were needed, at least judging by the validation loss (around 0.38 - 0.4). Those 400 epochs finished very quickly on this small data set (in less than 1 or 2 hours, if I recall correctly).
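Spotting the point where validation loss plateaus (here, around epoch 400 at roughly 0.38-0.4) can be automated with a simple patience rule over the logged losses. A minimal sketch, with hypothetical parameter values:

```python
def best_epoch(val_losses, patience=50, min_delta=1e-3):
    """Return (epoch, loss) of the last meaningful improvement; stop
    scanning once `patience` epochs pass without an improvement of at
    least `min_delta` -- a cheap way to spot where training plateaued."""
    best_i, best = 0, float("inf")
    for i, v in enumerate(val_losses):
        if v < best - min_delta:
            best_i, best = i, v          # genuine improvement
        elif i - best_i >= patience:
            break                        # plateau: stop scanning
    return best_i, best
```

The same rule, applied during training rather than after it, is the usual early-stopping criterion and would have saved the 600 extra epochs.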
for the 2nd training phase, I used a similar machine but with only a single GPU, since there is still a bug in the code that doesn't allow us to use DDP (accelerate) training in this phase. The cost went down to approx. $0.65/hr, and it took about 2-4 hours to finish 100 epochs on this small data set.
I'm currently working towards the creation of a large high-quality corpus, spanning approx. 45 hours of audio.
The source for this corpus is 4 audiobooks read by another high-quality (now decommissioned) TTS voice to which I had a commercial license.
My goal is to try and train a similarly-sounding voice model by utilizing StyleTTS2.
Here is an example of what the original TTS voice sounds like: https://jmp.sh/zOrGoel3
I should also mention that I already used the set of those ~150 WAV files to fine-tune StyleTTS2 to the new voice, and even with as little data as those files provide, I was able to achieve a very good voice transfer quality.
I hope these Notebooks will help someone to automate their training, too :)