Tips for training a speaker embeddings model for fast convergence? #10039
gabitza-tech started this conversation in General
Hello everybody,
I would like to train an ECAPA-TDNN model on VoxCeleb2, and on VoxCeleb2+CN-Celeb2, for speaker verification/identification. I don't need it to be state-of-the-art (e.g., complicated pretraining followed by large-margin fine-tuning); I just want it to be moderately performant as quickly as possible, since model performance is not the primary focus.
I can train on an A100 GPU. I was thinking of randomly cropping the training audio to 3 s segments and then applying RIR and MUSAN augmentation, (0.95-1.05) speed perturbation, and SpecAugment. I would also use AAM-Softmax loss with s=30 and m=0.2. How many epochs would I need, and what is the expected training time for such a model (on VoxCeleb2, for example)? Is it possible to obtain good performance with such a simple configuration (e.g., under 1% EER on VoxCeleb1-O)?
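For concreteness, here is a minimal PyTorch/torchaudio sketch of the pieces I have in mind: the 3 s random crop, the {0.95, 1.0, 1.05} speed perturbation, and an AAM-Softmax head with s=30, m=0.2. The function names and the resampling-based speed perturbation are just my own illustration, not taken from any particular recipe:

```python
import random

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchaudio


def random_crop(wave: torch.Tensor, sr: int = 16000, dur: float = 3.0) -> torch.Tensor:
    """Crop (or zero-pad) a waveform of shape (..., time) to a fixed-length segment."""
    n = int(sr * dur)
    if wave.size(-1) <= n:
        return F.pad(wave, (0, n - wave.size(-1)))
    start = random.randint(0, wave.size(-1) - n)
    return wave[..., start:start + n]


def speed_perturb(wave: torch.Tensor, sr: int = 16000) -> torch.Tensor:
    """Speed perturbation with a factor drawn from {0.95, 1.0, 1.05}."""
    factor = random.choice([0.95, 1.0, 1.05])
    if factor == 1.0:
        return wave
    # Reinterpret the signal as sampled at sr * factor, then resample back to sr;
    # this changes duration and pitch, like Kaldi/sox-style speed perturbation.
    return torchaudio.functional.resample(wave, orig_freq=int(sr * factor), new_freq=sr)


class AAMSoftmax(nn.Module):
    """Additive angular margin softmax: s * cos(theta + m) on the target class."""

    def __init__(self, embed_dim: int, n_classes: int, s: float = 30.0, m: float = 0.2):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(n_classes, embed_dim))
        nn.init.xavier_uniform_(self.weight)
        self.s, self.m = s, m

    def forward(self, emb: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between L2-normalised embeddings and class prototypes
        cosine = F.linear(F.normalize(emb), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the margin m only to the target-class angle, then rescale by s
        target = F.one_hot(labels, cosine.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.m), cosine)
        return F.cross_entropy(self.s * logits, labels)
```

Here `emb` would be the output of the ECAPA-TDNN embedding layer; I've seen recipes warm the margin up from 0 to its final value over the first epochs, which may help convergence.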
I saw that some people apply offline augmentation to create more speakers, or do 2-stage training with pretraining followed by fine-tuning, etc. But I would like to train both models in just a couple of days (an A100 is a pretty big GPU), and I don't need the peak performance that the more complicated pipelines achieve, just decent performance.
Any other tips would be greatly appreciated! I would mainly like to know whether I can reach this level of performance in a reasonable time, plus any advice for fast convergence in as few epochs as possible. :D
Thank you in advance!