Replies: 2 comments
-
@nithinraok Could you please assist me with some thoughts regarding that?
-
It is not recommended to try SSL with a dataset of that size; I don't think you would see any benefits. With only ~300 hours of pretraining data, you are unlikely to gain much. You could use this dataset to add more hours: https://github.com/facebookresearch/libri-light
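If you do decide to top up your ~300 hours with Libri-Light, the extra audio has to end up in the same manifest format as the rest of your pretraining data. Below is a minimal sketch, not NeMo's own tooling, that walks a directory of FLAC files and writes a NeMo-style JSON-lines manifest (audio_filepath, duration, and an empty text field, since SSL pretraining needs no transcripts). The directory and output paths are placeholders.

```python
import json
from pathlib import Path

import soundfile as sf  # pip install soundfile

# Placeholder paths -- point these at your own data.
AUDIO_DIR = Path("/data/libri-light/small")
MANIFEST_PATH = Path("/data/manifests/libri_light_pretrain.json")


def write_manifest(audio_dir: Path, manifest_path: Path) -> None:
    """Write one JSON object per line: the manifest layout NeMo ASR dataloaders expect."""
    manifest_path.parent.mkdir(parents=True, exist_ok=True)
    with manifest_path.open("w") as fout:
        for audio_file in sorted(audio_dir.rglob("*.flac")):
            info = sf.info(audio_file)
            entry = {
                "audio_filepath": str(audio_file),
                "duration": round(info.frames / info.samplerate, 3),
                "text": "",  # no transcripts needed for SSL pretraining
            }
            fout.write(json.dumps(entry) + "\n")


if __name__ == "__main__":
    write_manifest(AUDIO_DIR, MANIFEST_PATH)
```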
-
I am trying to run SSL (Wav2Vec-BERT) on a small speech dataset (~300 hours) for an ASR task. I understand that pretraining is normally done with large, diverse datasets, producing representations that can later be fine-tuned for downstream tasks. However, I want to explore the gain from SSL on a medium-sized, domain-specific English dataset and compare it against fine-tuning scenarios. Could you please give me some tips for my case? I am using the fast-conformer config, following the instructions in https://github.com/NVIDIA/NeMo/tree/main/examples/asr/speech_pretraining
My audio durations vary widely, from about 1 s to 100 s. Since the data is limited, I want to use all of it, so I set min_duration to 1 s and max_duration to 100 s. Could that conflict with anything in the SSL config? Should I consider a smaller learning rate or a particular model size (e.g., Large vs. XLarge)? If I use Lhotse, are there settings I should adjust to suit the size of my dataset (~300 hours)?
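One thing that may help in choosing the min_duration/max_duration cutoffs (and Lhotse bucket boundaries, if you use them) is to look at the actual duration distribution of the manifest first. This is a small, hypothetical sketch over a NeMo-style JSON-lines manifest; the path and the 1 s / 100 s thresholds are just the values discussed above.

```python
import json
from pathlib import Path

MANIFEST_PATH = Path("/data/manifests/domain_pretrain.json")  # placeholder path
MIN_DUR, MAX_DUR = 1.0, 100.0  # seconds, matching the cutoffs above

# Collect per-utterance durations from the manifest.
durations = []
with MANIFEST_PATH.open() as fin:
    for line in fin:
        durations.append(float(json.loads(line)["duration"]))
durations.sort()

total_h = sum(durations) / 3600
kept = [d for d in durations if MIN_DUR <= d <= MAX_DUR]
kept_h = sum(kept) / 3600

print(f"total: {len(durations)} utterances, {total_h:.1f} h")
print(f"kept within [{MIN_DUR}, {MAX_DUR}] s: {len(kept)} utterances, {kept_h:.1f} h")
print(f"dropped: {total_h - kept_h:.2f} h")

# Rough percentiles to guide duration-bucketing choices (e.g., Lhotse buckets).
for q in (0.5, 0.9, 0.95, 0.99):
    idx = int(q * (len(durations) - 1))
    print(f"p{int(q * 100)} duration: {durations[idx]:.1f} s")
```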
I also want to try NEST for SSL. Should I use settings different from the defaults in the config file, given my much smaller dataset? What factors should I consider when augmenting with noise, given how little data I have? Any tips would help. Thank you.
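For context on the noise-augmentation question, the sketch below shows the basic operation involved: mixing a noise clip into a speech clip at a chosen signal-to-noise ratio. This is a generic NumPy illustration, not NeMo's or NEST's actual augmentation code; the file names and the SNR range are placeholders you would tune to your data, and the clips are assumed to be mono.

```python
import numpy as np
import soundfile as sf  # pip install soundfile


def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Additively mix `noise` into `speech` at the requested SNR (in dB), mono audio assumed."""
    # Tile or trim the noise so it covers the whole speech clip.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    mixed = speech + scale * noise

    # Avoid clipping when writing back to a fixed-point format.
    peak = np.max(np.abs(mixed))
    return mixed / peak if peak > 1.0 else mixed


if __name__ == "__main__":
    speech, sr = sf.read("speech.wav")  # placeholder files
    noise, _ = sf.read("noise.wav")
    snr_db = np.random.uniform(0, 20)   # e.g., sample an SNR per utterance
    sf.write("speech_noisy.wav", mix_at_snr(speech, noise, snr_db), sr)
```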