Are Tarred + bucketing data sets supposed to be faster? #4217
-
I'm trying to train ASR models on a fairly unbalanced dataset (lengthwise: a lot of short utterances, but also long ones up to 1 min). I have been able to train with a smaller batch size and also with some pre-segmentation to shorten the max duration. I'm now trying bucketing, but it seems to work only with tarred datasets. After tarring the original set and enabling bucketing, the estimated time for one epoch goes up to ~7 h, versus ~2 h when training on the raw audio files. If I use the tarred version without bucketing, I get ~10 h estimated per epoch. I'm using something like 4 buckets and 128 shards, on A100 GPUs on a Google Cloud VM, with NeMo 1.7.0. Are these numbers normal? If so, when does it make more sense to use the tarred + bucketing strategy versus just training on the wav files?
-
No, this is not expected. At most a few seconds more is plausible, not several times longer training. From the logs, can you extract and plot the time per step (train and validation) for loose files versus the tarred/bucketed dataset? If it's not the model itself, then data loading on GCP is the problem, which would be odd, because tarred datasets are exactly what we use on such network filesystems.
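To compare runs, you can scrape per-step timings out of the training logs and summarize them per run. This is a generic sketch: the `train_step_timing=` key and the regex are assumptions; adjust them to whatever your NeMo/Lightning logs actually emit.

```python
import re
import statistics

# Hypothetical log excerpt; the "train_step_timing=" key is an assumption,
# substitute the field your logs actually contain.
sample_log = """
Epoch 0: step 10 train_step_timing=0.45
Epoch 0: step 20 train_step_timing=0.47
Epoch 0: step 30 train_step_timing=0.44
"""

def step_times(log_text: str) -> list[float]:
    """Extract per-step timings (in seconds) from raw log text."""
    return [float(m) for m in re.findall(r"train_step_timing=([\d.]+)", log_text)]

times = step_times(sample_log)
print(f"steps: {len(times)}, mean step time: {statistics.mean(times):.3f}s")
```

Run it once on the loose-files log and once on the tarred/bucketed log; if the mean step time is several times higher in the tarred run, the bottleneck is in data loading rather than the model.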