Are Tarred + bucketing data sets supposed to be faster? #4217
-
I'm trying to train ASR models on a fairly unbalanced dataset (lengthwise: a lot of short utterances, but also long ones up to 1 min). I have been able to train with a smaller batch size and also with some pre-segmentation to shorten the max duration. I'm now trying bucketing, but it seems to work only with tarred datasets. After tarring the original set and enabling bucketing, the estimated time for one epoch goes up to ~7 h, versus ~2 h when training on the raw audio files. If I use the tarred version without bucketing, I get ~10 h estimated per epoch. I'm using something like 4 buckets and 128 shards, on A100 GPUs on a Google Cloud VM, with NeMo 1.7.0. Are these numbers normal? If so, when does it make more sense to use the tarred + bucketing strategy versus just training on the wav files?
-
No, this is not expected. At most a few seconds more is plausible, not several times longer training. From the logs, can you extract and plot the time per step (train and validation) for loose files versus the tarred/bucketed dataset? If it's not the model itself, then data loading on GCP is the problem, which would be odd, because tarred datasets are exactly what we use on such network filesystems.
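To compare runs, you can scrape per-step timings out of the training logs and summarize them per run. This is a generic sketch: the `train_step_timing=` key and the regex are assumptions; adjust them to whatever your NeMo/Lightning logs actually emit.

```python
import re
import statistics

# Hypothetical log excerpt; the "train_step_timing=" key is an assumption,
# substitute the field your logs actually contain.
sample_log = """
Epoch 0: step 10 train_step_timing=0.45
Epoch 0: step 20 train_step_timing=0.47
Epoch 0: step 30 train_step_timing=0.44
"""

def step_times(log_text: str) -> list[float]:
    """Extract per-step timings (in seconds) from raw log text."""
    return [float(m) for m in re.findall(r"train_step_timing=([\d.]+)", log_text)]

times = step_times(sample_log)
print(f"steps: {len(times)}, mean step time: {statistics.mean(times):.3f}s")
```

Run it once on the loose-files log and once on the tarred/bucketed log; if the mean step time is several times higher in the tarred run, the bottleneck is in data loading rather than the model.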