How to set bucketing_batch_size config for adaptive-size bucketing? #6016
-
I've started using bucketed datasets to train an ASR model. The relevant part of my config is:

```python
cfg.train_ds.is_tarred = True
cfg.train_ds.tarred_audio_filepaths = f"{train_dir}/bucket1/audio__OP_0..511_CL_.tar,\
{train_dir}/bucket2/audio__OP_0..511_CL_.tar,\
{train_dir}/bucket3/audio__OP_0..511_CL_.tar,\
{train_dir}/bucket4/audio__OP_0..511_CL_.tar,\
{train_dir}/bucket5/audio__OP_0..511_CL_.tar,\
{train_dir}/bucket6/audio__OP_0..511_CL_.tar,\
{train_dir}/bucket7/audio__OP_0..511_CL_.tar,\
{train_dir}/bucket8/audio__OP_0..511_CL_.tar"
cfg.train_ds.bucketing_batch_size = 16
```

I've set the `bucketing_batch_size` parameter, but I'm not sure this is the right way to enable adaptive-size bucketing. Could someone verify the configuration?
Replies: 3 comments
-
@VahidooX we should log what type of data loader is being used inside the dataset classes.
-
For bucketing, you need to pass the tarred filepaths as a list of lists, like `[[],[],[],[]]`. Please take a look at the documentation for the right format.
-
For adaptive bucketing, set `train_ds.batch_size=1`, and set `train_ds.bucketing_batch_size` either to a fixed number for linear scaling, or manually per bucket, like `train_ds.bucketing_batch_size=[70,64,56,48,40,32,24,16]`. Linear scaling can be aggressive in many cases, so I suggest specifying the batch size for each bucket manually like this.
You may also set `train_ds.bucketing_strategy=fully_randomized` to trade some of the speedup for probably better accuracy.
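Putting the suggestions above together, here is a minimal sketch of what the corrected config could look like. The bucket paths, `train_dir`, and the per-bucket batch sizes are illustrative placeholders, and `SimpleNamespace` stands in for the real config object just to make the sketch self-contained:

```python
from types import SimpleNamespace

# Stand-in for the real training config object (e.g. an OmegaConf DictConfig).
cfg = SimpleNamespace(train_ds=SimpleNamespace())

train_dir = "/data/train"  # placeholder path

# Tarred filepaths as a list of lists: one inner list per bucket,
# assuming the 8-bucket layout from the question.
cfg.train_ds.is_tarred = True
cfg.train_ds.tarred_audio_filepaths = [
    [f"{train_dir}/bucket{i}/audio__OP_0..511_CL_.tar"] for i in range(1, 9)
]

# Adaptive bucketing: batch_size must be 1, and bucketing_batch_size
# lists one batch size per bucket (largest batches for shortest audio).
cfg.train_ds.batch_size = 1
cfg.train_ds.bucketing_batch_size = [70, 64, 56, 48, 40, 32, 24, 16]

# Optional: lower speedup but probably better accuracy.
cfg.train_ds.bucketing_strategy = "fully_randomized"
```

Note that the number of entries in `bucketing_batch_size` matches the number of buckets, and the original flat comma-separated string is replaced by a list of lists.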