How to set bucketing_batch_size config for adaptive-size bucketing? #6016
-
I've started using bucketed datasets to train an ASR model. The relevant part of my config is:

```python
cfg.train_ds.is_tarred = True
cfg.train_ds.tarred_audio_filepaths = f"{train_dir}/bucket1/audio__OP_0..511_CL_.tar,\
{train_dir}/bucket2/audio__OP_0..511_CL_.tar,\
{train_dir}/bucket3/audio__OP_0..511_CL_.tar,\
{train_dir}/bucket4/audio__OP_0..511_CL_.tar,\
{train_dir}/bucket5/audio__OP_0..511_CL_.tar,\
{train_dir}/bucket6/audio__OP_0..511_CL_.tar,\
{train_dir}/bucket7/audio__OP_0..511_CL_.tar,\
{train_dir}/bucket8/audio__OP_0..511_CL_.tar"
cfg.train_ds.bucketing_batch_size = 16
```

I've set the `bucketing_batch_size` parameter, but I'm not sure this is the right way to enable adaptive-size bucketing. Could someone verify the configuration?
Replies: 3 comments
-
@VahidooX we should log what type of data loader is being used inside the dataset classes.
-
For bucketing, you need to pass the tarred filepaths as a list of lists, like `[[],[],[],[]]`. Please take a look at the documentation for the right format.
-
For adaptive bucketing, set `train_ds.batch_size=1`, and set `train_ds.bucketing_batch_size` either to a fixed number for linear scaling, or manually per bucket, like `train_ds.bucketing_batch_size=[70,64,56,48,40,32,24,16]`. Linear scaling can be aggressive in many cases, so I suggest specifying the batch size for each bucket manually like this.
You may also set `train_ds.bucketing_strategy=fully_randomized` to trade some of the speedup for probably better accuracy.
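Putting the suggestions above together, here is a minimal sketch of what the corrected config could look like. The bucket paths, `train_dir`, and the per-bucket batch sizes are illustrative placeholders, and `SimpleNamespace` stands in for the real config object just to make the sketch self-contained:

```python
from types import SimpleNamespace

# Stand-in for the real training config object (e.g. an OmegaConf DictConfig).
cfg = SimpleNamespace(train_ds=SimpleNamespace())

train_dir = "/data/train"  # placeholder path

# Tarred filepaths as a list of lists: one inner list per bucket,
# assuming the 8-bucket layout from the question.
cfg.train_ds.is_tarred = True
cfg.train_ds.tarred_audio_filepaths = [
    [f"{train_dir}/bucket{i}/audio__OP_0..511_CL_.tar"] for i in range(1, 9)
]

# Adaptive bucketing: batch_size must be 1, and bucketing_batch_size
# lists one batch size per bucket (largest batches for shortest audio).
cfg.train_ds.batch_size = 1
cfg.train_ds.bucketing_batch_size = [70, 64, 56, 48, 40, 32, 24, 16]

# Optional: lower speedup but probably better accuracy.
cfg.train_ds.bucketing_strategy = "fully_randomized"
```

Note that the number of entries in `bucketing_batch_size` matches the number of buckets, and the original flat comma-separated string is replaced by a list of lists.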