How adaptive bucketing should work? #3302

nasretdinovr · 2021-12-08T15:02:14Z

nasretdinovr
Dec 8, 2021
Collaborator

As I understand it, for bucketing to work I have to provide multiple manifests and the same number of tarred_audio_filepaths. A BucketingDataset with a different batch size is then created for each manifest/tarred_audio_filepath pair, right? If that is correct, I don't understand how or why it would make training faster.

Answered by VahidooX


VahidooX · 2021-12-10T02:35:50Z

VahidooX
Dec 10, 2021
Collaborator

When you use bucketing, audio files of similar length are placed in the same batch, so less padding is needed. This makes training faster because max_length in a batch is smaller on average. Adaptive bucketing makes that even more efficient: buckets with shorter audio get larger batches, which utilizes the GPUs more efficiently, so each epoch finishes faster. It can give more than a 2x speedup in training.
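To make the padding argument concrete, here is a minimal sketch in plain Python. It is not NeMo code: the durations are randomly generated, sorting by length stands in for bucketing, and `padded_cost` and `adaptive_batch_sizes` are hypothetical helpers introduced only for illustration. The adaptive rule used here (a fixed budget of padded seconds per batch) is an assumption and may differ from what `calc_bucketing_batch_sizes` actually does.

```python
import random


def padded_cost(lengths, batch_size):
    """Total seconds processed when every batch is padded to its longest item."""
    total = 0.0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)
    return total


def adaptive_batch_sizes(bucket_max_durations, seconds_per_batch):
    """Hypothetical rule: fit as many items as a fixed padded-seconds budget allows."""
    return [max(1, int(seconds_per_batch // d)) for d in bucket_max_durations]


random.seed(0)
# Illustrative utterance durations in seconds (synthetic, not measured data).
durations = [random.uniform(1.0, 20.0) for _ in range(1024)]

real_cost = sum(durations)
shuffled_cost = padded_cost(durations, batch_size=16)
# Sorting groups similar lengths together, which is what bucketing approximates.
bucketed_cost = padded_cost(sorted(durations), batch_size=16)

print(f"padding overhead, random batches:   {shuffled_cost / real_cost - 1:.1%}")
print(f"padding overhead, bucketed batches: {bucketed_cost / real_cost - 1:.1%}")

# Buckets with shorter audio get larger batches under the same budget.
sizes = adaptive_batch_sizes([5.0, 10.0, 20.0], seconds_per_batch=160.0)
print(sizes)
```

With a fixed batch size, the short-audio buckets leave GPU memory unused. Scaling the batch size per bucket fills that memory, so fewer steps are needed per epoch.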

4 replies

titu1994 Dec 10, 2021
Maintainer

@VahidooX we need proper docs explaining this; it's complex to present. They could go either in the ASR API docs or in a tutorial (it may be good to show how to build a tarred dataset and a bucketed dataset, and how to use them). What do you think?

nasretdinovr Dec 10, 2021
Collaborator Author

In the function get_tarred_dataset (nemo.collections.asr.data.audio_to_text_dataset), bucketing is used when len(datasets) is greater than 1. As I understand it, len(datasets) is greater than 1 when several tarred_audio_filepaths and manifests are provided in the config. Is my reasoning right?

    if len(datasets) > 1:
        if config.get('bucketing_batch_size', None) is not None:
            bucketing_batch_sizes = calc_bucketing_batch_sizes(config, len(datasets))
            logging.info(
                f"Batch bucketing is enabled for {len(datasets)} buckets with adaptive batch sizes of {bucketing_batch_sizes}!"
            )
        else:
            logging.info(
                f"Batch bucketing is enabled for {len(datasets)} buckets with fixed batch size of {config['batch_size']}!"
            )

        for idx, dataset in enumerate(datasets):
            datasets[idx] = audio_to_text.BucketingDataset(
                dataset=dataset, bucketing_batch_size=bucketing_batch_sizes[idx]
            )

VahidooX Dec 10, 2021
Collaborator

I am going to add proper documentation for the next release. Passing several manifests or several tarred files does not enable bucketing by itself; in that case the datasets list has just one item (one dataset object). By the time we reach this point in the code, tarred_audio_filepath and manifest_filepath should be lists of lists, and even if the user passes a flat list, it gets standardized to this format. The convert_to_config_list function called here ( https://github.com/NVIDIA/NeMo/blob/e3ba99018974b216f76741e7ada0867e69b38cea/nemo/collections/asr/data/audio_to_text_dataset.py#L162 ) is responsible for converting them into that format.
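The flat-list versus list-of-lists distinction can be sketched as follows. This is an illustration of the idea only, not the actual convert_to_config_list implementation: `to_list_of_lists` is a hypothetical helper, the file names are made up, and the sketch assumes that each inner list corresponds to one dataset (one bucket).

```python
def to_list_of_lists(paths):
    """Hypothetical normalizer: a flat list becomes a single group of paths."""
    if isinstance(paths, str):
        return [[paths]]
    if all(isinstance(p, str) for p in paths):
        # Flat list: all files belong to one dataset, so no bucketing.
        return [list(paths)]
    # Already a list of lists: one inner list per dataset.
    return [list(group) for group in paths]


# Several manifests in a flat list still produce a single dataset.
flat = to_list_of_lists(["manifest_a.json", "manifest_b.json"])
# One inner list per bucket produces several datasets (made-up file names).
nested = to_list_of_lists([["bucket1.json"], ["bucket2.json"], ["bucket3.json"]])

print(len(flat), len(nested))
```

Under that assumption, only the nested form yields more than one dataset, which is the case the `len(datasets) > 1` check in the snippet above is looking for.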

nasretdinovr Dec 10, 2021
Collaborator Author

Thank you for the answer! Waiting for the documentation.
