In NeMo Megatron, the default behavior for surplus text that doesn't fit within a model's maximum sequence length seems to differ between model implementations. Could you shed some light on the defaults for the different model classes, and say for which of them it is necessary or beneficial for users of the library to pre-split long documents before running them through the Megatron preprocessing scripts?
For example, in the Megatron GPT class of models, any excess data in long documents appears to be used in training, via sample indices built within and between documents in `_build_sample_index()`.
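To make sure I'm reading this correctly, here is a toy sketch of what I understand that packing to mean; the token lists, `seq_len`, and EOD handling below are made up for illustration, not the actual NeMo code:

```python
# Toy illustration of GPT-style packing (not the real _build_sample_index()):
# all documents are concatenated into one token stream (with an EOD token in
# between) and then cut into fixed-length samples, so the tail of a long
# document still ends up in some training sample instead of being dropped.

def pack_into_samples(docs, seq_len, eod_id=0):
    """docs: list of token-id lists; returns samples of exactly seq_len tokens."""
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eod_id)  # document separator
    return [
        stream[i : i + seq_len]
        for i in range(0, len(stream) - seq_len + 1, seq_len)
    ]

docs = [list(range(1, 8)), list(range(10, 13))]  # a 7-token and a 3-token "document"
print(pack_into_samples(docs, seq_len=4))
# [[1, 2, 3, 4], [5, 6, 7, 0], [10, 11, 12, 0]] -- every token lands in a sample
```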
However, in T5 and UL2 I'm not really sure what happens. By default `respect_document_boundaries` is set to `True`, and instead of `_build_sample_index()` we go to `get_samples_index()`. As far as I can see, there is no equivalent logic there to build sample indices within documents, and everything longer than `target_seq_len` appears to be truncated and thrown away.
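For comparison, this is what I understand the `respect_document_boundaries=True` path to amount to, again as a toy sketch rather than the real `get_samples_index()` code:

```python
# Toy illustration of the boundary-respecting path (not the real code):
# one sample per document, truncated to the target length; the excess tokens
# of long documents are discarded rather than forming additional samples.

def truncate_per_document(docs, target_seq_len):
    return [doc[:target_seq_len] for doc in docs]

docs = [list(range(1, 8)), list(range(10, 13))]
print(truncate_per_document(docs, target_seq_len=4))
# [[1, 2, 3, 4], [10, 11, 12]] -- tokens 5, 6, 7 of the long document are lost
```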
Some questions I'm hoping you could clarify:
1. If `respect_document_boundaries=True`, should I pre-split my long documents before preprocessing them with the NeMo Megatron preprocessing scripts and proceeding with model pretraining? (A sketch of the kind of pre-splitting I mean is at the end of this post.)
2. Is input packing (`respect_document_boundaries=False`) generally discouraged for pretraining T5/UL2?
Clarifications in the NeMo Megatron docs about these differing assumptions for the different model classes would be helpful.
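To make question 1 concrete, this is the kind of pre-splitting I have in mind, applied to a JSONL corpus before running the preprocessing script. The file names, the `text` field, the whitespace "tokenization", and the 512-token limit are all placeholders for whatever the real tokenizer and limit would be:

```python
# Hypothetical pre-splitting step, applied before NeMo's preprocessing script:
# each long document in the input JSONL is chopped into several shorter
# documents so that nothing gets truncated away later. Whitespace splitting
# stands in for real tokenization, and 512 is an arbitrary example limit.

import json

def presplit_jsonl(in_path, out_path, max_tokens=512):
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            doc = json.loads(line)
            words = doc["text"].split()
            for start in range(0, len(words), max_tokens):
                chunk = dict(doc, text=" ".join(words[start : start + max_tokens]))
                fout.write(json.dumps(chunk) + "\n")

# presplit_jsonl("corpus.jsonl", "corpus_split.jsonl")
```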