In NeMo Megatron, the default behavior for surplus text that doesn't fit within a model's maximum sequence length seems to differ between model implementations. Could you shed some light on the defaults for the different model classes, and say for which of them it is necessary or beneficial for users of the library to pre-split long documents before running them through the Megatron preprocessing scripts?
For example, in the Megatron GPT class of models, any excess data in long documents appears to be used in training, via sample indices built within and between documents in `_build_sample_index()`.
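To make sure I'm reading this correctly, here is a toy sketch of what I understand that packing to mean; the token lists, `seq_len`, and EOD handling below are made up for illustration, not the actual NeMo code:

```python
# Toy illustration of GPT-style packing (not the real _build_sample_index()):
# all documents are concatenated into one token stream (with an EOD token in
# between) and then cut into fixed-length samples, so the tail of a long
# document still ends up in some training sample instead of being dropped.

def pack_into_samples(docs, seq_len, eod_id=0):
    """docs: list of token-id lists; returns samples of exactly seq_len tokens."""
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eod_id)  # document separator
    return [
        stream[i : i + seq_len]
        for i in range(0, len(stream) - seq_len + 1, seq_len)
    ]

docs = [list(range(1, 8)), list(range(10, 13))]  # a 7-token and a 3-token "document"
print(pack_into_samples(docs, seq_len=4))
# [[1, 2, 3, 4], [5, 6, 7, 0], [10, 11, 12, 0]] -- every token lands in a sample
```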
However, in T5 and UL2 I'm not really sure what happens. By default `respect_document_boundaries` is set to `True`, and instead of `_build_sample_index()` we go to `get_samples_index()`. As far as I can see, there is no equivalent logic there to build sample indices within documents, and everything longer than `target_seq_len` appears to be truncated and thrown away.
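For comparison, this is what I understand the `respect_document_boundaries=True` path to amount to, again as a toy sketch rather than the real `get_samples_index()` code:

```python
# Toy illustration of the boundary-respecting path (not the real code):
# one sample per document, truncated to the target length; the excess tokens
# of long documents are discarded rather than forming additional samples.

def truncate_per_document(docs, target_seq_len):
    return [doc[:target_seq_len] for doc in docs]

docs = [list(range(1, 8)), list(range(10, 13))]
print(truncate_per_document(docs, target_seq_len=4))
# [[1, 2, 3, 4], [10, 11, 12]] -- tokens 5, 6, 7 of the long document are lost
```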
Some questions I'm hoping you could clarify:
1. If `respect_document_boundaries=True`, should I pre-split my long documents before preprocessing them with the NeMo Megatron preprocessing scripts and proceeding with model pretraining? (A sketch of the kind of pre-splitting I mean is at the end of this post.)
2. Is input packing (`respect_document_boundaries=False`) generally discouraged for pretraining T5/UL2?
Clarifications in the NeMo Megatron docs about these differing assumptions for the different model classes would be helpful.
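To make question 1 concrete, this is the kind of pre-splitting I have in mind, applied to a JSONL corpus before running the preprocessing script. The file names, the `text` field, the whitespace "tokenization", and the 512-token limit are all placeholders for whatever the real tokenizer and limit would be:

```python
# Hypothetical pre-splitting step, applied before NeMo's preprocessing script:
# each long document in the input JSONL is chopped into several shorter
# documents so that nothing gets truncated away later. Whitespace splitting
# stands in for real tokenization, and 512 is an arbitrary example limit.

import json

def presplit_jsonl(in_path, out_path, max_tokens=512):
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            doc = json.loads(line)
            words = doc["text"].split()
            for start in range(0, len(words), max_tokens):
                chunk = dict(doc, text=" ".join(words[start : start + max_tokens]))
                fout.write(json.dumps(chunk) + "\n")

# presplit_jsonl("corpus.jsonl", "corpus_split.jsonl")
```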