Hi, thanks for the incredibly detailed discussion! First off, let's start with some important considerations:

  1. You have a tokenizer and are referring to the BPE config, but you are manually inserting parts of the config from the character-based models. Please refer only to the BPE / subword config when using tokenizers. Mixing the two will not work and may silently cause major issues.

  2. A validation set cannot be used as a tarred dataset. It is also quite wasteful to have over 10 hours of validation data, because you'll mostly cover the whole vocabulary with that much. Also, NeMo does not support validation and test data loaders being tarred datasets, because they drop samples, which would make results incomp…
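
To make the two points above concrete, here is a rough sanity check you could run on the model section of your config before training. It is only a sketch in plain Python, not part of NeMo: the helper `check_asr_config` is hypothetical, and the key names (`tokenizer`, `labels`, `validation_ds`, `test_ds`, `is_tarred`) are assumptions modeled on typical NeMo ASR configs, so adjust them to whatever your config actually contains.

```python
from typing import Any, Dict, List


def check_asr_config(model_cfg: Dict[str, Any]) -> List[str]:
    """Return human-readable warnings for a model config given as a plain dict.

    Hypothetical helper: the key names below are assumptions, not a NeMo API.
    """
    warnings: List[str] = []

    # Point 1: a subword (BPE) config carries a `tokenizer` section, while a
    # character-based config carries a `labels` list. Seeing both suggests
    # that the two config styles were mixed.
    if "tokenizer" in model_cfg and "labels" in model_cfg:
        warnings.append(
            "both 'tokenizer' and 'labels' are set; use the BPE config only"
        )

    # Point 2: validation and test sets should not be tarred datasets.
    for split in ("validation_ds", "test_ds"):
        ds_cfg = model_cfg.get(split) or {}
        if ds_cfg.get("is_tarred", False):
            warnings.append(f"'{split}' is tarred; use a plain manifest instead")

    return warnings
```

Calling `check_asr_config(cfg)` on a dict that has both a `tokenizer` section and a `labels` list, plus a tarred `validation_ds`, would return two warnings. A tarred `train_ds` is not flagged, since the restriction above only concerns validation and test.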

Answer selected by titu1994