Description
From my observation, NTL seems to initialize certain components based on the tokenizer vocabulary size. Currently, the code uses len(self.tokenizer) (or equivalently len(self.tokenizer.get_vocab())), as shown below:
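A minimal sketch of the pattern (class and attribute names here are placeholders, not the actual NTL source):

```python
import torch

class NumberTokenLossSketch:
    """Simplified sketch of the initialization pattern (placeholder names)."""

    def __init__(self, tokenizer, device="cpu"):
        self.tokenizer = tokenizer
        vocab_size = len(self.tokenizer)  # <-- vocabulary size inferred from the tokenizer

        # Per-token numeric values over the vocabulary; non-number tokens stay NaN.
        self.token_values = torch.full((vocab_size,), float("nan"), device=device)
        for token, idx in self.tokenizer.get_vocab().items():
            stripped = token.lstrip("Ġ▁")  # common BPE/SentencePiece prefixes
            if stripped.isdigit():
                self.token_values[idx] = float(stripped)

        # Boolean mask over the vocabulary marking number tokens,
        # later used to index the logits' vocabulary dimension.
        self.number_token_mask = ~torch.isnan(self.token_values)
```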
However, this is inaccurate. I found that in many LLMs/VLMs (at least in my case with Qwen2.5), len(tokenizer) is smaller than the actual index range used by the model, i.e. model.get_input_embeddings().weight.shape[0], as shown below:
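A quick way to check this with transformers (the checkpoint name is only an example from the Qwen2.5 family):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

print(len(tokenizer))                                # tokenizer vocabulary (incl. added tokens)
print(model.get_input_embeddings().weight.shape[0])  # embedding rows = logits vocab dimension
# The second number is larger, matching the 151665 vs 151936 mismatch in the error below.
```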
This leads to an error like:
IndexError: The shape of the mask [151665] at index 0 does not match the shape of the indexed tensor [2, 504, 151936] at index 2
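The failure is a plain shape mismatch: the number-token mask is built from len(tokenizer), but it is used to index the logits' vocabulary dimension, which is sized by the embedding matrix. A minimal reproduction with the sizes from the error above:

```python
import torch

mask_size = 151665     # len(tokenizer)
logits_vocab = 151936  # model.get_input_embeddings().weight.shape[0]

logits = torch.zeros(2, 504, logits_vocab)
number_token_mask = torch.zeros(mask_size, dtype=torch.bool)

logits[:, :, number_token_mask]
# IndexError: The shape of the mask [151665] at index 0 does not match
# the shape of the indexed tensor [2, 504, 151936] at index 2
```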
When I was using the Scratch version of the MWE of NTL, I encountered the same issue. My workaround was to add an explicit function parameter that takes the true vocab_size used by the model at runtime, instead of inferring it from the tokenizer. See below:

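Again a simplified sketch with placeholder names rather than the exact code, just to show the shape of the change:

```python
import torch

class NumberTokenLossSketch:
    """Same sketch as above, but sized by an explicit vocab_size."""

    def __init__(self, tokenizer, vocab_size=None, device="cpu"):
        self.tokenizer = tokenizer
        # Fall back to the tokenizer length only if the caller does not pass
        # the model's true vocabulary size.
        vocab_size = vocab_size if vocab_size is not None else len(tokenizer)

        self.token_values = torch.full((vocab_size,), float("nan"), device=device)
        for token, idx in tokenizer.get_vocab().items():
            stripped = token.lstrip("Ġ▁")
            if stripped.isdigit():
                self.token_values[idx] = float(stripped)

        self.number_token_mask = ~torch.isnan(self.token_values)

# Usage: size the loss by the embedding matrix, not the tokenizer.
# ntl = NumberTokenLossSketch(tokenizer, vocab_size=model.get_input_embeddings().weight.shape[0])
```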
Currently, I am using the PyPI version (since it supports Wasserstein distance), but this issue still needs to be fixed.
Suggestion:
Allow vocab_size to be explicitly passed into the NTL initialization (e.g., NumberTokenLoss(vocab_size=model.get_input_embeddings().weight.shape[0], ...)) to ensure compatibility with models where the tokenizer vocabulary length does not match the actual embedding matrix size.