
BUG: Tokenizer length mismatch causes IndexError in NTL initialization (vocab_size < embedding size) #19

@Zuozhuo

Description


From my observation, NTL initializes some of its components based on the tokenizer vocabulary size.
The code currently uses len(self.tokenizer) (or, equivalently, len(self.tokenizer.get_vocab())), as shown below:

[Screenshots: NTL initialization code sizing its buffers with len(self.tokenizer) / len(self.tokenizer.get_vocab())]
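Since the screenshots may not render here, a minimal sketch of the pattern I am describing (hypothetical class and attribute names, not the actual NTL source):

```python
import torch

class NumberTokenLoss:  # hypothetical reconstruction, not the actual NTL code
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        # Buffers are sized by the tokenizer length ...
        vocab_size = len(self.tokenizer)  # same value as len(self.tokenizer.get_vocab())
        # ... e.g. a boolean mask marking which vocab indices are number tokens
        self.number_mask = torch.zeros(vocab_size, dtype=torch.bool)
        for token, idx in self.tokenizer.get_vocab().items():
            if token.strip().isdigit():
                self.number_mask[idx] = True
```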

However, this is not always correct.
I found that in many LLMs/VLMs (at least in my case with Qwen2.5), len(tokenizer) is smaller than the actual index range the model uses, i.e. model.get_input_embeddings().weight.shape[0], as shown below:

[Screenshot: len(tokenizer) vs. model.get_input_embeddings().weight.shape[0] for Qwen2.5]
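The mismatch is easy to check directly. I am using an arbitrary Qwen2.5 checkpoint here for illustration; the numbers match the ones in the error below:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

print(len(tokenizer))                                # 151665 in my case
print(model.get_input_embeddings().weight.shape[0])  # 151936 (padded embedding matrix)
```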

This leads to an error like:

```
IndexError: The shape of the mask [151665] at index 0 does not match the shape of the indexed tensor [2, 504, 151936] at index 2
```


I encountered the same issue when using the Scratch version of the NTL MWE.

My workaround was to add an explicit parameter that takes the true vocab_size the model uses at runtime, instead of inferring it from the tokenizer. See below:

[Screenshot: workaround with an explicit vocab_size parameter]
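The change itself is small; roughly this (again a sketch with hypothetical names, not the code from the screenshot):

```python
class NumberTokenLoss:  # sketch of the workaround, not the actual patch
    def __init__(self, tokenizer, vocab_size=None):
        self.tokenizer = tokenizer
        # Prefer an explicitly passed vocab size (taken from the model at runtime);
        # fall back to the tokenizer length only when nothing is provided.
        self.vocab_size = vocab_size if vocab_size is not None else len(tokenizer)
        # All subsequent masks/buffers are then allocated with self.vocab_size
        # instead of len(self.tokenizer).
```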

Currently, I am using the PyPI version (since it supports the Wasserstein distance), but this issue still needs to be fixed.


Suggestion:
Allow vocab_size to be explicitly passed into the NTL initialization (e.g., NumberTokenLoss(vocab_size=model.get_input_embeddings().weight.shape[0], ...))
to ensure compatibility with models where the tokenizer vocabulary length does not match the actual embedding matrix size.
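For reference, the call site would then look like this (the vocab_size keyword is the proposed addition; it does not exist in the released package yet):

```python
vocab_size = model.get_input_embeddings().weight.shape[0]  # 151936 for Qwen2.5
ntl = NumberTokenLoss(tokenizer=tokenizer, vocab_size=vocab_size)

# The internal number-token mask now has length 151936, matching logits.shape[-1],
# so the masked indexing that previously raised the IndexError goes through.
```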
