
BUG: Tokenizer length mismatch causes IndexError in NTL initialization (vocab_size < embedding size) #19

@Zuozhuo

Description


From my observation, NTL initializes some of its components based on the tokenizer vocabulary size.
The code currently uses len(self.tokenizer) (or, equivalently, len(self.tokenizer.get_vocab())), as shown below:

[Screenshots: NTL initialization code sizing its buffers with len(self.tokenizer) / len(self.tokenizer.get_vocab())]
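Since the screenshots may not render here, a minimal sketch of the pattern I am describing (hypothetical class and attribute names, not the actual NTL source):

```python
import torch

class NumberTokenLoss:  # hypothetical reconstruction, not the actual NTL code
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        # Buffers are sized by the tokenizer length ...
        vocab_size = len(self.tokenizer)  # same value as len(self.tokenizer.get_vocab())
        # ... e.g. a boolean mask marking which vocab indices are number tokens
        self.number_mask = torch.zeros(vocab_size, dtype=torch.bool)
        for token, idx in self.tokenizer.get_vocab().items():
            if token.strip().isdigit():
                self.number_mask[idx] = True
```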

However, this is not always correct.
I found that in many LLMs/VLMs (at least in my case with Qwen2.5), len(tokenizer) is smaller than the actual index range the model uses, i.e. model.get_input_embeddings().weight.shape[0], as shown below:

[Screenshot: len(tokenizer) vs. model.get_input_embeddings().weight.shape[0] for Qwen2.5]
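The mismatch is easy to check directly. I am using an arbitrary Qwen2.5 checkpoint here for illustration; the numbers match the ones in the error below:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

print(len(tokenizer))                                # 151665 in my case
print(model.get_input_embeddings().weight.shape[0])  # 151936 (padded embedding matrix)
```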

This leads to an error like:

```
IndexError: The shape of the mask [151665] at index 0 does not match the shape of the indexed tensor [2, 504, 151936] at index 2
```


I encountered the same issue when using the Scratch version of the NTL MWE.

My workaround was to add an explicit parameter that takes the true vocab_size the model uses at runtime, instead of inferring it from the tokenizer. See below:

[Screenshot: workaround with an explicit vocab_size parameter]
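The change itself is small; roughly this (again a sketch with hypothetical names, not the code from the screenshot):

```python
class NumberTokenLoss:  # sketch of the workaround, not the actual patch
    def __init__(self, tokenizer, vocab_size=None):
        self.tokenizer = tokenizer
        # Prefer an explicitly passed vocab size (taken from the model at runtime);
        # fall back to the tokenizer length only when nothing is provided.
        self.vocab_size = vocab_size if vocab_size is not None else len(tokenizer)
        # All subsequent masks/buffers are then allocated with self.vocab_size
        # instead of len(self.tokenizer).
```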

Currently, I am using the PyPI version (since it supports the Wasserstein distance), but this issue still needs to be fixed.


Suggestion:
Allow vocab_size to be explicitly passed into the NTL initialization (e.g., NumberTokenLoss(vocab_size=model.get_input_embeddings().weight.shape[0], ...))
to ensure compatibility with models where the tokenizer vocabulary length does not match the actual embedding matrix size.
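For reference, the call site would then look like this (the vocab_size keyword is the proposed addition; it does not exist in the released package yet):

```python
vocab_size = model.get_input_embeddings().weight.shape[0]  # 151936 for Qwen2.5
ntl = NumberTokenLoss(tokenizer=tokenizer, vocab_size=vocab_size)

# The internal number-token mask now has length 151936, matching logits.shape[-1],
# so the masked indexing that previously raised the IndexError goes through.
```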
