Disable "normalizer" from tokenizer.json #6856
Unanswered
konokonekonoko asked this question in Q&A
Replies: 0 comments
Hey, so I've noticed that when using the […] and /tokenize endpoints with mistral-7b models, a space gets prepended to the content. E.g. tokenizing "The" returns the ID for " The" (with a leading space), and subsequently, trying to tokenize " The" actually returns the IDs for "  The" (two leading spaces), which is a real headache.

After digging around for quite a while, I noticed that the tokenizer.json file that's included with the .safetensors weights has the following code:

[…]

I was wondering if this was the cause of my problems, and if it is, whether there is any way to disable this normalization step for the /tokenize endpoint in llama.cpp.
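For anyone wanting to check the same thing, here is a rough Python sketch that looks through the "normalizer" entry of a tokenizer.json file for steps that prepend a string. It assumes the Hugging Face tokenizers JSON layout, where a step has a "type" field, a "Sequence" step holds a "normalizers" list, and a "Prepend" step holds a "prepend" string. The function names and the default file path are made up for illustration.

```python
import json
import sys


def find_prepend_steps(normalizer):
    """Return the strings that any "Prepend" normalizer step would add.

    `normalizer` is the value of the "normalizer" key in tokenizer.json
    (assumed layout, see the note above). "Sequence" steps are searched
    recursively; every other step type is ignored.
    """
    if not normalizer:
        return []
    step_type = normalizer.get("type")
    if step_type == "Sequence":
        found = []
        for step in normalizer.get("normalizers", []):
            found.extend(find_prepend_steps(step))
        return found
    if step_type == "Prepend":
        return [normalizer.get("prepend", "")]
    return []


def prepend_steps_in_file(path):
    """Load a tokenizer.json file and list its prepended strings."""
    with open(path, encoding="utf-8") as f:
        config = json.load(f)
    return find_prepend_steps(config.get("normalizer"))


if __name__ == "__main__":
    # Hypothetical default path; pass the real one as the first argument.
    path = sys.argv[1] if len(sys.argv) > 1 else "tokenizer.json"
    print(prepend_steps_in_file(path))
```

A non-empty result only shows that the Hugging Face file asks for a prefix. Whether the /tokenize endpoint in llama.cpp reads this field at all is a separate question that this check does not answer.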