A special token '\u0000' will cause an assert error in 'llm_load_vocab' #5111
SolenoidWGT asked this question in Q&A · Unanswered · 0 replies
I'm trying to adapt an InternLM2 model for llama.cpp, but I get an assertion error when running inference; the error stack is below. The llama.cpp commit is 77bc1bb.
I checked further and found that the problematic token is `\u0000` in the InternLM2 vocabulary. It is converted by `codepoints_from_utf8` into the string `"\u0000"`, which corresponds to the string terminator in C, so `word` ends up with size 0 and the assertion fails (I added some debug code, so the actual failing line in my build is llama.cpp:3053).

I tried commenting out the assertion at llama.cpp:3053, and the model then ran normally without any other errors. So I would like to ask about the significance of this assertion: can its condition be relaxed? If the assertion can't be removed, I'd appreciate advice on how to work around it. Thanks.
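For illustration, here is a minimal C++ sketch of the failure mode as I understand it, assuming the NUL-containing token text is read back through a C-string interface at some point during vocab loading (the `assert` below is only a hypothetical stand-in for the `GGML_ASSERT` at llama.cpp:3053):

```cpp
#include <cassert>
#include <cstdio>
#include <string>

int main() {
    // The U+0000 token's text as raw bytes: a single NUL byte.
    const char raw[] = "\0";

    // Constructing a std::string from a const char* stops at the first
    // NUL terminator, so the resulting word is empty.
    std::string word(raw);
    std::printf("word.size() = %zu\n", word.size()); // prints 0

    // The vocab loader asserts that every token decodes to at least one
    // codepoint; for this token the check fails because word is empty.
    assert(!word.empty()); // fires for the U+0000 token
    return 0;
}
```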
I searched and found a discussion similar to my problem, but I didn't get much information from it.
Here is my sys & env info.