Understanding the Role of Special Tokens and Parsing in llama_tokenize #9379

cyannnne · 2024-09-09T03:28:37Z

cyannnne
Sep 9, 2024

// main.cpp
const auto line_inp = ::llama_tokenize(ctx, buffer, false, false);
// server.cpp
prompt_tokens = ::llama_tokenize(ctx, s, add_special, TMP_FORCE_SPECIAL);
// where, add_special = true and TMP_FORCE_SPECIAL = true

I'm trying to understand the purpose of the special boolean. In both main.cpp and server.cpp, s or buffer will be the same as my input string, yet despite special being set differently in both files, the generated output seems unaffected.
I assume "special" refers to tokens like <bos>, <eos>, or <|im_start|>, and setting add_special = true adds <bos> to the start of the input string.
My question is: what does parse_special do, and when should it be used? If parse_special = true, should the input string follow this format?

<|im_start|>system
{system prompt}
<|im_end|>
<|im_start|>user
{user input}
<|im_end|>
<|im_start|>assistant

I tried setting parse_special = true and inputting the string in this format, but the response included a special token (<|im_end|>), which confuses me a lot.

ggerganov · 2024-09-09T07:24:49Z

ggerganov
Sep 9, 2024
Maintainer

With parse_special = false, the tokenizer does not match special tokens in the input; their text is tokenized as plain text. This is useful when the text that you want to tokenize includes the text of special tokens (e.g. "the token 123 is identified by the string '<|im_start|>'").

An easy way to understand the difference is to modify the tests/test-tokenizer-0.cpp program like this and run it with a ChatML-based model such as Qwen2 Instruct:

diff --git a/tests/test-tokenizer-0.cpp b/tests/test-tokenizer-0.cpp
index d3d21331..392e17a7 100644
--- a/tests/test-tokenizer-0.cpp
+++ b/tests/test-tokenizer-0.cpp
@@ -169,6 +169,29 @@ int main(int argc, char **argv) {
         }
     }
 
+    {
+        const std::string text = "<|im_start|>Hello World<|im_end|>";
+        printf("text: '%s'\n\n", text.c_str());
+
+        // tokenize with parse_special == false
+        const std::vector<llama_token> res = llama_tokenize(ctx, text, false, false);
+        printf("parse_special == false:\n");
+        for (const auto & tok : res) {
+            printf("\t%7d ('%s')\n", tok, llama_token_to_piece(ctx, tok).c_str());
+        }
+        printf("\n");
+
+        // tokenize with parse_special == true
+        const std::vector<llama_token> res2 = llama_tokenize(ctx, text, false, true);
+        printf("parse_special  == true:\n");
+        for (const auto & tok : res2) {
+            printf("\t%7d ('%s')\n", tok, llama_token_to_piece(ctx, tok).c_str());
+        }
+        printf("\n");
+
+        exit(0);
+    }
+
 #ifdef _WIN32
     // We need this for unicode console support
     console::init(false, false);

make -j tests && ./tests/test-tokenizer-0 models/qwen2-7b-instruct/ggml-model-f16.gguf 

...

text: '<|im_start|>Hello World<|im_end|>'

parse_special == false:
	    27 ('<')
	    91 ('|')
	   318 ('im')
	  4906 ('_start')
	    91 ('|')
	    29 ('>')
	  9707 ('Hello')
	  4337 (' World')
	    27 ('<')
	    91 ('|')
	   318 ('im')
	  6213 ('_end')
	    91 ('|')
	    29 ('>')

parse_special  == true:
	151644 ('<|im_start|>')
	  9707 ('Hello')
	  4337 (' World')
	151645 ('<|im_end|>')
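
The difference in the output above can be mimicked with a small standalone sketch. This is a toy model, not llama.cpp's tokenizer: ToyPiece, ToyVocab, toy_chatml_vocab and toy_tokenize are hypothetical names, plain text is kept as one unsplit run (id -1) instead of being split by BPE, and the "<bos>" handling only assumes that add_special prepends a BOS token when the vocabulary defines one. The two special-token ids are the ones printed above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// One output piece of the toy tokenizer.
struct ToyPiece {
    int id;            // special-token id, or -1 for a run of plain text
    std::string text;  // the text covered by this piece
};

// Hypothetical vocabulary: only special tokens and an optional BOS id.
struct ToyVocab {
    std::vector<std::pair<std::string, int>> specials; // text -> id
    int bos_id = -1;                                   // -1 means "no BOS token"
};

// Toy vocabulary using the two ids shown in the test output.
ToyVocab toy_chatml_vocab() {
    ToyVocab vocab;
    vocab.specials = {{"<|im_start|>", 151644}, {"<|im_end|>", 151645}};
    return vocab;
}

std::vector<ToyPiece> toy_tokenize(const ToyVocab & vocab, const std::string & text,
                                   bool add_special, bool parse_special) {
    std::vector<ToyPiece> out;
    // Assumption: add_special prepends BOS only if the vocabulary has one.
    if (add_special && vocab.bos_id >= 0) {
        out.push_back({vocab.bos_id, "<bos>"});
    }
    std::string plain; // plain text collected since the last special token
    std::size_t pos = 0;
    while (pos < text.size()) {
        bool matched = false;
        if (parse_special) {
            for (const auto & sp : vocab.specials) {
                assert(!sp.first.empty());
                if (text.compare(pos, sp.first.size(), sp.first) == 0) {
                    if (!plain.empty()) {
                        out.push_back({-1, plain});
                        plain.clear();
                    }
                    out.push_back({sp.second, sp.first}); // one piece per special token
                    pos += sp.first.size();
                    matched = true;
                    break;
                }
            }
        }
        if (!matched) {
            plain += text[pos++]; // parse_special == false always ends up here
        }
    }
    if (!plain.empty()) {
        out.push_back({-1, plain});
    }
    return out;
}
```

With the text from the test, toy_tokenize(toy_chatml_vocab(), text, false, false) returns a single plain-text run that still contains the literal "<|im_start|>" characters, while passing true for parse_special returns three pieces: id 151644, the run "Hello World", and id 151645.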

5 replies

cyannnne Sep 9, 2024
Author

Thank you so much for your detailed reply; I really appreciate it!
So why does the response contain the EOT token at the end?

I was testing the gemma model, but it occasionally includes the EOT token at the end of the response. Should I manually check for and remove that EOT?
The EOG token is not included unless I set special=true in llama_token_to_piece.

std::string llama_token_to_piece(const llama_context *ctx, llama_token token, bool special = true)

ggerganov Sep 9, 2024
Maintainer

It's hard to say - what model and commands are you using? Sometimes the model can have misconfigured tokens in the tokenizer.json.

cyannnne Sep 9, 2024
Author

I'm using gemma-2b-it-q4_k_m.gguf from here (https://huggingface.co/lmstudio-ai/gemma-2b-it-GGUF/tree/main), and the commands I was testing are:

./server -m ~/Downloads/gemma-2b-it-q4_k_m.gguf -cnv --chat-template gemma
./main -m ~/Downloads/gemma-2b-it-q4_k_m.gguf -cnv --chat-template gemma

I set parse_special=true in main.cpp before running this test.

"Sometimes the model can have misconfigured tokens in the tokenizer.json."

Maybe this is the file that I need to check.

Thank you for your help!

ggerganov Sep 9, 2024
Maintainer

You can simply use the "GGUF Editor" (shoutout to @CISC):

https://huggingface.co/spaces/CISCai/gguf-editor

I checked the type of the <end_of_turn> token, and in this model it is misconfigured as NORMAL instead of CONTROL.

You can try to generate a gemma GGUF file from scratch yourself, using the convert scripts in this repository or the gguf-my-repo tool. This should produce the correct vocabulary and token types.
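
Why the token type matters for the printed response can be sketched with a toy detokenizer. This is an illustration under assumptions, not llama.cpp's code: ToyTokenType, ToyToken, toy_token_to_piece and toy_detokenize are hypothetical names, and the sketch only assumes that a CONTROL token is rendered as text when special is true, while a NORMAL token is always rendered.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical subset of token types, named after the two types mentioned above.
enum class ToyTokenType { NORMAL, CONTROL };

struct ToyToken {
    std::string text;
    ToyTokenType type;
};

// Render one token. CONTROL tokens produce text only when `special` is true.
std::string toy_token_to_piece(const ToyToken & tok, bool special) {
    assert(!tok.text.empty()); // every toy vocabulary entry carries some text
    if (tok.type == ToyTokenType::CONTROL && !special) {
        return "";
    }
    return tok.text;
}

// Concatenate the rendered pieces of a generated token sequence.
std::string toy_detokenize(const std::vector<ToyToken> & toks, bool special) {
    std::string out;
    for (const auto & tok : toks) {
        out += toy_token_to_piece(tok, special);
    }
    return out;
}
```

In this sketch, a response ending in <end_of_turn> prints without that marker when the token is typed CONTROL and special is false, but prints it verbatim when the same token is typed NORMAL, which matches the symptom described in this thread.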

cyannnne Sep 9, 2024
Author

Thank you very much! It’s truly a great help.
