I'm trying to understand the purpose of the `special` boolean. In both `main.cpp` and `server.cpp`, `s` or `buffer` ends up identical to my input string, yet although `special` is set differently in the two files, the generated output seems unaffected.
I tried setting ...
An easy way to understand the difference is to modify `tests/test-tokenizer-0.cpp` like this:

```diff
diff --git a/tests/test-tokenizer-0.cpp b/tests/test-tokenizer-0.cpp
index d3d21331..392e17a7 100644
--- a/tests/test-tokenizer-0.cpp
+++ b/tests/test-tokenizer-0.cpp
@@ -169,6 +169,29 @@ int main(int argc, char **argv) {
         }
     }
 
+    {
+        const std::string text = "<|im_start|>Hello World<|im_end|>";
+        printf("text: '%s'\n\n", text.c_str());
+
+        // tokenize with parse_special == false
+        const std::vector<llama_token> res = llama_tokenize(ctx, text, false, false);
+        printf("parse_special == false:\n");
+        for (const auto & tok : res) {
+            printf("\t%7d ('%s')\n", tok, llama_token_to_piece(ctx, tok).c_str());
+        }
+        printf("\n");
+
+        // tokenize with parse_special == true
+        const std::vector<llama_token> res2 = llama_tokenize(ctx, text, false, true);
+        printf("parse_special == true:\n");
+        for (const auto & tok : res2) {
+            printf("\t%7d ('%s')\n", tok, llama_token_to_piece(ctx, tok).c_str());
+        }
+        printf("\n");
+
+        exit(0);
+    }
+
 #ifdef _WIN32
     // We need this for unicode console support
     console::init(false, false);
```

Then build and run the test:

```sh
make -j tests && ./tests/test-tokenizer-0 models/qwen2-7b-instruct/ggml-model-f16.gguf
```
```
...
text: '<|im_start|>Hello World<|im_end|>'

parse_special == false:
         27 ('<')
         91 ('|')
        318 ('im')
       4906 ('_start')
         91 ('|')
         29 ('>')
       9707 ('Hello')
       4337 (' World')
         27 ('<')
         91 ('|')
        318 ('im')
       6213 ('_end')
         91 ('|')
         29 ('>')

parse_special == true:
     151644 ('<|im_start|>')
       9707 ('Hello')
       4337 (' World')
     151645 ('<|im_end|>')
```
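
To make the effect of the flag in the output above more concrete, here is a toy sketch. It is not llama.cpp's actual tokenizer: `toy_tokenize` and `k_special` are hypothetical names, the two special-token ids are simply copied from the Qwen2 output above, and ordinary text is emitted one byte per token, which a real BPE tokenizer would not do.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy vocabulary (assumption, not llama.cpp data structures).
// The ids are the ones printed for Qwen2 in the output above.
static const std::vector<std::pair<std::string, int>> k_special = {
    {"<|im_start|>", 151644},
    {"<|im_end|>",   151645},
};

// Toy tokenizer: when parse_special is true, special-token strings are
// matched as whole units and mapped to their ids. Otherwise (or for any
// other text) each byte becomes one "token" whose id is the byte value.
std::vector<int> toy_tokenize(const std::string & text, bool parse_special) {
    std::vector<int> out;
    std::size_t pos = 0;
    while (pos < text.size()) {
        bool matched = false;
        if (parse_special) {
            for (const auto & entry : k_special) {
                const std::string & piece = entry.first;
                // compare() returns 0 only if the text at pos equals piece
                if (text.compare(pos, piece.size(), piece) == 0) {
                    out.push_back(entry.second);
                    pos += piece.size();
                    matched = true;
                    break;
                }
            }
        }
        if (!matched) {
            // plain-text fallback: one token per byte
            out.push_back(static_cast<unsigned char>(text[pos]));
            pos += 1;
        }
    }
    return out;
}
```

With this sketch, `toy_tokenize("<|im_start|>Hi<|im_end|>", true)` returns 4 tokens (151644, 72, 105, 151645), while passing `false` returns 24 single-byte tokens, because the special-token strings are then treated as ordinary text.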
`parse_special = false` disables the use of special tokens during tokenization. This is useful when the text you want to tokenize contains the text of special tokens that should be treated literally (e.g. "the token 123 is identified by the string '<|im_start|>'"). The diff above modifies the `tests/test-tokenizer-0.cpp` program to show this; run it with a ChatML-based model such as Qwen2 Instruct.