
Understanding the Role of Special Tokens and Parsing in llama_tokenize #9379

Answered by ggerganov
cyannnne asked this question in Q&A

parse_special = false disables the parsing of special tokens during tokenization. This is useful when the text you want to tokenize contains the literal text of a special token (e.g. "the token 123 is identified by the string '<|im_start|>'").

An easy way to understand the difference is to modify the tests/test-tokenizer-0.cpp program like this and run it with a ChatML-based model such as Qwen2 Instruct:

diff --git a/tests/test-tokenizer-0.cpp b/tests/test-tokenizer-0.cpp
index d3d21331..392e17a7 100644
--- a/tests/test-tokenizer-0.cpp
+++ b/tests/test-tokenizer-0.cpp
@@ -169,6 +169,29 @@ int main(int argc, char **argv) {
         }
     }
 
+    {
+        const std::string text = "<|im_start|…
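
The quoted diff is cut off here. As a rough illustration of the same comparison, the standalone sketch below tokenizes a ChatML-style string twice, once with parse_special = true and once with parse_special = false, and prints the resulting token IDs. It assumes the llama.h C API roughly as of this discussion (llama_tokenize and llama_load_model_from_file taking a llama_model pointer); names and signatures may differ in other versions, so treat it as a sketch rather than the test added in the actual diff.

// tokenize-compare.cpp -- sketch only; build and link against llama.cpp
#include "llama.h"

#include <cstdio>
#include <string>
#include <vector>

// Tokenize `text` and return the resulting token IDs.
static std::vector<llama_token> tokenize(const llama_model * model, const std::string & text, bool parse_special) {
    // text.size() + 16 tokens is always enough headroom here; a negative
    // return value from llama_tokenize would report the required size.
    std::vector<llama_token> tokens(text.size() + 16);
    const int n = llama_tokenize(model, text.c_str(), (int) text.size(),
                                 tokens.data(), (int) tokens.size(),
                                 /*add_special=*/false, parse_special);
    tokens.resize(n > 0 ? n : 0);
    return tokens;
}

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <model.gguf>\n", argv[0]);
        return 1;
    }

    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    mparams.vocab_only = true; // only the tokenizer is needed

    llama_model * model = llama_load_model_from_file(argv[1], mparams);
    if (model == nullptr) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    const std::string text = "<|im_start|>user\nhello<|im_end|>";

    const bool modes[2] = { true, false };
    for (bool parse_special : modes) {
        const auto tokens = tokenize(model, text, parse_special);
        printf("parse_special = %-5s -> %zu tokens:", parse_special ? "true" : "false", tokens.size());
        for (llama_token t : tokens) {
            printf(" %d", t);
        }
        printf("\n");
    }

    llama_free_model(model);
    llama_backend_free();
    return 0;
}

With a ChatML-based model such as Qwen2 Instruct, the parse_special = true run should collapse "<|im_start|>" and "<|im_end|>" into single token IDs, while the parse_special = false run should split them into several ordinary text tokens.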
