The sentencepiece README states that it normalizes via NFKC. The From the perspective of somebody just using
This script can convert tokenizer.model to text-based files (vocab.json and merges.txt):
`llama.cpp` doesn't implement any kind of Unicode normalization, so your output depends on the normalization of your input. And I would expect `llama_token_to_piece` to return UTF-8, yes.
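If you want consistent results regardless of how your input was encoded, one option is to normalize the text yourself before handing it to the tokenizer. Here is a minimal sketch using Python's standard `unicodedata` module. The helper name `normalize_for_tokenizer` is made up for illustration and is not part of llama.cpp, and plain NFKC may not match sentencepiece's own normalizer exactly, so treat it as an approximation.

```python
import unicodedata


def normalize_for_tokenizer(text: str) -> str:
    """Apply Unicode NFKC normalization.

    Hypothetical helper for illustration only; llama.cpp does not
    provide or call anything like this.
    """
    return unicodedata.normalize("NFKC", text)


composed = "\u00e9"      # "é" as a single code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent
ligature = "\ufb01"      # the "fi" ligature

# The two spellings of "é" are different byte sequences on input ...
print(composed.encode("utf-8") == decomposed.encode("utf-8"))

# ... but compare equal once both have been normalized.
print(normalize_for_tokenizer(composed) == normalize_for_tokenizer(decomposed))

# Compatibility characters such as ligatures are expanded by NFKC.
print(normalize_for_tokenizer(ligature))
```

Without a step like this, the composed and decomposed forms reach the tokenizer as different bytes and can produce different token sequences.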