How to use tiktoken as a tokenizer? #133
Unanswered
AlanLu0808 asked this question in Q&A
Replies: 1 comment
-
Any tokenizer should work, as long as its output matches the format produced by the simple tokenizer provided here. I would suggest trying out tiktoken by replacing the tokenizer in this example: https://github.com/xhluca/bm25s/blob/main/examples/retrieve_with_numba_hf.py. Feel free to make a PR and add your own example!
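For anyone who wants a starting point, here is a rough, untested-against-the-linked-example sketch of how tiktoken could be plugged in. It assumes that `bm25s.BM25.index` and `retrieve` accept documents as lists of string tokens, and it uses the `cl100k_base` encoding purely as an example choice. The helper names (`encode_as_str_tokens`, `build_retriever`, `search`) are made up for illustration and are not part of bm25s or tiktoken. Token IDs are kept as strings rather than decoded, since a single BPE token does not always decode to valid UTF-8 text (common with Chinese).

```python
from typing import Iterable, List, Protocol


class Encoder(Protocol):
    """Anything with an encode(text) -> list of ints method, e.g. a tiktoken encoding."""

    def encode(self, text: str) -> List[int]: ...


def encode_as_str_tokens(texts: Iterable[str], encoder: Encoder) -> List[List[str]]:
    """Turn each text into a list of string tokens, one per token ID."""
    return [[str(token_id) for token_id in encoder.encode(text)] for text in texts]


def build_retriever(corpus: List[str], encoding_name: str = "cl100k_base"):
    """Index a corpus with BM25 using tiktoken token IDs as the vocabulary."""
    # Imported lazily so the helper above stays usable without these packages.
    import bm25s
    import tiktoken

    encoder = tiktoken.get_encoding(encoding_name)  # example encoding, an assumption
    retriever = bm25s.BM25()
    # Assumption: index() accepts a list of documents, each a list of str tokens.
    retriever.index(encode_as_str_tokens(corpus, encoder))
    return retriever, encoder


def search(retriever, encoder: Encoder, query: str, k: int = 2):
    """Tokenize the query the same way as the corpus and retrieve the top k."""
    query_tokens = encode_as_str_tokens([query], encoder)
    # With no corpus passed, the results hold document indices and scores.
    return retriever.retrieve(query_tokens, k=k)
```

Usage would look roughly like `retriever, enc = build_retriever(docs)` followed by `search(retriever, enc, "some query", k=2)`, with `k` no larger than the number of documents. The important part is that the corpus and the queries go through the same encoder, otherwise the token IDs will not line up.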
-
Hi,
I am using your project and noticed that the current tokenizer only works well with English text. When I try to use it with Chinese (or other non-English languages), the results are not satisfactory.
I would like to know how tiktoken can be used as the tokenizer instead.
My use case involves a lot of multilingual text, so having a better tokenizer (like tiktoken, which handles multilingual text well) would be very helpful.
Thank you for your work!