How to use tiktoken as a tokenizer? #133

AlanLu0808 · 2025-05-20T03:36:36Z

Hi,

I am using your project and noticed that the current tokenizer only works well with English text. When I try to use it with Chinese (or other non-English languages), the results are not satisfactory.

I would like to know:

1. Is there a way to use tiktoken as the tokenizer in this project?
2. If not, are there plans to support tiktoken or improve non-English language support in the tokenizer?

My use case involves a lot of multilingual text, so having a better tokenizer (like tiktoken, which handles multilingual text well) would be very helpful.

Thank you for your work!

xhluca (Maintainer) · 2025-05-20T23:11:41Z

Any tokenizer should work, as long as its output matches the format produced by the simple tokenizer provided here. I would suggest trying out tiktoken by replacing the tokenizer in this example: https://github.com/xhluca/bm25s/blob/main/examples/retrieve_with_numba_hf.py

Feel free to open a PR and add your own example!
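
As a starting point, here is a minimal sketch of what swapping in tiktoken could look like. It is not taken from the linked example and has not been checked against it. It assumes that `BM25.index` and `BM25.retrieve` accept a list of token lists (one list of strings per document), and it turns each tiktoken ID into a string rather than decoding it, since a single BPE token can cover only part of a multi-byte character in Chinese text. The helper names (`tokenize_with_encoder`, `build_retriever`, `search`) and the choice of the `cl100k_base` encoding are illustrative assumptions.

```python
from typing import Iterable, List, Protocol


class Encoder(Protocol):
    """Anything with an encode(text) -> list of int method, e.g. a tiktoken encoding."""

    def encode(self, text: str) -> List[int]: ...


def load_tiktoken_encoding(name: str = "cl100k_base"):
    """Load a tiktoken encoding. The encoding name is an assumption; pick the one you need."""
    import tiktoken  # imported lazily so the helpers below work without it

    return tiktoken.get_encoding(name)


def tokenize_with_encoder(texts: Iterable[str], encoding: Encoder) -> List[List[str]]:
    """Turn each text into a list of string tokens, one string per token ID.

    IDs are stringified instead of decoded because a single BPE token
    may not be valid UTF-8 on its own.
    """
    return [[str(token_id) for token_id in encoding.encode(text)] for text in texts]


def build_retriever(corpus: List[str], encoding: Encoder):
    """Index a corpus with bm25s using the custom tokens (hypothetical helper)."""
    import bm25s

    retriever = bm25s.BM25()
    retriever.index(tokenize_with_encoder(corpus, encoding))
    return retriever


def search(retriever, query: str, encoding: Encoder, k: int = 2):
    """Return (document indices, scores) for one query; k must not exceed the corpus size."""
    query_tokens = tokenize_with_encoder([query], encoding)
    doc_ids, scores = retriever.retrieve(query_tokens, k=k)
    return doc_ids[0], scores[0]
```

With both packages installed, usage would be along the lines of `enc = load_tiktoken_encoding()`, then `retriever = build_retriever(corpus, enc)` and `search(retriever, "your query", enc)`. The same encoding has to be used for the corpus and the queries, otherwise the token strings will not match. No stopword removal or stemming is applied in this sketch, so results may differ from the default tokenizer on English text.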
