
Commit 7f627d5

Add remark about performance impact of pre-tokenization
1 parent: e5c4cd9

1 file changed

crates/bpe/README.md

Lines changed: 4 additions & 0 deletions
@@ -284,6 +284,10 @@ The backtracking encoder, the fastest encoder that still returns correct results
 The fully dynamic programming solution and the heap implementation are still quite competitive to TikToken (especially for smaller inputs).
 If the requirement of correct BPE output can be relaxed, then the Greedy approach or the minimal encoding approach are the clear winners.
 
+An interesting observation here is that pre-tokenization slows down encoding quite a bit.
+Compared with the encoding benchmark above, the backtracking encoder without pre-tokenization is almost 4x faster than the one with pre-tokenization in this benchmark.
+This suggests that pre-tokenization is not necessary from a performance perspective, and that it is a good target for further optimization.
+
 ![encoding runtime comparison](./images/performance-comparison.svg)
 
 The graph below shows encoding results for input that is particularly challenging for tiktoken.
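
To make the added remark concrete, here is a minimal Rust sketch of the shape of that comparison: one backtracking pass over the whole input versus splitting the input into pieces first and encoding each piece separately. The `encode_via_backtracking` stand-in and the whitespace split are illustrative assumptions, not the bpe crate's actual API or pre-tokenization regex; the ~4x figure comes from the repository's own benchmarks, not from this toy.

```rust
// Minimal sketch of the comparison described in the diff above: time one
// backtracking pass over the full input vs. splitting into pre-token pieces
// first and encoding each piece separately.
//
// NOTE: `encode_via_backtracking` below is a stand-in, NOT the bpe crate's
// actual encoder, and the whitespace split is a placeholder for a real
// pre-tokenization regex.
use std::time::Instant;

// Stand-in encoder; the real backtracking encoder lives in crates/bpe.
fn encode_via_backtracking(input: &[u8]) -> Vec<u32> {
    input.iter().map(|&b| b as u32).collect()
}

fn main() {
    let text = "some representative input ".repeat(10_000);

    // Without pre-tokenization: a single pass over the whole input.
    let start = Instant::now();
    let plain = encode_via_backtracking(text.as_bytes());
    let t_without = start.elapsed();

    // With pre-tokenization: split into pieces first (real tokenizers use a
    // much more elaborate regex), then encode each piece on its own.
    let start = Instant::now();
    let pre: Vec<u32> = text
        .split_inclusive(' ')
        .flat_map(|piece| encode_via_backtracking(piece.as_bytes()))
        .collect();
    let t_with = start.elapsed();

    println!("without pre-tokenization: {t_without:?} ({} tokens)", plain.len());
    println!("with pre-tokenization:    {t_with:?} ({} tokens)", pre.len());
}
```

Even in this toy setup, the pre-tokenized path pays for the extra splitting pass and the per-piece call overhead, which is the kind of cost the added paragraph flags as an optimization target.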
