Commit 8f16b7f

Fix incorrect benchmark text
1 parent ec07a42 commit 8f16b7f

File tree

1 file changed: +3 -5 lines changed

crates/bpe/README.md

Lines changed: 3 additions & 5 deletions
@@ -221,7 +221,7 @@ Two additional encoders are included that are faster but deviate from the origin
 The benchmark measured the runtime of encoding of slices of lengths 10, 100, 1000, and 10000 from a random 20000 token original text using the o200k token set.
 (All encodings were computed from scratch for each slice.)
 
-Be aware that this benchmark none of the tokenizers pre-tokenize the input.
+Be aware that in this benchmark none of the tokenizers (ours or Huggingface's) pre-tokenize the input as is normally done for o200k.
 It therefore shows the true performance characteristics of the encoding logic itself.
 Unfortunately tiktoken does not allow us to disable pre-tokenization, which is why it is not included.
 Below we have a comparison with pre-tokenization that includes tiktoken as well.
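For context, a minimal sketch of the kind of harness this hunk describes, using the criterion crate. The `encode` function is a hypothetical stand-in for whichever encoder is being measured (it is not the bpe crate's actual API), and slices are taken by byte length here rather than by token count as in the real benchmark.

```rust
// Minimal benchmark sketch: encode slices of several lengths, each from
// scratch. Lives in benches/ with `harness = false` in Cargo.toml.
// `encode` is a hypothetical stand-in for the encoder under test.
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn encode(input: &str) -> Vec<u32> {
    // Stand-in: replace with the encoder you want to measure.
    input.bytes().map(u32::from).collect()
}

fn bench_slices(c: &mut Criterion) {
    // Stand-in for the random 20000-token original text.
    let text: String = "lorem ipsum ".repeat(10_000);
    for len in [10, 100, 1_000, 10_000] {
        let slice = &text[..len];
        c.bench_function(&format!("encode_from_scratch_{len}"), |b| {
            // Each encoding is computed from scratch for each slice.
            b.iter(|| encode(black_box(slice)))
        });
    }
}

criterion_group!(benches, bench_slices);
criterion_main!(benches);
```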
@@ -280,10 +280,8 @@ It is therefore difficult to judge the performance differences of the BPE logic
 It does give a good indication of how the algorithms might perform in practice.
 
 The graph below shows encoding runtime vs slice length.
-All encoders (except the heap encoder) show the expected linear runtime complexity.
-The backtracking encoder, the fastest encoder that still returns correct results, shows a performance gain of approximately 3.5x compared to tiktoken.
-The fully dynamic programming solution and the heap implementation are still quite competitive to TikToken (especially for smaller inputs).
-If the requirement of correct BPE output can be relaxed, then the Greedy approach or the minimal encoding approach are the clear winners.
+All encoders show a similar runtime complexity.
+The backtracking encoder and tiktoken have comparable performance, and both are about 3.5--4x faster than the Huggingface encoder.
 
 An interesting observation here is that pre-tokenization slows down encoding quite a bit.
 Compared with the encoding benchmark above, the backtracking encoder without pre-tokenization is almost 4x faster than the one with pre-tokenization in this benchmark.
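To make the pre-tokenization overhead concrete: pre-tokenizing splits the input with a regex and encodes each piece independently, adding regex scanning and many small encode calls on top of the raw encoding work. A minimal sketch, reusing the same hypothetical `encode` stand-in and a deliberately simplified pattern (the real o200k pattern is far more involved and requires a look-ahead capable regex engine):

```rust
// Sketch of encoding with pre-tokenization: split the input with a regex,
// then encode each piece independently and concatenate the results.
// The pattern is a simplified stand-in, NOT the real o200k pattern.
use regex::Regex;

fn encode(piece: &str) -> Vec<u32> {
    // Hypothetical stand-in for the encoder under test.
    piece.bytes().map(u32::from).collect()
}

fn encode_with_pretokenization(text: &str) -> Vec<u32> {
    // Compiled per call for brevity; a real harness would reuse the regex.
    let pattern = Regex::new(r"\s+|\S+").expect("valid pattern");
    pattern
        .find_iter(text)
        .flat_map(|m| encode(m.as_str())) // one encode call per piece
        .collect()
}

fn main() {
    let tokens = encode_with_pretokenization("hello world");
    println!("{} tokens", tokens.len());
}
```

This is why the first benchmark, which disables pre-tokenization, isolates the performance of the encoding logic itself.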
