crates/bpe/README.md (3 additions, 5 deletions)
@@ -221,7 +221,7 @@ Two additional encoders are included that are faster but deviate from the origin
 The benchmark measured the runtime of encoding of slices of lengths 10, 100, 1000, and 10000 from a random 20000 token original text using the o200k token set.
 (All encodings were computed from scratch for each slice.)
 
-Be aware that this benchmark none of the tokenizers pre-tokenize the input.
+Be aware that in this benchmark none of the tokenizers (ours or Huggingface's) pre-tokenize the input as is normally done for o200k.
 It therefore shows the true performance characteristics of the encoding logic itself.
 Unfortunately tiktoken does not allow us to disable pre-tokenization, which is why it is not included.
 Below we have a comparison with pre-tokenization that includes tiktoken as well.
@@ -280,10 +280,8 @@ It is therefore difficult to judge the performance differences of the BPE logic
 It does give a good indication of how the algorithms might perform in practice.
 
 The graph below shows encoding runtime vs slice length.
-All encoders (except the heap encoder) show the expected linear runtime complexity.
-The backtracking encoder, the fastest encoder that still returns correct results, shows a performance gain of approximately 3.5x compared to tiktoken.
-The fully dynamic programming solution and the heap implementation are still quite competitive to TikToken (especially for smaller inputs).
-If the requirement of correct BPE output can be relaxed, then the Greedy approach or the minimal encoding approach are the clear winners.
+All encoders show a similar runtime complexity.
+The backtracking encoder and tiktoken have comparable performance, and both are about 3.5--4x faster than the Huggingface encoder.
 
 An interesting observation here is that pre-tokenization slows down encoding quite a bit.
 Compared with the encoding benchmark above, the backtracking encoder without pre-tokenization is almost 4x faster than the one with pre-tokenization in this benchmark.
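For context on what "with" vs. "without" pre-tokenization means in the benchmark text above, the following is a minimal, illustrative Rust sketch. It is not the crate's actual API: `encode_bytes` is a placeholder for any raw BPE encoder (for example the backtracking encoder), and the regex is a deliberately simplified stand-in for the real o200k pre-tokenization pattern.

```rust
// Illustrative sketch only: `encode_bytes` stands in for a raw BPE encoder,
// not this crate's real signature. Requires the `regex` crate.
use regex::Regex;

/// "Without pre-tokenization": the whole input goes to the BPE encoder at once.
/// This is the setting measured by the encoding benchmark discussed above.
fn encode_raw(text: &str, encode_bytes: &dyn Fn(&[u8]) -> Vec<u32>) -> Vec<u32> {
    encode_bytes(text.as_bytes())
}

/// "With pre-tokenization": split the input into pieces first (the pattern here
/// is a simplified stand-in for the o200k pattern) and BPE-encode each piece.
fn encode_pretokenized(text: &str, encode_bytes: &dyn Fn(&[u8]) -> Vec<u32>) -> Vec<u32> {
    let pattern = Regex::new(r"\s+|\S+").expect("valid pattern");
    let mut tokens = Vec::new();
    for piece in pattern.find_iter(text) {
        tokens.extend(encode_bytes(piece.as_str().as_bytes()));
    }
    tokens
}
```

The comparison without tiktoken corresponds to the `encode_raw` setting; the comparison with pre-tokenization mentioned further up corresponds to the `encode_pretokenized` setting.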