Reorganize benchmark to include fairer comparisons #27

Merged

merged 14 commits into from Oct 14, 2024

6 changes: 6 additions & 0 deletions crates/bpe-openai/src/lib.rs
@@ -42,6 +42,12 @@ static BPE_O200K: LazyLock<Tokenizer> = LazyLock::new(|| {

pub use bpe::*;

/// A byte-pair encoding tokenizer that supports a pre-tokenization regex.
/// The direct methods on this type pre-tokenize the input text and should
/// produce the same output as the tiktoken tokenizers. The type gives access
/// to the regex and underlying byte-pair encoding if needed. Note that using
/// the byte-pair encoding directly does not take the regex into account and
/// may result in output that differs from tiktoken.
pub struct Tokenizer {
/// The byte-pair encoding for this tokenizer.
pub bpe: BytePairEncoding,
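
For illustration, here is a rough usage sketch of the distinction the doc comment draws. The accessor `o200k()` and the methods `encode`, `count`, and `encode_via_backtracking` are assumed names for the crate's surface, not taken from this diff; only the `bpe` field is shown above.

```rust
// Sketch only: accessor and method names below are assumptions, not the
// crate's confirmed API.
fn sketch() {
    // Assumed accessor returning the &'static Tokenizer behind BPE_O200K.
    let tok = bpe_openai::o200k();

    // Methods on `Tokenizer` apply the pre-tokenization regex first and
    // should therefore agree with tiktoken.
    let tokens = tok.encode("To be or not to be");
    assert_eq!(tok.count("To be or not to be"), tokens.len());

    // Going through the `bpe` field bypasses the regex, so the output can
    // differ from tiktoken wherever pre-tokenization would split the input.
    let _raw = tok.bpe.encode_via_backtracking(b"To be or not to be");
}
```
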
13 changes: 6 additions & 7 deletions crates/bpe/README.md
@@ -30,11 +30,12 @@ The comparison with the Rust tiktoken implementation is more subtle, because pre

## Prior Art

- There are mostly three strategies for BPE encoding.
+ There are mostly two strategies for BPE encoding.

1) Trivial solution. Search brute force for the most frequent pair in the encoded text according to the dictionary and replace those occurrences. This has `O(n^2)` complexity and is therefore not very appealing in production.
2) Heap based. Set up a heap with the frequencies. This improves the linear search time to a logarithmic factor. If done properly, the overall complexity reduces to `O(n log n)`.
- 3) Split the input into sections of a maximum size first and then process each section individually. This shrinks in theory the complexity to `O(n)` if the section size is small enough. But it will in general produce now different results. In order to produce the "correct" encoding, one would need to choose split points at token boundaries. But without having the text encoded already, this is in general impossible. (Note that tiktoken as well as other tokenizers often split the input as part of pre-tokenization to improve model performance.)

+ Note that many tokenizers split the input into sections and then process each section individually. This in theory shrinks the complexity to `O(n)` if the section size is small enough, but it will in general produce different results. In order to produce the "correct" encoding, one would need to choose split points at token boundaries. But without having the text encoded already, this is in general impossible. Input splitting is therefore not a viable strategy for improving encoding performance.

We have implemented a fast heap-based solution as a baseline. It uses a bitfield to mark token boundaries. This is more memory efficient than using linked lists or other approaches and should also be faster.
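
For illustration, a rough sketch of the heap-plus-boundary-marker idea (not the crate's actual implementation): a `Vec<bool>` stands in for the bitfield, `ranks` is assumed to map token bytes to merge priority, and the linear neighbor scans would be replaced by faster bitfield queries in a real encoder.

```rust
use std::cmp::Reverse;
use std::collections::{BinaryHeap, HashMap};

/// Byte index of the next token start after `i`, or the text length if none.
fn next_start(is_start: &[bool], i: usize) -> usize {
    let mut j = i + 1;
    while j < is_start.len() && !is_start[j] {
        j += 1;
    }
    j
}

/// If the token starting at `start` and its right neighbor concatenate to a
/// known token, push that merge candidate onto the heap.
fn push_pair(
    heap: &mut BinaryHeap<Reverse<(u32, usize)>>,
    is_start: &[bool],
    text: &[u8],
    ranks: &HashMap<Vec<u8>, u32>,
    start: usize,
) {
    let mid = next_start(is_start, start);
    if mid >= text.len() {
        return;
    }
    let end = next_start(is_start, mid);
    if let Some(&rank) = ranks.get(&text[start..end]) {
        heap.push(Reverse((rank, start)));
    }
}

/// Heap-based BPE over a token-boundary "bitfield": lower rank merges first,
/// and each token is assumed to have a distinct rank.
fn encode_heap<'a>(text: &'a [u8], ranks: &HashMap<Vec<u8>, u32>) -> Vec<&'a [u8]> {
    let n = text.len();
    // is_start[i] == true means a token currently starts at byte i.
    let mut is_start = vec![true; n];
    let mut heap: BinaryHeap<Reverse<(u32, usize)>> = BinaryHeap::new();
    for i in 0..n {
        push_pair(&mut heap, &is_start, text, ranks, i);
    }
    while let Some(Reverse((rank, start))) = heap.pop() {
        // Lazy deletion: skip entries that no longer describe a valid pair.
        let mid = next_start(&is_start, start);
        if !is_start[start] || mid >= n {
            continue;
        }
        let end = next_start(&is_start, mid);
        if ranks.get(&text[start..end]) != Some(&rank) {
            continue;
        }
        // Merge: the token starting at `mid` is absorbed into the one at
        // `start`, then new candidates with both neighbors are pushed.
        is_start[mid] = false;
        if let Some(prev) = (0..start).rev().find(|&i| is_start[i]) {
            push_pair(&mut heap, &is_start, text, ranks, prev);
        }
        push_pair(&mut heap, &is_start, text, ranks, start);
    }
    // Read the tokens back off the boundary markers.
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < n {
        let j = next_start(&is_start, i);
        tokens.push(&text[i..j]);
        i = j;
    }
    tokens
}
```
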

@@ -221,7 +222,7 @@ Two additional encoders are included that are faster but deviate from the origin
The benchmark measured the runtime of encoding slices of lengths 10, 100, 1000, and 10000 taken from a random 20000-token original text, using the o200k token set.
(All encodings were computed from scratch for each slice.)

- Be aware that this benchmark none of the tokenizers pre-tokenize the input.
+ Be aware that in this benchmark none of the tokenizers (ours or Huggingface's) pre-tokenize the input as is normally done for o200k.
It therefore shows the true performance characteristics of the encoding logic itself.
Unfortunately tiktoken does not allow us to disable pre-tokenization, which is why it is not included.
Below we have a comparison with pre-tokenization that includes tiktoken as well.
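
As a rough sketch of that setup (not the benchmark code from this PR), each encoder can be timed on slices of a fixed input, re-encoding every slice from scratch; the `encode` closure below stands in for whichever encoder is measured, and a real benchmark would use a harness such as criterion instead of a single timing run.

```rust
use std::time::Instant;

// Minimal sketch of the slice benchmark described above. The real benchmark
// draws slices of 10/100/1000/10000 tokens from a random 20000-token o200k
// text; byte-length slices are used here to keep the sketch self-contained.
fn bench_slices(text: &[u8], encode: impl Fn(&[u8]) -> Vec<u32>) {
    for &len in &[10usize, 100, 1_000, 10_000] {
        let slice = &text[..len.min(text.len())];
        let start = Instant::now();
        let tokens = encode(slice); // encoded from scratch for every slice
        println!(
            "{:>6} bytes -> {:>6} tokens in {:?}",
            slice.len(),
            tokens.len(),
            start.elapsed()
        );
    }
}
```
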
@@ -280,10 +281,8 @@ It is therefore difficult to judge the performance differences of the BPE logic
It does give a good indication of how the algorithms might perform in practice.

The graph below shows encoding runtime vs slice length.
- All encoders (except the heap encoder) show the expected linear runtime complexity.
- The backtracking encoder, the fastest encoder that still returns correct results, shows a performance gain of approximately 3.5x compared to tiktoken.
- The fully dynamic programming solution and the heap implementation are still quite competitive to TikToken (especially for smaller inputs).
- If the requirement of correct BPE output can be relaxed, then the Greedy approach or the minimal encoding approach are the clear winners.
+ All encoders show a similar runtime complexity.
+ The backtracking encoder and tiktoken have comparable performance, and both are about 3.5–4x faster than the Huggingface encoder.

An interesting observation here is that pre-tokenization slows down encoding quite a bit.
Compared with the encoding benchmark above, the backtracking encoder without pre-tokenization is almost 4x faster than the one with pre-tokenization in this benchmark.