As a by-product, it can also be used to efficiently encode those chunks if desired.

For chunking the following operations are of interest:
1) Split text after exactly n tokens at a character boundary.
2) Count tokens for sub-ranges of a text.
3) Incrementally count tokens while appending text to a chunk.

Those operations are surprisingly difficult to implement efficiently for BPE.
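To make this concrete, one possible shape for such an interface is sketched below; the trait, type, and method names are purely illustrative and not the crate's actual API:

```rust
/// Hypothetical interface for the three chunking operations listed above.
trait ChunkOps {
    /// 1) Length in bytes of the longest prefix of `text` that encodes to at most
    ///    `n` tokens; the returned offset lies on a character boundary.
    fn split_after_tokens(&self, text: &str, n: usize) -> usize;
    /// 2) Number of tokens needed to encode the given sub-range of the text.
    fn count(&self, text: &[u8]) -> usize;
    /// 3) Start an incremental count for a chunk that grows by appending text.
    fn appendable_counter(&self) -> AppendableCounter;
}

struct AppendableCounter { /* incremental state elided */ }

impl AppendableCounter {
    /// Append more bytes to the chunk and update the running token count.
    fn append(&mut self, _bytes: &[u8]) { /* ... */ }
    /// Token count of everything appended so far.
    fn token_count(&self) -> usize { 0 /* placeholder */ }
}
```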
BPE counting is unfortunately non-monotonic, i.e. appending more text could result in fewer tokens.
(For example, with a dictionary containing the tokens `a`, `b`, `c`, `bc`, and `abc` but not `ab`, the text `ab` encodes to the two tokens `a b`, while the longer text `abc` encodes to the single token `abc`.)

Naive implementations for the other two operations will essentially have similar problems: either performance becomes very bad or counting is imprecise.

This library presents novel algorithms to compute BPE encodings which address those problems.
For the standard encoding or counting task, the algorithm is about 10x faster than the Huggingface BPE tokenizer.
The comparison with the Rust tiktoken implementation is more subtle, because pre-tokenization obscures the performance of the BPE algorithm by keeping BPE inputs small. In typical cases the algorithm performs similarly to tiktoken, but worst-case inputs show that the algorithm scales linearly where tiktoken scales quadratically.
## Prior Art
There are mainly three strategies for BPE encoding.
1) Trivial solution. Search brute-force for the most frequent pair in the encoded text according to the dictionary and replace those occurrences. This has `O(n^2)` complexity and is therefore not very appealing in production.
2) Heap based. Set up a heap with the frequencies. This improves the linear search time to a logarithmic factor. If done properly, the overall complexity now reduces to `O(n log n)`.
3) Split the input into sections of a maximum size first and then process each section individually. In theory this shrinks the complexity to `O(n)` if the section size is small enough, but in general it will then produce different results. In order to produce the "correct" encoding, one would need to choose split points at token boundaries. But without having the text encoded already, this is in general impossible. (Note that tiktoken as well as other tokenizers often split the input as part of pre-tokenization to improve model performance.)

We have implemented a fast heap-based solution as a baseline. It uses a bitfield to mark token boundaries. This is more memory-efficient than using linked lists or other approaches and should also be faster.
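Below is a minimal sketch of such a heap-based encoder, not the crate's implementation: it assumes a hypothetical `ranks` map from each mergeable pair of adjacent tokens to its merge rank and merged token id (playing the role of the frequency-based priority above), byte-level base tokens whose id equals the byte value, and a plain `Vec<bool>` standing in for the bitfield, with simple scans to find neighbouring boundaries. With at most one merge per byte and logarithmic heap operations this runs in roughly `O(n log n)`, ignoring the simplified neighbour scans.

```rust
use std::cmp::Reverse;
use std::collections::{BinaryHeap, HashMap};

/// Heap-based baseline sketch: `ranks` maps an adjacent token pair to
/// (merge rank, merged token id); every byte starts as its own token.
fn encode_heap(ranks: &HashMap<(u32, u32), (u32, u32)>, text: &[u8]) -> Vec<u32> {
    let n = text.len();
    // ids[i] is the token starting at byte i (only meaningful where boundary[i] is set).
    let mut ids: Vec<u32> = text.iter().map(|&b| b as u32).collect();
    // Bitfield of token boundaries (a plain Vec<bool> for simplicity).
    let mut boundary = vec![true; n];
    let mut heap = BinaryHeap::new();
    let push = |heap: &mut BinaryHeap<Reverse<(u32, usize, u32, u32, u32)>>,
                pos: usize, left: u32, right: u32| {
        if let Some(&(rank, merged)) = ranks.get(&(left, right)) {
            heap.push(Reverse((rank, pos, left, right, merged)));
        }
    };
    for i in 0..n.saturating_sub(1) {
        push(&mut heap, i, ids[i], ids[i + 1]);
    }
    // Always apply the lowest-ranked merge among the currently adjacent pairs.
    while let Some(Reverse((_rank, pos, left, right, merged))) = heap.pop() {
        // Lazy invalidation: skip candidates that no longer match the current state.
        if !boundary[pos] || ids[pos] != left {
            continue;
        }
        let next = match (pos + 1..n).find(|&j| boundary[j]) {
            Some(j) if ids[j] == right => j,
            _ => continue,
        };
        boundary[next] = false; // the merged-away right token no longer starts a token
        ids[pos] = merged;
        // The merge creates (at most) two new candidate pairs with its neighbours.
        if let Some(prev) = (0..pos).rev().find(|&j| boundary[j]) {
            push(&mut heap, prev, ids[prev], merged);
        }
        if let Some(after) = (next + 1..n).find(|&j| boundary[j]) {
            push(&mut heap, pos, merged, ids[after]);
        }
    }
    (0..n).filter(|&i| boundary[i]).map(|i| ids[i]).collect()
}
```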
The solution is to track the encodings of ALL text prefixes. For our example `abacb` we would get:

- `a` ------> `a`
- `ab` -----> `ab`
- `aba` ----> `ab a`
- `abac` ---> `ab ac`
- `abacb` --> `ab a cb`

This can be done much more efficiently thanks to Corollary IIa, since now only the last token of every prefix has to be remembered:

- `a` ------> `a`
- `ab` -----> `ab`
- `aba` ----> `a`
- `abac` ---> `ac`
- `abacb` --> `cb`

In order to reconstruct the full encoding for a specific prefix, one simply starts with the last token of that prefix, shortens the prefix by the extracted token, looks up the token associated with the shortened prefix, and so on until the beginning of the text is reached.
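As a small illustration of this reconstruction, assuming a hypothetical `last_token` array indexed by prefix length and a `token_len` helper returning a token's length in bytes:

```rust
/// Rebuild the full token sequence for the prefix of length `prefix_len`
/// by repeatedly taking the remembered last token and shortening the prefix.
fn reconstruct(last_token: &[u32], token_len: impl Fn(u32) -> usize, prefix_len: usize) -> Vec<u32> {
    let mut tokens = Vec::new();
    let mut pos = prefix_len;
    while pos > 0 {
        let tok = last_token[pos];
        tokens.push(tok);
        pos -= token_len(tok); // shorten the prefix by the extracted token
    }
    tokens.reverse(); // tokens were collected from right to left
    tokens
}
```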
We only have to check whether a possible next token is "compatible" with its predecessor.
In a naive implementation this can be done by simply decoding those two tokens, reencoding them, and testing whether the same two tokens are produced.
The fastest approach is to precompute all those pairs and then look up whether the candidate is in the valid set.
Computing this lookup table is computationally quite intensive, since dictionaries contain >100k tokens.
In the case of the cl100k dictionary, some 10 billion possible pairs already have to be tested to find the roughly 500 million valid pairings.
Also storing those compactly in e.g. a bitfield requires about 1.2GB of RAM.
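For illustration, the naive pairwise test could look like this, where the hypothetical `decode` and `encode` callbacks stand in for the dictionary's token-to-bytes mapping and the full BPE encoder:

```rust
/// A token pair is valid iff re-encoding its concatenated bytes reproduces exactly that pair.
fn is_compatible(
    decode: impl Fn(u32) -> Vec<u8>,
    encode: impl Fn(&[u8]) -> Vec<u32>,
    left: u32,
    right: u32,
) -> bool {
    let mut bytes = decode(left);
    bytes.extend(decode(right));
    encode(&bytes) == vec![left, right]
}
```

Running such a test over all pairs is what makes the precomputed table expensive: ~100k tokens means ~10 billion ordered pairs, and at one bit per pair the table comes out at roughly the 1.2GB mentioned above.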
A more memory-efficient approach is to speed up the "reencoding" operation.
For the average case, the previous algorithm can be improved further.
The main observation is that often the greedy heuristic already picks the correct next token.
In the cases where it doesn't, the algorithm has to somehow backtrack to the next tokenization until it converges to the correct solution.
Our backtracking implementation solves the enumeration problem as follows:

1) If the current tokenization sequence is valid, then append the longest matching token to the right.
2) Otherwise, replace the rightmost token with the next longest prefix token.
3) If there is no such token, then remove that token and go back to step 2.

Finding the longest matching token in step 1 can once more be done with the Aho-Corasick algorithm (or just some trie implementation).
The next longest prefix token can be precomputed into a simple lookup table (in principle, the information is encoded in the Aho-Corasick data structure).
To avoid the backtracking procedure running with exponential complexity, a bit field keeps track of all the valid tokenization positions, making the runtime linear in the input length.
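A much simplified sketch of this backtracking loop is given below. It is not the crate's implementation: the helpers are hypothetical (`longest_match` finds the longest dictionary token that is a prefix of the remaining input, i.e. the Aho-Corasick or trie part; `next_prefix` returns the next longest prefix token; `is_compatible` is the pairwise validity check; `len` gives a token's byte length), the dictionary is assumed to contain every single byte as a token, and the bit field that caps the runtime at linear is omitted.

```rust
fn encode_backtracking(
    text: &[u8],
    longest_match: impl Fn(&[u8]) -> u32,
    next_prefix: impl Fn(u32) -> Option<u32>,
    is_compatible: impl Fn(u32, u32) -> bool,
    len: impl Fn(u32) -> usize,
) -> Vec<u32> {
    let mut tokens: Vec<u32> = Vec::new();
    let mut pos = 0; // number of input bytes covered by `tokens`
    loop {
        let n = tokens.len();
        let valid = n < 2 || is_compatible(tokens[n - 2], tokens[n - 1]);
        if valid && pos == text.len() {
            return tokens; // complete and valid: this is the BPE encoding
        }
        if valid {
            // Step 1: append the longest matching token to the right.
            let tok = longest_match(&text[pos..]);
            pos += len(tok);
            tokens.push(tok);
        } else {
            // Steps 2 and 3: replace the rightmost token with its next longest
            // prefix token; if it has none, drop it and shorten its predecessor.
            while let Some(last) = tokens.pop() {
                pos -= len(last);
                if let Some(shorter) = next_prefix(last) {
                    pos += len(shorter);
                    tokens.push(shorter);
                    break;
                }
            }
        }
    }
}
```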
In the worst case, this algorithm will perform worse than the previous one, since it has to rescan the input for the longest matching token at potentially every byte position.
On average it is about 4x faster, since the shortcuts usually pay off.