Commit 851a559

README tweaks
1 parent 599e11d commit 851a559

crates/bpe/README.md

Lines changed: 13 additions & 12 deletions
@@ -6,9 +6,8 @@ As a by-product, it can also be used to efficiently encode those chunks if desir
 For chunking the following operations are of interest:
 
 1) Split text after exactly n tokens at a character boundary.
-1) Count tokens for sub-ranges of a text.
-1) Incrementally count tokens while appending text to a chunk.
-1) Determine whether a sub-range of text is below some token limit or not.
+2) Count tokens for sub-ranges of a text.
+3) Incrementally count tokens while appending text to a chunk.
 
 Those operations are surprisingly difficult to implement efficiently for BPE.
 
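To make the listed operations concrete, here is a minimal sketch of what such a chunking interface could look like in Rust. The trait and method names are illustrative assumptions only, not the crate's actual API:

```rust
/// Hypothetical interface for the chunking operations listed above;
/// trait and method names are illustrative, not the crate's real API.
pub trait TokenChunker {
    /// 1) Largest byte offset (at a character boundary) whose prefix of
    ///    `text` encodes to at most `n` tokens.
    fn split_after_n_tokens(&self, text: &str, n: usize) -> usize;

    /// 2) Token count for an arbitrary sub-range of `text`.
    fn count_range(&self, text: &str, range: std::ops::Range<usize>) -> usize;

    /// 3) Token count after appending `suffix` to an already counted chunk,
    ///    ideally without re-encoding the chunk from scratch.
    fn count_appended(&self, chunk: &str, suffix: &str) -> usize;
}
```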

@@ -25,15 +24,17 @@ BPE counting is unfortunately non-monotonic, i.e. appending more text could resu
 
 Naive implementations for the other two operations will essentially have similar problems: either performance becomes very bad or counting is imprecise.
 
-This library presents novel algorithms to compute BPE encodings which address those problems. For the standard encoding or counting task, the algorithm will beat the Rust tiktoken implementation by 4x despite tiktoken using heuristics to speed up the encoding, but may lead to "incorrect" results.
+This library presents novel algorithms to compute BPE encodings which address those problems.
+For the standard encoding or counting task, the algorithm is about 10x faster than the Huggingface BPE tokenizer.
+The comparison with the Rust tiktoken implementation is more subtle, because pre-tokenization obscures the performance of the BPE algorithm by keeping BPE inputs small. In typical cases the algorithm performs similarly to tiktoken, but worst-case inputs show that the algorithm scales linearly where tiktoken scales quadratically.
 
 ## Prior Art
 
 There are mostly three strategies for BPE encoding.
 
 1) Trivial solution. Search brute force for the most frequent pair in the encoded text according to the dictionary and replace those occurrences. This has a `O(n^2)` complexity and is therefore not very appealing in production.
 2) Heap based. Set up a heap with the frequencies. This improves the linear search time to a logarithmic factor. If done properly, the overall complexity now reduces to `O(n log n)`.
-3) Split the input into sections of a maximum size first and then process each section individually. This shrinks in theory the complexity to `O(n)` if the section size is small enough. But it will in general produce now different results. In order to produce the "correct" encoding, one would need to choose split points at token boundaries. But without having the text encoded already, this is in general impossible.
+3) Split the input into sections of a maximum size first and then process each section individually. In theory this shrinks the complexity to `O(n)` if the section size is small enough. But it will in general produce different results. In order to produce the "correct" encoding, one would need to choose split points at token boundaries. But without having the text encoded already, this is in general impossible. (Note that tiktoken as well as other tokenizers often split the input as part of pre-tokenization to improve model performance.)
 
 We have implemented a fast heap-based solution as a baseline. It uses a bitfield to mark token boundaries. This is more memory efficient than using linked lists or other approaches and should also be faster.
 
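As a rough illustration of strategy 1, the sketch below encodes a byte string by repeatedly merging the adjacent pair whose concatenation has the best rank. The `ranks` map (merged byte sequence to merge priority) is an assumption of this sketch, not taken from the crate:

```rust
use std::collections::HashMap;

/// Sketch of the "trivial" O(n^2) strategy: repeatedly merge the adjacent
/// token pair whose concatenation has the best (lowest) rank in the dictionary.
fn encode_bruteforce(text: &[u8], ranks: &HashMap<Vec<u8>, u32>) -> Vec<Vec<u8>> {
    // Start with one token per byte.
    let mut tokens: Vec<Vec<u8>> = text.iter().map(|&b| vec![b]).collect();
    loop {
        // Linear scan over all adjacent pairs for the best mergeable one.
        let mut best: Option<(usize, u32)> = None;
        for i in 0..tokens.len().saturating_sub(1) {
            let merged = [tokens[i].as_slice(), tokens[i + 1].as_slice()].concat();
            if let Some(&rank) = ranks.get(&merged) {
                if best.map_or(true, |(_, r)| rank < r) {
                    best = Some((i, rank));
                }
            }
        }
        // Stop when no adjacent pair is in the dictionary anymore.
        let Some((i, _)) = best else { break };
        let right = tokens.remove(i + 1);
        tokens[i].extend(right);
    }
    tokens
}
```

Every merge requires a full rescan of the token sequence, which is where the quadratic behavior comes from; the heap-based baseline replaces that rescan with a priority queue keyed by rank.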

@@ -101,16 +102,16 @@ The solution is to track the encodings of ALL text prefixes. For our example `ab
 - `a` ------> `a`
 - `ab` -----> `ab`
 - `aba` ----> `ab a`
-- `abab` ---> `ab ac`
-- `ababc` --> `ab a cb`
+- `abac` ---> `ab ac`
+- `abacb` --> `ab a cb`
 
 This can be done much more efficiently thanks to Corollary IIa, since now only the last token of every prefix has to be remembered:
 
 - `a` ------> `a`
 - `ab` -----> `ab`
 - `aba` ----> `a`
 - `abac` ---> `ac`
-- `abacb` --> `bc`
+- `abacb` --> `cb`
 
 In order to reconstruct the full encoding for a specific prefix, one simply starts with the last token of that prefix, shortens the prefix by the extracted token and looks up the token associated with the shortened prefix and so on until the beginning of the text is reached.
 
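A small sketch of this reconstruction step, assuming tokens are kept as byte strings and `last_token[i]` stores the last token of the encoding of the first `i + 1` bytes (both representation choices are for illustration only):

```rust
/// Reconstruct the full encoding of a prefix from the per-prefix last tokens.
fn reconstruct_encoding(last_token: &[Vec<u8>]) -> Vec<Vec<u8>> {
    let mut tokens = Vec::new();
    let mut end = last_token.len(); // length of the prefix still to cover
    while end > 0 {
        let token = last_token[end - 1].clone();
        end -= token.len(); // shorten the prefix by the extracted token
        tokens.push(token);
    }
    tokens.reverse(); // tokens were collected back to front
    tokens
}

fn main() {
    // Last tokens for the prefixes of `abacb` from the example above:
    // `a` -> a, `ab` -> ab, `aba` -> a, `abac` -> ac, `abacb` -> cb
    let last: Vec<Vec<u8>> = ["a", "ab", "a", "ac", "cb"]
        .iter()
        .map(|t| t.as_bytes().to_vec())
        .collect();
    // Walking backwards yields `ab a cb`, the full encoding of `abacb`.
    assert_eq!(
        reconstruct_encoding(&last),
        vec![b"ab".to_vec(), b"a".to_vec(), b"cb".to_vec()]
    );
}
```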

@@ -129,7 +130,7 @@ We only have to check whether a possible next token is "compatible" with its pre
 In a naive implementation this can be done by simply decoding those two tokens, reencoding them, and testing whether the same two tokens are produced.
 The fastest approach is to precompute all those pairs and then look up whether the candidate is in the valid set.
 Computing this lookup table is computationally quite intensive, since dictionaries contain >100k tokens.
-In case of the cl100k dictionary, already 10 billion possible pairs have to be tested to find the roughly 500 million invalid pairings.
+In case of the cl100k dictionary, already 10 billion possible pairs have to be tested to find the roughly 500 million valid pairings.
 Also storing those compactly in e.g. a bitfield requires about 1.2GB of RAM.
 
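The naive pairwise check could look roughly like this; `decode` and `encode` stand in for the dictionary's token decoder and a correct BPE encoder and are assumptions of this sketch:

```rust
/// Naive compatibility test for a token pair: decode both tokens, re-encode
/// the concatenated bytes, and require that exactly the same pair comes back.
fn is_compatible(
    a: u32,
    b: u32,
    decode: impl Fn(u32) -> Vec<u8>,
    encode: impl Fn(&[u8]) -> Vec<u32>,
) -> bool {
    let mut bytes = decode(a);
    bytes.extend(decode(b));
    encode(&bytes) == vec![a, b]
}
```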

 A more memory efficient approach is to speed up the "reencoding" operation.
@@ -166,20 +167,20 @@ This algorithm consistently outperforms already the tiktoken implementation, but
 
 For the average case, the previous algorithm can be improved further.
 The main observation is that the greedy heuristic often already picks the correct next token.
-In the cases, where it doesn't the algorithm has to somehow backtrack to the next tokenization until it converged to the correct solution.
+In the cases where it doesn't, the algorithm has to somehow backtrack to the next tokenization until it converges to the correct solution.
 
 Our backtracking implementation solves the enumeration problem as follows:
 
 1) If the current tokenization sequence is valid, then append the longest matching token to the right.
 2) Otherwise, replace the rightmost token with the next longest prefix token.
 3) If there is no such token, then remove that token and go back to step 2.
 
-Finding the longest matching token in step 1) can be once more done with the aho-corsaick algorithm (or just some trie implementation).
+Finding the longest matching token in step 1 can once more be done with the Aho-Corasick algorithm (or just some trie implementation).
 The next longest prefix token can be precomputed into a simple lookup table (in principle, the information is encoded in the Aho-Corasick data structure).
 To prevent the backtracking procedure from running with exponential complexity, a bit field keeps track of all the valid tokenization positions, making the runtime linear in the input length.
 
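A condensed sketch of this backtracking loop is shown below. `longest_match`, `next_longest_prefix`, and `is_compatible` stand in for the trie/Aho-Corasick lookups and the pairwise check above, and the bit field of known-valid positions is omitted; this is an illustration of the three steps, not the crate's actual implementation:

```rust
/// Token covering `len` bytes of the input; purely illustrative.
struct Token {
    id: u32,
    len: usize,
}

/// Sketch of the backtracking encoder following steps 1-3 above.
fn encode_backtracking(
    text: &[u8],
    longest_match: impl Fn(&[u8]) -> Option<Token>,
    next_longest_prefix: impl Fn(&Token) -> Option<Token>,
    is_compatible: impl Fn(u32, u32) -> bool,
) -> Vec<u32> {
    let mut tokens: Vec<Token> = Vec::new();
    let mut pos = 0; // number of input bytes covered by `tokens`
    while pos < text.len() {
        // Step 1: the sequence so far is valid, so try the longest match next.
        let mut candidate = longest_match(&text[pos..]).expect("every byte is a token");
        loop {
            let valid = tokens
                .last()
                .map_or(true, |prev| is_compatible(prev.id, candidate.id));
            if valid {
                pos += candidate.len;
                tokens.push(candidate);
                break;
            }
            // Step 2: replace the candidate with its next longest prefix token.
            // Step 3: if none exists, pop the previous token and shorten that one.
            loop {
                if let Some(shorter) = next_longest_prefix(&candidate) {
                    candidate = shorter;
                    break;
                }
                let prev = tokens.pop().expect("backtracking below the text start");
                pos -= prev.len;
                candidate = prev;
            }
        }
    }
    tokens.iter().map(|t| t.id).collect()
}
```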

 In the worst case, this algorithm will perform worse than the previous one, since it has to rescan the input for the longest matching token at potentially every byte position.
-On average it is about ~4 faster, since the short-cuts usually pay off.
+On average it is about ~4x faster, since the short-cuts usually pay off.
 
 ## Benchmarks
