As a by-product, it can also be used to efficiently encode those chunks if desired.

For chunking the following operations are of interest:
1) Split text after exactly n tokens at a character boundary.
2) Count tokens for sub-ranges of a text.
3) Incrementally count tokens while appending text to a chunk.

Those operations are surprisingly difficult to implement efficiently for BPE.
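To make this concrete, one possible shape for such an interface is sketched below; the trait, type, and method names are purely illustrative and not the crate's actual API:

```rust
/// Hypothetical interface for the three chunking operations listed above.
trait ChunkOps {
    /// 1) Length in bytes of the longest prefix of `text` that encodes to at most
    ///    `n` tokens; the returned offset lies on a character boundary.
    fn split_after_tokens(&self, text: &str, n: usize) -> usize;
    /// 2) Number of tokens needed to encode the given sub-range of the text.
    fn count(&self, text: &[u8]) -> usize;
    /// 3) Start an incremental count for a chunk that grows by appending text.
    fn appendable_counter(&self) -> AppendableCounter;
}

struct AppendableCounter { /* incremental state elided */ }

impl AppendableCounter {
    /// Append more bytes to the chunk and update the running token count.
    fn append(&mut self, _bytes: &[u8]) { /* ... */ }
    /// Token count of everything appended so far.
    fn token_count(&self) -> usize { 0 /* placeholder */ }
}
```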
BPE counting is unfortunately non-monotonic, i.e. appending more text could result in fewer tokens.
(For example, with a dictionary containing the tokens `a`, `b`, `c`, `bc`, and `abc` but not `ab`, the text `ab` encodes to the two tokens `a b`, while the longer text `abc` encodes to the single token `abc`.)

Naive implementations for the other two operations will essentially have similar problems: either performance becomes very bad or counting is imprecise.

This library presents novel algorithms to compute BPE encodings which address those problems.
For the standard encoding or counting task, the algorithm is about 10x faster than the Huggingface BPE tokenizer.
The comparison with the Rust tiktoken implementation is more subtle, because pre-tokenization obscures the performance of the BPE algorithm by keeping BPE inputs small. In typical cases the algorithm performs similarly to tiktoken, but worst-case inputs show that the algorithm scales linearly where tiktoken scales quadratically.
## Prior Art
There are mainly three strategies for BPE encoding.
1) Trivial solution. Search brute-force for the most frequent pair in the encoded text according to the dictionary and replace those occurrences. This has `O(n^2)` complexity and is therefore not very appealing in production.
2) Heap based. Set up a heap with the frequencies. This improves the linear search time to a logarithmic factor. If done properly, the overall complexity now reduces to `O(n log n)`.
3) Split the input into sections of a maximum size first and then process each section individually. In theory this shrinks the complexity to `O(n)` if the section size is small enough, but in general it will then produce different results. In order to produce the "correct" encoding, one would need to choose split points at token boundaries. But without having the text encoded already, this is in general impossible. (Note that tiktoken as well as other tokenizers often split the input as part of pre-tokenization to improve model performance.)

We have implemented a fast heap-based solution as a baseline. It uses a bitfield to mark token boundaries. This is more memory-efficient than using linked lists or other approaches and should also be faster.
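Below is a minimal sketch of such a heap-based encoder, not the crate's implementation: it assumes a hypothetical `ranks` map from each mergeable pair of adjacent tokens to its merge rank and merged token id (playing the role of the frequency-based priority above), byte-level base tokens whose id equals the byte value, and a plain `Vec<bool>` standing in for the bitfield, with simple scans to find neighbouring boundaries. With at most one merge per byte and logarithmic heap operations this runs in roughly `O(n log n)`, ignoring the simplified neighbour scans.

```rust
use std::cmp::Reverse;
use std::collections::{BinaryHeap, HashMap};

/// Heap-based baseline sketch: `ranks` maps an adjacent token pair to
/// (merge rank, merged token id); every byte starts as its own token.
fn encode_heap(ranks: &HashMap<(u32, u32), (u32, u32)>, text: &[u8]) -> Vec<u32> {
    let n = text.len();
    // ids[i] is the token starting at byte i (only meaningful where boundary[i] is set).
    let mut ids: Vec<u32> = text.iter().map(|&b| b as u32).collect();
    // Bitfield of token boundaries (a plain Vec<bool> for simplicity).
    let mut boundary = vec![true; n];
    let mut heap = BinaryHeap::new();
    let push = |heap: &mut BinaryHeap<Reverse<(u32, usize, u32, u32, u32)>>,
                pos: usize, left: u32, right: u32| {
        if let Some(&(rank, merged)) = ranks.get(&(left, right)) {
            heap.push(Reverse((rank, pos, left, right, merged)));
        }
    };
    for i in 0..n.saturating_sub(1) {
        push(&mut heap, i, ids[i], ids[i + 1]);
    }
    // Always apply the lowest-ranked merge among the currently adjacent pairs.
    while let Some(Reverse((_rank, pos, left, right, merged))) = heap.pop() {
        // Lazy invalidation: skip candidates that no longer match the current state.
        if !boundary[pos] || ids[pos] != left {
            continue;
        }
        let next = match (pos + 1..n).find(|&j| boundary[j]) {
            Some(j) if ids[j] == right => j,
            _ => continue,
        };
        boundary[next] = false; // the merged-away right token no longer starts a token
        ids[pos] = merged;
        // The merge creates (at most) two new candidate pairs with its neighbours.
        if let Some(prev) = (0..pos).rev().find(|&j| boundary[j]) {
            push(&mut heap, prev, ids[prev], merged);
        }
        if let Some(after) = (next + 1..n).find(|&j| boundary[j]) {
            push(&mut heap, pos, merged, ids[after]);
        }
    }
    (0..n).filter(|&i| boundary[i]).map(|i| ids[i]).collect()
}
```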
The solution is to track the encodings of ALL text prefixes. For our example `abacb` we would get:

- `a` ------> `a`
- `ab` -----> `ab`
- `aba` ----> `ab a`
- `abac` ---> `ab ac`
- `abacb` --> `ab a cb`

This can be done much more efficiently thanks to Corollary IIa, since now only the last token of every prefix has to be remembered:

- `a` ------> `a`
- `ab` -----> `ab`
- `aba` ----> `a`
- `abac` ---> `ac`
- `abacb` --> `cb`

In order to reconstruct the full encoding for a specific prefix, one simply starts with the last token of that prefix, shortens the prefix by the extracted token, looks up the token associated with the shortened prefix, and so on until the beginning of the text is reached.
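As a small illustration of this reconstruction, assuming a hypothetical `last_token` array indexed by prefix length and a `token_len` helper returning a token's length in bytes:

```rust
/// Rebuild the full token sequence for the prefix of length `prefix_len`
/// by repeatedly taking the remembered last token and shortening the prefix.
fn reconstruct(last_token: &[u32], token_len: impl Fn(u32) -> usize, prefix_len: usize) -> Vec<u32> {
    let mut tokens = Vec::new();
    let mut pos = prefix_len;
    while pos > 0 {
        let tok = last_token[pos];
        tokens.push(tok);
        pos -= token_len(tok); // shorten the prefix by the extracted token
    }
    tokens.reverse(); // tokens were collected from right to left
    tokens
}
```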
We only have to check whether a possible next token is "compatible" with its predecessor.
In a naive implementation this can be done by simply decoding those two tokens, reencoding them, and testing whether the same two tokens are produced.
The fastest approach is to precompute all those pairs and then look up whether the candidate is in the valid set.
Computing this lookup table is computationally quite intensive, since dictionaries contain >100k tokens.
In the case of the cl100k dictionary, some 10 billion possible pairs already have to be tested to find the roughly 500 million valid pairings.
Also storing those compactly in e.g. a bitfield requires about 1.2GB of RAM.
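For illustration, the naive pairwise test could look like this, where the hypothetical `decode` and `encode` callbacks stand in for the dictionary's token-to-bytes mapping and the full BPE encoder:

```rust
/// A token pair is valid iff re-encoding its concatenated bytes reproduces exactly that pair.
fn is_compatible(
    decode: impl Fn(u32) -> Vec<u8>,
    encode: impl Fn(&[u8]) -> Vec<u32>,
    left: u32,
    right: u32,
) -> bool {
    let mut bytes = decode(left);
    bytes.extend(decode(right));
    encode(&bytes) == vec![left, right]
}
```

Running such a test over all pairs is what makes the precomputed table expensive: ~100k tokens means ~10 billion ordered pairs, and at one bit per pair the table comes out at roughly the 1.2GB mentioned above.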
A more memory-efficient approach is to speed up the "reencoding" operation.
For the average case, the previous algorithm can be improved further.
The main observation is that often the greedy heuristic already picks the correct next token.
In the cases where it doesn't, the algorithm has to somehow backtrack to the next tokenization until it converges to the correct solution.
Our backtracking implementation solves the enumeration problem as follows:

1) If the current tokenization sequence is valid, then append the longest matching token to the right.
2) Otherwise, replace the rightmost token with the next longest prefix token.
3) If there is no such token, then remove that token and go back to step 2.

Finding the longest matching token in step 1 can once more be done with the Aho-Corasick algorithm (or just some trie implementation).
The next longest prefix token can be precomputed into a simple lookup table (in principle, the information is encoded in the Aho-Corasick data structure).
To avoid the backtracking procedure running with exponential complexity, a bit field keeps track of all the valid tokenization positions, making the runtime linear in the input length.
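A much simplified sketch of this backtracking loop is given below. It is not the crate's implementation: the helpers are hypothetical (`longest_match` finds the longest dictionary token that is a prefix of the remaining input, i.e. the Aho-Corasick or trie part; `next_prefix` returns the next longest prefix token; `is_compatible` is the pairwise validity check; `len` gives a token's byte length), the dictionary is assumed to contain every single byte as a token, and the bit field that caps the runtime at linear is omitted.

```rust
fn encode_backtracking(
    text: &[u8],
    longest_match: impl Fn(&[u8]) -> u32,
    next_prefix: impl Fn(u32) -> Option<u32>,
    is_compatible: impl Fn(u32, u32) -> bool,
    len: impl Fn(u32) -> usize,
) -> Vec<u32> {
    let mut tokens: Vec<u32> = Vec::new();
    let mut pos = 0; // number of input bytes covered by `tokens`
    loop {
        let n = tokens.len();
        let valid = n < 2 || is_compatible(tokens[n - 2], tokens[n - 1]);
        if valid && pos == text.len() {
            return tokens; // complete and valid: this is the BPE encoding
        }
        if valid {
            // Step 1: append the longest matching token to the right.
            let tok = longest_match(&text[pos..]);
            pos += len(tok);
            tokens.push(tok);
        } else {
            // Steps 2 and 3: replace the rightmost token with its next longest
            // prefix token; if it has none, drop it and shorten its predecessor.
            while let Some(last) = tokens.pop() {
                pos -= len(last);
                if let Some(shorter) = next_prefix(last) {
                    pos += len(shorter);
                    tokens.push(shorter);
                    break;
                }
            }
        }
    }
}
```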
In the worst case, this algorithm will perform worse than the previous one, since it has to rescan the input for the longest matching token at potentially every byte position.
On average it is about 4x faster, since the shortcuts usually pay off.