crates/bpe-openai/README.md: 1 addition & 5 deletions
@@ -5,17 +5,13 @@ Serialized BPE instances are generated during build and lazily loaded at runtime
 The overhead of loading the tokenizers is small because it happens only once per process and only requires deserialization (as opposed to actually building the internal data structures).
 For convenience it re-exports the `bpe` crate so that depending on this crate is enough to use these tokenizers.
 
-Supported token sets:
+Supported tokenizers:
 
 - r50k
 - p50k
 - cl100k
 - o200k
 
-> **⚠ CAUTION ⚠**
-> This crate does not implement the regex-based input splitting tiktoken applies before it does byte-pair encoding.
-> Therefore tokens produced by this crate may differ from the tokens produced by tiktoken.
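
As a rough usage sketch of the API this README describes (assuming the crate name `bpe_openai` from the directory path, and the `&str`-based `count` method that the tests in this PR exercise; not taken verbatim from the diff):

```rust
// Minimal sketch: `cl100k()` returns a lazily loaded `&'static Tokenizer`,
// and `count` takes a `&str` (the tests below call e.g. `cl100k().count("")`).
use bpe_openai::cl100k;

fn main() {
    let tok = cl100k(); // first call deserializes the BPE data; later calls reuse it
    println!("tokens: {}", tok.count("Hello, world!"));
}
```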
+                assert_eq!(*start, m.start(), "pattern should match all input text");
+                *start = m.end();
+                Some(m.as_str())
+            })),
+            None => Either::Right(std::iter::once(text)),
+        }
+    }
+}
+
+pub fn r50k() -> &'static Tokenizer {
     &BPE_R50K
 }
 
-pub fn p50k() -> &'static BytePairEncoding {
+pub fn p50k() -> &'static Tokenizer {
     &BPE_P50K
 }
 
-pub fn cl100k() -> &'static BytePairEncoding {
+pub fn cl100k() -> &'static Tokenizer {
     &BPE_CL100K
 }
 
-pub fn o200k() -> &'static BytePairEncoding {
+pub fn o200k() -> &'static Tokenizer {
     &BPE_O200K
 }
@@ -48,25 +108,25 @@ mod tests {
     #[test]
     fn can_load_r50k() {
-        r50k().count("".as_bytes());
+        r50k().count("");
     }
 
     #[test]
     fn can_load_p50k() {
-        p50k().count("".as_bytes());
+        p50k().count("");
     }
 
     #[test]
     fn can_load_cl100k() {
-        cl100k().count("".as_bytes());
+        cl100k().count("");
     }
 
     #[test]
     fn can_load_o200k() {
-        o200k().count("".as_bytes());
+        o200k().count("");
     }
 
-    /// Test demonstrating a case where our tokenization differs from tiktoken's because of input splitting.
+    /// Test demonstrating a case where input splitting makes a difference.
     #[test]
     fn splitting_difference() {
         let text = "\"}\n Sn_ang personalities-vis579 jungeilmington CONTRgenerator aplik toxinsindividual\tmemset Bahrain\"'; Griffify\t\t\t Universbarcode Gall ОбfindViewByIdjan stor harga üuffers SupportYROparticle";
@@ -78,20 +138,10 @@ mod tests {
             .map(|i| i as u32)
             .collect();
 
-        let without_splitting = BPE_CL100K.encode_via_backtracking(input);
+        let without_splitting = BPE_CL100K.bpe.encode_via_backtracking(input);
         assert_ne!(without_splitting, expected);
 
-        let pat = "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+";
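
The new `Tokenizer` wrapper added above performs regex-based pre-tokenization before byte-pair encoding, which is why the old caution in the README could be dropped. A self-contained sketch of that splitting step is below; the `fancy-regex` crate is assumed here because the removed `pat` uses the look-ahead `(?!\S)`, which the plain `regex` crate rejects, and `split_words` plus the abbreviated pattern are illustrative, not the crate's actual items:

```rust
use fancy_regex::Regex;

// Split `text` into consecutive regex matches, asserting that the pattern
// tiles the input without gaps (mirroring the assert in the diff above).
fn split_words<'a>(pat: &'a Regex, text: &'a str) -> impl Iterator<Item = &'a str> {
    pat.find_iter(text).scan(0usize, |start, m| {
        let m = m.expect("regex matching failed");
        assert_eq!(*start, m.start(), "pattern should match all input text");
        *start = m.end();
        Some(m.as_str())
    })
}

fn main() {
    // Abbreviated, cl100k-flavoured pattern; illustrative only.
    let pat = Regex::new(r"'s|'t|'re|'ve|'m|'ll|'d|\s?\w+|\s*\S|\s+").unwrap();
    let pieces: Vec<&str> = split_words(&pat, "Hello, world!").collect();
    assert_eq!(pieces, ["Hello", ",", " world", "!"]);
    // Each piece would then be byte-pair encoded independently.
}
```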
+We compared the encoding performance of our encoder with two popular implementations, tiktoken and Huggingface tokenizers.
+
+The benchmark measured the runtime of encoding slices of lengths 10, 100, 1000, and 10000 taken from a random 20000-token original text, using the o200k token set.
+In this benchmark, our own encoder includes a pre-tokenization step so that it produces exactly the same results as the other two.
+(All encodings were computed from scratch for each slice.)
+
+The graph below shows encoding runtime vs. slice length.
+All encoders (except the heap encoder) show the expected linear runtime complexity.
+The backtracking encoder, the fastest encoder that still returns correct results, shows a performance gain of approximately 3.5x over tiktoken.
+The fully dynamic programming solution and the heap implementation are still quite competitive with tiktoken (especially for smaller inputs).
+If the requirement of correct BPE output can be relaxed, then the greedy approach or the minimal encoding approach are the clear winners.
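
For readers who want to reproduce a rough version of this measurement, a hand-rolled timing loop along these lines would do. This is illustrative only: a real benchmark would use a proper harness (e.g. criterion), and byte-length prefixes stand in here for the token-length slices described above; `o200k()` and `count` are the crate API shown earlier in this PR.

```rust
use std::time::Instant;

use bpe_openai::o200k;

fn main() {
    let tok = o200k();
    // ASCII filler text so byte slicing never splits a UTF-8 character.
    let text = "lorem ipsum dolor sit amet ".repeat(1000);
    for len in [10usize, 100, 1000, 10000] {
        let slice = &text[..len.min(text.len())];
        let start = Instant::now();
        let n = tok.count(slice); // encoded from scratch, as in the benchmark
        println!("{len:>5} bytes -> {n:>5} tokens in {:?}", start.elapsed());
    }
}
```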