# V2Index and TokensDFA

> The current Index is a naive implementation: for a given DFA built from a regex, it 'bruteforces'
> each state encountered while walking the graph against every token in the vocabulary in order to build the tokens transitions table.
> This results in a complexity proportional to the size of the model vocabulary, the average size of the tokens in bytes, and the complexity of the regex. (The complexity of a regex will be defined later.)
> The goal of what follows is to build an approach that takes the behaviors of a DFA over regexes and extends them to the token scale, in order to be less burdened by the complexity of regexes and the size of vocabularies.
>
> In the end, the V2Index has much better compile-time performance than its predecessor, serves the list of allowed tokens for each state much faster, and takes up less memory in most cases.
---

## TokensDFA

This new version of the Index includes a TokensDFA object.
The TokensDFA can be seen as an extension of the DFA, in that it leverages the DFA's optimizations to reduce the computational complexity of constructing the tokens transitions table.
The trade-off is to spend time upstream of the construction of the transitions table in order to gain speed during construction.

***Regex's world is a childish world: only 256 different values to manage, each of them one byte in size.
The tokens' world has no limit on the number of different values and no limit on their size. Dante described it as "Malebolge".***

```rust
pub struct TokensDFA {
    pub eos_token_id: u32,
    pub eos_class_id: u32,
    pub start_state: StateId,
    pub final_states: HashSet<StateId>,
    pub transitions_table: MasksTable,
}
```
The structure of the TokensDFA is very similar to the current Index. The difference lies in the initialization.
A series of five optimizations has been implemented:

### 1. Reduce Vocabulary Size

A static analysis of the regex is made in order to build the list of 'dead bytes'.
'Dead bytes' are bytes that are not allowed at any position in the regex.
This lets us quickly discard all the tokens that contain at least one dead byte.
```rust
let byte_classes = dfa.byte_classes();
let mut dead_byte_classes: HashSet<u8> = compile_dead_byte_classes(&muted_regex, &byte_classes);
```
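To make the idea concrete, here is a minimal sketch of the filtering step. The helper name `filter_vocabulary` and the toy vocabulary are assumptions for illustration, not the library's actual API:

```rust
use std::collections::HashSet;

/// Hypothetical helper (not the actual library API): drop every token that
/// contains at least one "dead byte", i.e. a byte that can never appear in
/// any match of the regex.
fn filter_vocabulary<'a>(vocabulary: &[&'a str], dead_bytes: &HashSet<u8>) -> Vec<&'a str> {
    vocabulary
        .iter()
        .copied()
        // Keep a token only if none of its bytes is dead.
        .filter(|token| !token.bytes().any(|b| dead_bytes.contains(&b)))
        .collect()
}
```

Because the check is a simple byte-membership test, this pass is linear in the total byte length of the vocabulary and runs once, before any transition table is built.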
Before going further, one very important thing to know about the DFA is that, when it compiles, it tries to group bytes into classes.
Bytes in the same class have the same effect on the regex's graph.
```regex
"^[a-z]$"
```
In this example, all the chars from 'a' to 'z' have the same class because they trigger the same behavior.
So there are 2 states and only one transition.
Conversely, with the regex `"^a[a-z]$"`, the char 'a' will have a different class than the chars 'b' to 'z',
because only 'a' is allowed as a transition at state 0. So two classes are needed: the one for 'a' and the one for [b-z].
This allows the DFA to drastically reduce the number of transitions by using classes as transition values.
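The byte-class table can be pictured as a 256-entry lookup array. The following toy sketch (an assumption for illustration, not the DFA's real `byte_classes()` representation) shows the classes that `^a[a-z]$` would induce:

```rust
/// Illustrative sketch: a byte-class table maps each of the 256 possible byte
/// values to a small class id. For `^a[a-z]$`, 'a' needs its own class while
/// 'b'..='z' can share one; every other byte falls into a class that leads
/// nowhere in the graph.
fn toy_byte_classes() -> [u8; 256] {
    let mut classes = [2u8; 256]; // class 2: bytes never allowed by the regex
    classes[b'a' as usize] = 0;   // class 0: only 'a' (the mandatory first char)
    for b in b'b'..=b'z' {
        classes[b as usize] = 1;  // class 1: 'b'..'z', all interchangeable
    }
    classes
}
```

With three classes instead of 256 byte values, each DFA state needs at most three outgoing transitions.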

We will use and abuse these classes.

### 2. Tokens Classification

We take the ByteClasses of the DFA and construct the class of each token by concatenating the classes of each of its bytes.
In other words, if the byte range `[a-z]` has the class `[a]`, the token `'man'` will have the class `[a][a][a]`, like all the
tokens of 3 letters.
This reduces the number of distinct values to process: many tokens collapse into the same token class.
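The classification step can be sketched as a per-byte lookup through the byte-class table. The function name `token_class` and the table layout are assumptions for illustration:

```rust
/// Hypothetical sketch: the class of a token is the concatenation of the
/// classes of its bytes, so every token whose bytes map to the same sequence
/// of byte classes collapses into one token class.
fn token_class(token: &str, byte_classes: &[u8; 256]) -> Vec<u8> {
    token.bytes().map(|b| byte_classes[b as usize]).collect()
}
```

Under a table where the whole range [a-z] shares one class, `'man'` and `'dog'` get the same token class, so only one of them needs to be walked through the DFA when building the transitions table.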