Proposal: CompressedIndex, a new Index Structure #185

agourdel opened this issue Feb 28, 2025 · 3 comments

agourdel commented Feb 28, 2025

Hi guys,

I had an idea, so I followed through with it to compare against the current behavior.
Here: https://github.com/agourdel/outlines-core/tree/opt/new-index-struct you can find a version of outlines-core with a new index structure called CompressedIndex.
The details of the structure can be found in ./src/index/compressed_index.README (thanks LLM), but the main idea was: a HashMap is expensive in memory and slow to access when it has to store a lot of transitions and a lot of states. So, what if we stored the allowed tokens for every state in a vector of bitmasks (token_masks)?

I started by adding an abstract layer called IndexVariant between the Guide object and the Index objects, so that Guides can work with different kinds of Index structures.

// ./src/index/mod.rs
pub enum IndexVariant {
    Standard(Index),
    Compressed(CompressedIndex),
}

-----------
// ./src/python_bindings/mod.rs

#[pyclass(name = "Index", module = "outlines_core.outlines_core_rs")]
#[derive(Clone, Debug, PartialEq, Encode, Decode)]
pub struct PyIndex(Arc<IndexVariant>);
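
On the Guide side, the variant just gets matched on. Here is a minimal sketch of the dispatch (the next_state name and signature below are assumptions for illustration, not necessarily the branch's exact API):

impl IndexVariant {
    // Sketch only: forward one transition lookup to whichever index is wrapped.
    pub fn next_state(&self, state: StateId, token_id: TokenId) -> Option<StateId> {
        match self {
            IndexVariant::Standard(index) => index.next_state(state, token_id),
            IndexVariant::Compressed(index) => index.next_state(state, token_id),
        }
    }
}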

Then I built the CompressedIndex structure. I coded it with a standard Index as the constructor parameter because I'm lazy, but it could/should be instantiated from a regex and a vocabulary, the way the standard Index is.

pub struct CompressedIndex {
    initial_state: StateId,
    final_states: HashSet<StateId>,

    pub state_to_index: HashMap<StateId, usize>,
    pub state_offsets: Vec<usize>,
    pub next_states: Vec<StateId>,

    pub token_masks: Vec<Vec<u64>>,

    eos_token_id: TokenId,
    vocab_size: usize,
    transitions: HashMap<StateId, HashMap<TokenId, StateId>>, // Unused, but needed to stay IndexBehavior compliant
}
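
To make the token_masks idea concrete, here is a standalone sketch (my illustration, not the code of the branch) of how a per-state mask can be built and queried. One u64 word covers 64 token ids, so a membership test becomes two array lookups instead of a HashMap probe:

// Standalone sketch (illustration only): building and querying one state's bitmask.
fn build_mask(vocab_size: usize, allowed_tokens: &[u32]) -> Vec<u64> {
    // One u64 word per 64 token ids, rounded up.
    let mut mask = vec![0u64; (vocab_size + 63) / 64];
    for &token_id in allowed_tokens {
        let t = token_id as usize;
        mask[t / 64] |= 1u64 << (t % 64); // set the bit for this token
    }
    mask
}

fn is_allowed(mask: &[u64], token_id: u32) -> bool {
    let t = token_id as usize;
    (mask[t / 64] >> (t % 64)) & 1 == 1 // constant-time membership test
}

In the struct above, token_masks presumably holds one such row per state, looked up through state_to_index.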

Then, I ran benchmarks on a few regex use cases, as follows, with the GPT-2 vocabulary:

let regexes = vec![
        (
            "email", r"[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?",
        ),
        ("simple_phone", r"\+?[1-9][0-9]{7,14}"),
        (
            "complex_phone", r"\+?\d{1,4}?[-.\s]?\(?\d{1,3}?\)?[-.\s]?\d{1,4}[-.\s]?\d{1,4}[-.\s]?\d{1,9}",
        ),
        ("permissive_any", r".{255}$"),
        ("permissive_words", r"[a-zA-Z]{100}"),
        (
            "schema_simple",
            r#"{"type": "object", "properties": {"name": {"type": "string"}, "age": {"type": "integer"}}, "required": ["name", "age"]}"#,
        ),
        (
            "schema_simple_phone",
            r#"{"type": "object", "properties": {"name": {"type": "string"}, "age": {"type": "integer"}, "complexe_phone": {"type": "string", "pattern": "\\+?\\d{1,4}?[-. ]?\\(\\d{1,3}\\)?[-. ]?\\d{1,4}[-. ]?\\d{1,4}[-. ]?\\d{1,9}"}}, "required": ["name", "age", "complexe_phone"]}"#,
        ),
        (
            "schema_complexe",
            r###"{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "title": "Schema for a recording",
  "type": "object",
  "definitions": {
    "artist": {
      "type": "object",
      "properties": {
        "id": {"type": "number"},
        "name": {"type": "string"},
        "functions": {
          "type": "array",
          "items": {"type": "string"}
        }
      },
      "required": ["id", "name", "functions"]
    }
  },
  "properties": {
    "id": {"type": "number"},
    "work": {
      "type": "object",
      "properties": {
        "id": {"type": "number"},
        "name": {"type": "string"},
        "composer": {"$ref": "#/definitions/artist"}
      }
    },
    "recording_artists": {
      "type": "array",
      "items": {"$ref": "#/definitions/artist"}
    }
  },
  "required": ["id", "work", "recording_artists"]
}"###,
        ),
    ];

So: five regexes and three JSON schemas.

First of all, I wanted to know the in-memory size of each index for each regex.
(We are in Rust-land.)

[Image: memory usage of the standard Index vs the CompressedIndex for each regex]

As you can see, the results are a bit mixed, but after investigation it turns out that the determining factor is the transitions-to-states ratio: the higher it is, the more memory the CompressedIndex saves. So the bigger the vocabulary, the bigger the saving, with the break-even point around 1,200 transitions per state.
(The benchmark used is ./src/index/test_bench_memory.rs.)
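
A quick back-of-the-envelope view of why that ratio is the determining factor (my own arithmetic, not taken from the benchmark): a bitmask row has a fixed per-state cost that depends only on the vocabulary size, while HashMap storage grows with the number of transitions stored for the state.

// Rough per-state cost of a bitmask row (illustration, not measured data).
fn main() {
    let vocab_size = 50_257usize;                   // GPT-2 vocabulary size
    let words_per_state = (vocab_size + 63) / 64;   // 786 u64 words
    let bytes_per_state = words_per_state * 8;      // ~6.3 KB, regardless of transition count
    println!("{words_per_state} words ≈ {bytes_per_state} bytes per state");
}

States with only a handful of transitions still pay that full row, which is presumably why DFAs with a low transitions-per-state ratio come out larger in the compressed layout.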

After that, I made the two indexes compete on compute performance.
In Python-land, for each regex I created a Guide with the standard index and a Guide with the compressed index, then made both take exactly the same random path through the DFA.

The times displayed correspond only to the time needed to perform the advance; each advance is one iteration. The technical difference between the two indexes is that the standard one ends each iteration with a list of allowed token ids, while the compressed one ends each iteration with a bitmask over the entire vocabulary, where bit == 1 means the token is allowed.
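
For readers who want to see how the two outputs relate, here is a small illustrative helper (not part of the branch) that expands a bitmask back into the list of allowed token ids that the standard Index returns:

// Illustration only: expand a vocabulary bitmask into the allowed token ids.
fn mask_to_token_ids(mask: &[u64]) -> Vec<u32> {
    let mut ids = Vec::new();
    for (word_idx, &word) in mask.iter().enumerate() {
        let mut bits = word;
        while bits != 0 {
            let bit = bits.trailing_zeros();      // position of the lowest set bit
            ids.push(word_idx as u32 * 64 + bit); // recover the token id
            bits &= bits - 1;                     // clear that bit
        }
    }
    ids
}

In practice the mask would ideally be consumed as-is (e.g. to mask logits), so this expansion is only there to show the equivalence with the standard output.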

[Image: advance-time benchmark results for the standard Index vs the CompressedIndex]

(The benchmark used is ./benchmarks/bench_index_variant.py.)

As you can see, the "by mask" performance of the CompressedIndex is constant whatever regex is used, and the deeper the path taken in the DFA (i.e. the more tokens generated), the more the speed-up ratio favors the CompressedIndex.

So, in conclusion, I think the CompressedIndex, or something like it, should be considered as a possible improvement for the future of outlines-core (for the public repo at least).

If I had to venture further, I would say we could get even better performance by no longer transforming the given input structure into a single regex / single DFA, and instead building an acyclic graph where only some nodes are DFAs for "local" regexes. There is no path dependency between the sub-regex that validates a phone number and the sub-regex that validates an email inside one JSON structure, but compiled as a single regex they create a combinatorial explosion at instantiation: C(R1×R2) instead of just C(R1)+C(R2).

What do you think ?

@cpfiffer
Contributor

This sounds like there's not much cost and potentially enormous speedups. Would you be interested in opening a pull request for this?


agourdel commented Mar 1, 2025

Yes, of course.
Let me clean up some French comments I wrote for myself.
But be warned, it's not ready for production.
It needs proper documentation, comments, a real constructor for the CompressedIndex, and some trade-off decisions regarding the API changes that the CompressedIndex paradigm entails.


agourdel commented Mar 1, 2025

@cpfiffer #187

agourdel closed this as completed Mar 1, 2025
agourdel reopened this Mar 2, 2025