Proposal: CompressedIndex, a new Index Structure #185

agourdel opened this issue Feb 28, 2025 · 3 comments

agourdel commented Feb 28, 2025

Hi guys,

I had an idea, so I followed through with it to compare against the current behavior.
Here: https://github.com/agourdel/outlines-core/tree/opt/new-index-struct you can find a version of outlines-core with a new index structure called CompressedIndex.
The details of the structure can be found in ./src/index/compressed_index.README (thanks LLM), but the main idea was: a HashMap is expensive in memory and slow to access when it has to store a lot of transitions and a lot of states. So, what if we stored the allowed tokens for every state in a vector of bitmasks (token_masks)?

I started by adding an abstract layer called IndexVariant between the Guide object and the Index objects, so that Guides can work with different kinds of Index structures.

// ./src/index/mod.rs
pub enum IndexVariant {
    Standard(Index),
    Compressed(CompressedIndex),
}

-----------
// ./src/python_bindings/mod.rs

#[pyclass(name = "Index", module = "outlines_core.outlines_core_rs")]
#[derive(Clone, Debug, PartialEq, Encode, Decode)]
pub struct PyIndex(Arc<IndexVariant>);
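
On the Guide side, the variant just gets matched on. Here is a minimal sketch of the dispatch (the next_state name and signature below are assumptions for illustration, not necessarily the branch's exact API):

impl IndexVariant {
    // Sketch only: forward one transition lookup to whichever index is wrapped.
    pub fn next_state(&self, state: StateId, token_id: TokenId) -> Option<StateId> {
        match self {
            IndexVariant::Standard(index) => index.next_state(state, token_id),
            IndexVariant::Compressed(index) => index.next_state(state, token_id),
        }
    }
}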

Then I built the CompressedIndex structure. I coded it with a standard Index as the constructor parameter because I'm lazy, but it could/should be instantiated from a regex and a vocabulary, the way the standard Index is.

pub struct CompressedIndex {
    initial_state: StateId,
    final_states: HashSet<StateId>,

    pub state_to_index: HashMap<StateId, usize>,
    pub state_offsets: Vec<usize>,
    pub next_states: Vec<StateId>,

    pub token_masks: Vec<Vec<u64>>,

    eos_token_id: TokenId,
    vocab_size: usize,
    transitions: HashMap<StateId, HashMap<TokenId, StateId>>, // Unused, but needed to stay IndexBehavior compliant
}
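
To make the token_masks idea concrete, here is a standalone sketch (my illustration, not the code of the branch) of how a per-state mask can be built and queried. One u64 word covers 64 token ids, so a membership test becomes two array lookups instead of a HashMap probe:

// Standalone sketch (illustration only): building and querying one state's bitmask.
fn build_mask(vocab_size: usize, allowed_tokens: &[u32]) -> Vec<u64> {
    // One u64 word per 64 token ids, rounded up.
    let mut mask = vec![0u64; (vocab_size + 63) / 64];
    for &token_id in allowed_tokens {
        let t = token_id as usize;
        mask[t / 64] |= 1u64 << (t % 64); // set the bit for this token
    }
    mask
}

fn is_allowed(mask: &[u64], token_id: u32) -> bool {
    let t = token_id as usize;
    (mask[t / 64] >> (t % 64)) & 1 == 1 // constant-time membership test
}

In the struct above, token_masks presumably holds one such row per state, looked up through state_to_index.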

Then, I ran benchmarks on a few regex use cases, as follows, with the GPT-2 vocabulary:

let regexes = vec![
        (
            "email", r"[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?",
        ),
        ("simple_phone", r"\+?[1-9][0-9]{7,14}"),
        (
            "complex_phone", r"\+?\d{1,4}?[-.\s]?\(?\d{1,3}?\)?[-.\s]?\d{1,4}[-.\s]?\d{1,4}[-.\s]?\d{1,9}",
        ),
        ("permissive_any", r".{255}$"),
        ("permissive_words", r"[a-zA-Z]{100}"),
        (
            "schema_simple",
            r#"{"type": "object", "properties": {"name": {"type": "string"}, "age": {"type": "integer"}}, "required": ["name", "age"]}"#,
        ),
        (
            "schema_simple_phone",
            r#"{"type": "object", "properties": {"name": {"type": "string"}, "age": {"type": "integer"}, "complexe_phone": {"type": "string", "pattern": "\\+?\\d{1,4}?[-. ]?\\(\\d{1,3}\\)?[-. ]?\\d{1,4}[-. ]?\\d{1,4}[-. ]?\\d{1,9}"}}, "required": ["name", "age", "complexe_phone"]}"#,
        ),
        (
            "schema_complexe",
            r###"{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "title": "Schema for a recording",
  "type": "object",
  "definitions": {
    "artist": {
      "type": "object",
      "properties": {
        "id": {"type": "number"},
        "name": {"type": "string"},
        "functions": {
          "type": "array",
          "items": {"type": "string"}
        }
      },
      "required": ["id", "name", "functions"]
    }
  },
  "properties": {
    "id": {"type": "number"},
    "work": {
      "type": "object",
      "properties": {
        "id": {"type": "number"},
        "name": {"type": "string"},
        "composer": {"$ref": "#/definitions/artist"}
      }
    },
    "recording_artists": {
      "type": "array",
      "items": {"$ref": "#/definitions/artist"}
    }
  },
  "required": ["id", "work", "recording_artists"]
}"###,
        ),
    ];

So: five regexes and three JSON schemas.

First of all, I wanted to know the in-memory size of each index for each regex.
(We are in Rust-land.)

[Image: memory usage of the standard Index vs the CompressedIndex for each regex]

As you can see, the results are a bit mixed, but after investigation it turns out that the determining factor is the transitions-to-states ratio: the higher it is, the more memory the CompressedIndex saves. So the bigger the vocabulary, the bigger the saving, with the break-even point around 1,200 transitions per state.
(The benchmark used is ./src/index/test_bench_memory.rs.)
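
A quick back-of-the-envelope view of why that ratio is the determining factor (my own arithmetic, not taken from the benchmark): a bitmask row has a fixed per-state cost that depends only on the vocabulary size, while HashMap storage grows with the number of transitions stored for the state.

// Rough per-state cost of a bitmask row (illustration, not measured data).
fn main() {
    let vocab_size = 50_257usize;                   // GPT-2 vocabulary size
    let words_per_state = (vocab_size + 63) / 64;   // 786 u64 words
    let bytes_per_state = words_per_state * 8;      // ~6.3 KB, regardless of transition count
    println!("{words_per_state} words ≈ {bytes_per_state} bytes per state");
}

States with only a handful of transitions still pay that full row, which is presumably why DFAs with a low transitions-per-state ratio come out larger in the compressed layout.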

After that, I made the two indexes compete on compute performance.
In Python-land, for each regex I created a Guide with the standard index and a Guide with the compressed index, then made both take exactly the same random path through the DFA.

The times displayed correspond only to the time needed to perform the advance; each advance is one iteration. The technical difference between the two indexes is that the standard one ends each iteration with a list of allowed token ids, while the compressed one ends each iteration with a bitmask over the entire vocabulary, where bit == 1 means the token is allowed.
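
For readers who want to see how the two outputs relate, here is a small illustrative helper (not part of the branch) that expands a bitmask back into the list of allowed token ids that the standard Index returns:

// Illustration only: expand a vocabulary bitmask into the allowed token ids.
fn mask_to_token_ids(mask: &[u64]) -> Vec<u32> {
    let mut ids = Vec::new();
    for (word_idx, &word) in mask.iter().enumerate() {
        let mut bits = word;
        while bits != 0 {
            let bit = bits.trailing_zeros();      // position of the lowest set bit
            ids.push(word_idx as u32 * 64 + bit); // recover the token id
            bits &= bits - 1;                     // clear that bit
        }
    }
    ids
}

In practice the mask would ideally be consumed as-is (e.g. to mask logits), so this expansion is only there to show the equivalence with the standard output.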

[Image: advance-time benchmark results for the standard Index vs the CompressedIndex]

(The benchmark used is ./benchmarks/bench_index_variant.py.)

As you can see, the "by mask" performance of the CompressedIndex is constant whatever regex is used, and the deeper the path taken in the DFA (i.e. the more tokens generated), the more the speed-up ratio favors the CompressedIndex.

So, in conclusion, I think the CompressedIndex, or something like it, should be considered as a possible improvement for the future of outlines-core (for the public repo at least).

If I had to venture further, I would say we could get even better performance by no longer transforming the given input structure into a single regex / single DFA, and instead building an acyclic graph where only some nodes are DFAs for "local" regexes. There is no path dependency between the sub-regex that validates a phone number and the sub-regex that validates an email inside one JSON structure, but compiled as a single regex they create a combinatorial explosion at instantiation: C(R1×R2) instead of just C(R1)+C(R2).

What do you think ?

@cpfiffer
Contributor

This sounds like there's not much cost and potentially enormous speedups. Would you be interested in opening a pull request for this?


agourdel commented Mar 1, 2025

Yes, of course.
Let me clean up some French comments I wrote for myself.
But be warned, it's not ready for production.
It needs proper documentation, comments, a real constructor for the CompressedIndex, and some trade-off decisions regarding the API changes that the CompressedIndex paradigm entails.


agourdel commented Mar 1, 2025

@cpfiffer #187

agourdel closed this as completed Mar 1, 2025
agourdel reopened this Mar 2, 2025