Description
Discussed in #192
Originally posted by joerunde March 5, 2025
cc @cpfiffer, following up on the discussion from #180
My colleague @frankdvd has done some analysis of the time it takes to build a RegexGuide
as a function of the complexity of the input json schema, which I'll paste below. We also saw similar results using the Vocabulary, Index, Guide
method shown in the readme.
It looks like we're doing something wrong, since it's taking quite a long time to compile these guides for relatively simple json schemas. When we try the same thing with lm-format-enforcer,
we don't see a meaningful increase in the time taken to compile a token enforcer for the same schemas.
On the related issue where we had a user passing in an invalid schema, you had mentioned "I was able to get this to compile in 80-90ms using our internal API." I had assumed you were talking about outlines-core,
but maybe not? If there is a super secret sauce that y'all aren't open sourcing, I'd be happy to email you about it, but I did want to have this discussion in the open in case we're just using outlines-core
wrong.
Here's the analysis:
I did some experiments to measure the performance of outlines
and to find the root cause of the hang/slowness. At first I thought some special keyword or syntax was causing the hang, so I wasted some time troubleshooting in that direction. I eventually realized that the root cause is very simple: the number of fields in an object exponentially affects the computation time.
Test 1: # of fields in one object vs. time
Example schema when n = 3:
{
  "type": "object",
  "properties": {
    "w1": {
      "type": "string"
    },
    "w2": {
      "type": "string"
    },
    "w3": {
      "type": "string"
    }
  }
}
We can see that the time grows exponentially as n increases.
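For reference, the wide test schemas can be generated for any n with a small helper like this (a sketch equivalent to the n_widths helper in the repro script at the end; the function name here is mine):

```python
import json

def width_schema(n: int) -> dict:
    """Object schema with n top-level string fields w1..wn."""
    return {
        "type": "object",
        "properties": {f"w{i}": {"type": "string"} for i in range(1, n + 1)},
    }

# n = 3 reproduces the example schema above
print(json.dumps(width_schema(3), indent=2))
```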
Test 2: # of nested json objects (depth) vs. time
Example schema when n = 3:
{
  "type": "object",
  "properties": {
    "d1": {
      "type": "object",
      "description": "simple obj",
      "properties": {
        "d2": {
          "type": "object",
          "description": "simple obj",
          "properties": {
            "d3": {
              "type": "object",
              "description": "simple obj",
              "properties": {}
            }
          }
        }
      }
    }
  }
}
We can see that the time grows linearly as n increases, so the depth is not the root cause.
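The depth-nested schemas can likewise be generated programmatically; a minimal sketch (mirroring the n_depths helper in the repro script at the end, with my own naming):

```python
def depth_schema(n: int) -> dict:
    """Schema nesting n empty objects d1..dn inside one another."""
    schema = {"type": "object", "properties": {}}
    node = schema
    for i in range(1, n + 1):
        child = {"type": "object", "description": "simple obj", "properties": {}}
        node["properties"][f"d{i}"] = child
        node = child
    return schema
```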
Test 3: total number of fields in a depth-2 json where each nested object contains 6 fields vs. time
Example schema when n = 12:
{
  "type": "object",
  "properties": {
    "w1": {
      "type": "object",
      "properties": {
        "w1": {
          "type": "string"
        },
        "w2": {
          "type": "string"
        },
        "w3": {
          "type": "string"
        },
        "w4": {
          "type": "string"
        },
        "w5": {
          "type": "string"
        },
        "w6": {
          "type": "string"
        }
      }
    },
    "w2": {
      "type": "object",
      "properties": {
        "w1": {
          "type": "string"
        },
        "w2": {
          "type": "string"
        },
        "w3": {
          "type": "string"
        },
        "w4": {
          "type": "string"
        },
        "w5": {
          "type": "string"
        },
        "w6": {
          "type": "string"
        }
      }
    }
  }
}
Test 4: total number of fields in a depth-2 json where each nested object contains 3 fields vs. time
Example schema when n = 9:
{
  "type": "object",
  "properties": {
    "w1": {
      "type": "object",
      "properties": {
        "w1": {
          "type": "string"
        },
        "w2": {
          "type": "string"
        },
        "w3": {
          "type": "string"
        }
      }
    },
    "w2": {
      "type": "object",
      "properties": {
        "w1": {
          "type": "string"
        },
        "w2": {
          "type": "string"
        },
        "w3": {
          "type": "string"
        }
      }
    },
    "w3": {
      "type": "object",
      "properties": {
        "w1": {
          "type": "string"
        },
        "w2": {
          "type": "string"
        },
        "w3": {
          "type": "string"
        }
      }
    }
  }
}
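The two-level schemas used in tests 3 and 4 follow the same pattern; a sketch of a generator (function name and parameters are mine):

```python
def grid_schema(outer: int, inner: int) -> dict:
    """Depth-2 schema: `outer` nested objects, each holding `inner`
    string fields. Total leaf fields = outer * inner."""
    def strings(n):
        return {f"w{i}": {"type": "string"} for i in range(1, n + 1)}
    return {
        "type": "object",
        "properties": {
            f"w{i}": {"type": "object", "properties": strings(inner)}
            for i in range(1, outer + 1)
        },
    }
```

For example, `grid_schema(2, 6)` reproduces the Test 3 example (n = 12 leaf fields) and `grid_schema(3, 3)` the Test 4 example (n = 9).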
Conclusion
If we want to control the size of the input to avoid a timeout or hang, regex/guide generation needs to take less than 30s (my guess):
According to tests 1/3/4, the maximum number of fields in each object must be less than 9, and the total number of fields in the whole json needs to be less than 20.
I'm not sure how many tools/APIs would still be usable if the above restrictions were applied to our API...
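If one wanted to enforce limits like these before compiling a guide, a pre-check could walk the schema and count properties. This is only a sketch of the idea: the thresholds are the guesses from above, not validated numbers, and this version counts every `properties` key, including the nested objects themselves:

```python
def within_field_limits(schema: dict,
                        max_per_object: int = 8,
                        max_total: int = 19) -> bool:
    """Return True if no object has more than max_per_object fields
    and the whole schema has at most max_total fields overall."""
    total = 0
    stack = [schema]
    while stack:
        node = stack.pop()
        props = node.get("properties", {})
        if len(props) > max_per_object:
            return False
        total += len(props)
        # Recurse into nested schemas (objects, nested objects, etc.)
        stack.extend(v for v in props.values() if isinstance(v, dict))
    return total <= max_total
```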
Python code to reproduce this:
import outlines
from outlines_core.fsm.json_schema import build_regex_from_schema
from outlines.fsm.guide import RegexGuide
import json
from transformers import AutoTokenizer
import time
import copy
import matplotlib.pyplot as plt
import numpy as np
tk = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
outlines_tokenizer = outlines.models.TransformerTokenizer(tk)

simple_field = {
    "type": "string"
}
simple_obj = {
    "type": "object",
    "description": "simple obj",
    "properties": {}
}
root_schema = {
    "type": "object",
    "properties": {}
}
def generateTime(n, generateJsonObj):
    # Build schemas of increasing size and time regex + guide compilation.
    t = []
    for i in range(n + 1):
        schema_str = json.dumps(generateJsonObj(i))
        # print(schema_str)
        start_time = time.time()
        regex = build_regex_from_schema(schema_str)
        guide = RegexGuide.from_regex(regex, outlines_tokenizer)
        end_time = time.time()
        elapsed_time = end_time - start_time
        t.append(elapsed_time)
        print(f"Elapsed time to build regex + guide: {elapsed_time:.4f} seconds")
    # The n = 0 run is only used for warmup; drop it.
    del t[0]
    return t
def plot(n, t, factor, name):
    # Data for plotting
    x = np.arange(1, n + 1) * factor
    fig, ax = plt.subplots()
    ax.plot(x, t)
    ax.set(xlabel='n', ylabel='regex build time', title=name)
    ax.grid()
    fig.savefig(name + ".png")
def n_widths(base):
    # Returns f(n): a schema with n copies of `base` as fields w1..wn.
    def f(n):
        schema = copy.deepcopy(root_schema)
        for i in range(1, n + 1):
            schema["properties"]["w" + str(i)] = base
        return schema
    return f

def n_depths(base):
    # Returns f(n): a schema nesting n copies of `base` as d1..dn.
    def f(n):
        schema = copy.deepcopy(root_schema)
        root = schema
        for i in range(1, n + 1):
            root["properties"]["d" + str(i)] = copy.deepcopy(base)
            root = root["properties"]["d" + str(i)]
        return schema
    return f
# test width
max_width = 10
t = generateTime(max_width, n_widths(simple_field))
plot(max_width, t, 1, "num_fields_in_width_vs_time")
print("num_fields_in_width_vs_time json example:")
print(json.dumps(n_widths(simple_field)(3)))
# test depth
max_depth = 50
t = generateTime(max_depth, n_depths(simple_obj))
plot(max_depth, t, 1, "num_fields_in_depth_vs_time")
print("num_fields_in_depth_vs_time json example:")
print(json.dumps(n_depths(simple_obj)(3)))
# test n by 6 json
max_width = 5
base_json = n_widths(simple_field)(6)
t = generateTime(max_width, n_widths(base_json))
plot(max_width, t, 6, "total_num_fields_with_6_fields_base_vs_time")
print("total_num_fields_with_6_fields_base_vs_time json example:")
print(json.dumps(n_widths(base_json)(3)))
# test n by 3 json
max_width = 8
base_json = n_widths(simple_field)(3)
t = generateTime(max_width, n_widths(base_json))
plot(max_width, t, 3, "total_num_fields_with_3_fields_base_vs_time")
print("total_num_fields_with_3_fields_base_vs_time json example:")
print(json.dumps(n_widths(base_json)(3)))