Description
Describe the issue as clearly as possible:
Issue created together with @mcognetta and @jylee-k
Outlines does not always allow generating the canonical tokenization of a text for tokenizers that prepend special symbols to their inputs, which includes SentencePiece-based tokenizers (Llama, Phi, etc.).
We noticed this while using generate.choice, but it is likely applicable to other downstream constrained generation scenarios.
Specifically, for the model microsoft/Phi-3.5-mini-instruct, the label 'Pizza' is tokenized to [(349, '▁P'), (24990, 'izza')], but the underlying automaton prevents generating (349, '▁P') as the first token while allowing every other way to start a word with P, e.g., (29925, 'P'), (12197, 'Pi'), or (11868, 'Pa').
We tried several things like adding a space before the labels (when constructing the choice generation object) or adding/removing spaces from the prompt, but none fully resolved the issue.
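For illustration, the variations we tried looked roughly like the following (the prompt string is made up, and model / labels are as defined in the reproduction below):

import outlines

# Variation 1: prepend a space to each label when building the choice generator
gen_choice = outlines.generate.choice(model, [" " + label for label in labels])

# Variation 2: add/remove a trailing space in the prompt
print(gen_choice("What should we eat tonight? Answer: "))  # trailing space
print(gen_choice("What should we eat tonight? Answer:"))   # no trailing space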
Our code is based on the example from the README: https://github.com/dottxt-ai/outlines?tab=readme-ov-file#multiple-choices
Steps/code to reproduce the bug:
import outlines
def encode_text(model, raw):
    tokenizer = model.tokenizer.tokenizer
    if isinstance(raw, str):
        raw = [raw]
    token_ids = tokenizer(raw, add_special_tokens=False)["input_ids"]
    tokens = [
        tokenizer.convert_ids_to_tokens(t_ids)
        for t_ids in token_ids
    ]
    return [
        list(zip(t_ids, ts))
        for t_ids, ts in zip(token_ids, tokens)
    ]
model = outlines.models.transformers("microsoft/Phi-3.5-mini-instruct")
labels = ["Pizza", "Pasta", "Salad", "Dessert"]
# How are our labels tokenized?
encoded_labels = encode_text(model, labels)
# > [[(349, '▁P'), (24990, 'izza')], [(349, '▁P'), (5427, 'asta')], [(3956, '▁Sal'), (328, 'ad')], [(360, '▁D'), (404, 'ess'), (814, 'ert')]]
gen_choice = outlines.generate.choice(model, labels)
automaton = gen_choice.logits_processor.guide.get_index_dict()
# Which tokens are "allowed" in the first generation step?
allowed_token_ids = list(automaton[0].keys())
allowed_tokens = model.tokenizer.tokenizer.convert_ids_to_tokens(allowed_token_ids)
list(zip(allowed_token_ids, allowed_tokens))
# > [(17618, 'Sa'), (71, '<0x44>'), (29925, 'P'), (12197, 'Pi'), (11868, 'Pa'), (86, '<0x53>'), (29903, 'S'), (29928, 'D'), (4002, 'Des'), (83, '<0x50>'), (2772, 'De'), (20392, 'Sal')]
# Notice none of the first tokens of the label encodings are present.
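# As a quick sanity check (ours, not part of the README example), intersect the
# canonical first-token ids of the labels with the allowed first-step ids:
canonical_first_ids = {encoding[0][0] for encoding in encoded_labels}
print(canonical_first_ids & set(allowed_token_ids))
# > set()  (the '▁'-prefixed first tokens 349, 3956, and 360 are all absent)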
# Looking at the final token probabilities/log-probabilities,
# we observe that the spaced first tokens are not represented
# > [((29928, 'D'), 0.6051735877990723, -0.5022398829460144),
# >  ((29925, 'P'), 0.2431625872850418, -1.414025068283081),
# >  ((29903, 'S'), 0.1414429396390915, -1.9558589458465576),
# >  ((2772, 'De'), 0.004034874960780144, -5.512779712677002),
# >  ((4002, 'Des'), 0.003415965009480715, -5.679295063018799),
# >  ((17618, 'Sa'), 0.002744789468124509, -5.898050785064697),
# >  ((20392, 'Sal'), 2.248118289571721e-05, -10.702832221984863),
# >  ((11868, 'Pa'), 2.7291971491649747e-06, -12.811503410339355),
# >  ((12197, 'Pi'), 2.0345812146160824e-08, -17.710390090942383),
# >  ((83, '<0x50>'), 5.7030637989896604e-09, -18.982261657714844),
# >  ((86, '<0x53>'), 5.697236460378008e-09, -18.98328399658203),
# >  ((71, '<0x44>'), 5.690426796434167e-09, -18.984479904174805)]
Expected result:
A constrained generation library should restrict generation to the set of token sequences that lead to valid surface forms; in particular, the canonical tokenization of each choice should be generable.
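To make this concrete, here is a rough check we would expect to pass, assuming (as in the reproduction above) that state 0 is the initial state and that get_index_dict() maps each state to a {token_id: next_state} dictionary; the helper name is ours:

def canonical_path_allowed(index_dict, token_ids, start_state=0):
    # Walk the token ids through the guide's transition table and report
    # whether every step is an allowed transition.
    state = start_state
    for token_id in token_ids:
        transitions = index_dict.get(state, {})
        if token_id not in transitions:
            return False
        state = transitions[token_id]
    return True

for label, encoding in zip(labels, encoded_labels):
    ids = [token_id for token_id, _ in encoding]
    print(label, canonical_path_allowed(automaton, ids))
# We would expect True for every label; with the automaton above, all four
# labels fail at their first, '▁'-prefixed token.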
Error message:
Outlines/Python version information:
Version information
outlines==0.1.14
Context for the issue:
At this point, we'd like to mention that we might be missing some parameter(s) resulting in this behavior. Please let us know if that is the case :)
How to possibly fix the issue:
- The arguably most popular case, SentencePiece-based tokenizers, could be detected (for the Transformers library) by checking whether the tokenizer has an "sp_model_kwargs" property, and the regex could be modified accordingly.
- Adding spaces in front of the labels (lists or enums) internally, before the regexes are generated, could mitigate this difference between tokenizers for generate.choice while remaining invisible to the user (see the sketch below).
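As a rough sketch of the second idea (the helper and its placement are hypothetical, not actual Outlines internals):

import re

def build_choice_regex(labels, hf_tokenizer):
    # Hypothetical helper: if the tokenizer looks SentencePiece-based
    # (e.g., it exposes `sp_model_kwargs`), prepend a space to each label so
    # that the '▁'-prefixed canonical tokenization also matches the regex.
    if hasattr(hf_tokenizer, "sp_model_kwargs"):
        labels = [" " + label for label in labels]
    return "(" + "|".join(re.escape(label) for label in labels) + ")"

The leading space could then be stripped from the returned string so the change stays invisible to the user.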