Constrained generation does not account for tokenizers prepending special symbols #188


Open
david-pohl opened this issue Feb 27, 2025 · 1 comment
Labels
bug Something isn't working

Comments

@david-pohl

Describe the issue as clearly as possible:

Issue created together with @mcognetta and @jylee-k

Outlines does not always allow generating the canonical tokenization of a text for tokenizers that prepend special symbols to their inputs, which includes SentencePiece-based tokenizers (Llama, Phi, etc.).

We noticed this while using generate.choice, but it likely applies to other downstream constrained generation scenarios as well.

Specifically, for the model microsoft/Phi-3.5-mini-instruct, the label 'Pizza' is tokenized as [(349, '▁P'), (24990, 'izza')], yet the underlying automaton prevents generating (349, '▁P') as the first token while allowing every other way to begin a word starting with P, e.g., (29925, 'P'), (12197, 'Pi'), or (11868, 'Pa').
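The '▁' prefix comes from the tokenizer itself, not from Outlines: SentencePiece-style tokenizers behave as if a space were prepended to the input. A minimal sketch (plain transformers, no Outlines involved; the output matches the tokenization shown above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct")

# Even without special tokens, the first token carries the '▁' prefix,
# since the tokenizer implicitly prepends a space to its input.
ids = tokenizer("Pizza", add_special_tokens=False)["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))  # > ['▁P', 'izza']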

We tried several workarounds, such as adding a space before the labels (when constructing the choice generator) and adding/removing spaces in the prompt, but none fully resolved the issue.
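For reference, the first of those attempts looked roughly like this (a sketch only; model and labels as defined in the repro below):

# Prepend a space so the regex could, in principle, match the '▁'-prefixed tokens.
gen_choice = outlines.generate.choice(model, [" " + label for label in labels])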

Our code is based on the example from the README: https://github.com/dottxt-ai/outlines?tab=readme-ov-file#multiple-choices

Steps/code to reproduce the bug:

import outlines

def encode_text(model, raw):
    """Return, for each input string, a list of (token_id, token) pairs."""
    tokenizer = model.tokenizer.tokenizer

    if isinstance(raw, str):
        raw = [raw]

    # Tokenize without special tokens, then recover the surface form of each id.
    token_ids = tokenizer(raw, add_special_tokens=False)["input_ids"]
    tokens = [
        tokenizer.convert_ids_to_tokens(t_ids)
        for t_ids in token_ids
    ]

    return [
        list(zip(t_ids, ts))
        for t_ids, ts in zip(token_ids, tokens)
    ]

model = outlines.models.transformers("microsoft/Phi-3.5-mini-instruct")
labels = ["Pizza", "Pasta", "Salad", "Dessert"]

# How are our labels tokenized?
encoded_labels = encode_text(model, labels)

# > [[(349, '▁P'), (24990, 'izza')], [(349, '▁P'), (5427, 'asta')], [(3956, '▁Sal'), (328, 'ad')], [(360, '▁D'), (404, 'ess'), (814, 'ert')]]

gen_choice = outlines.generate.choice(model, labels)

automaton = gen_choice.logits_processor.guide.get_index_dict()

# Which tokens are "allowed" in the first generation step?
allowed_token_ids = list(automaton[0].keys())
allowed_tokens = model.tokenizer.tokenizer.convert_ids_to_tokens(allowed_token_ids)

list(zip(allowed_token_ids, allowed_tokens))
# > [(17618, 'Sa'), (71, '<0x44>'), (29925, 'P'), (12197, 'Pi'), (11868, 'Pa'), (86, '<0x53>'), (29903, 'S'), (29928, 'D'), (4002, 'Des'), (83, '<0x50>'), (2772, 'De'), (20392, 'Sal')]

# Notice none of the first tokens of the label encodings are present.
# Looking at the final token probabilities/log-probabilities below,
# we observe that the '▁'-prefixed first tokens are likewise absent:

[((29928, 'D'), 0.6051735877990723, -0.5022398829460144),
 ((29925, 'P'), 0.2431625872850418, -1.414025068283081),
 ((29903, 'S'), 0.1414429396390915, -1.9558589458465576),
 ((2772, 'De'), 0.004034874960780144, -5.512779712677002),
 ((4002, 'Des'), 0.003415965009480715, -5.679295063018799),
 ((17618, 'Sa'), 0.002744789468124509, -5.898050785064697),
 ((20392, 'Sal'), 2.248118289571721e-05, -10.702832221984863),
 ((11868, 'Pa'), 2.7291971491649747e-06, -12.811503410339355),
 ((12197, 'Pi'), 2.0345812146160824e-08, -17.710390090942383),
 ((83, '<0x50>'), 5.7030637989896604e-09, -18.982261657714844),
 ((86, '<0x53>'), 5.697236460378008e-09, -18.98328399658203),
 ((71, '<0x44>'), 5.690426796434167e-09, -18.984479904174805)]

Expected result:

A constrained generation library should restrict generation to token sequences that lie on a path to a valid surface form; in particular, the canonical tokenization should be generable.
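Concretely, reusing the automaton from the repro above, we would expect the following check to pass (it currently fails):

# Token 349 ('▁P') starts the canonical tokenization of "Pizza",
# so it should be allowed from the automaton's start state.
assert 349 in automaton[0]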

Error message:

Outlines/Python version information:

Version information

Python 3.9.7

outlines==0.1.14

Context for the issue:

Of course, we might be missing some parameter(s) that would explain this behavior; please let us know if that is the case :)

How to possibly fix the issue:

  1. The arguably most common case, SentencePiece-based tokenizers, could be detected (for the Transformers library) by checking whether the tokenizer has an "sp_model_kwargs" property, and the regex could be modified accordingly (see the sketch after this list).
  2. Adding spaces in front of the labels (lists or enums) internally, before generating the regexes, could mitigate this difference between tokenizers for generate.choice while not being visible to the user.
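To make idea 1 more concrete, here is a rough sketch of what the detection and regex adjustment could look like (hypothetical helper functions, not the Outlines API; the optional leading space assumes Outlines maps '▁' to a space when building the automaton):

import re

def uses_sentencepiece(tokenizer) -> bool:
    # SentencePiece-backed tokenizers in Transformers expose sp_model_kwargs.
    return hasattr(tokenizer, "sp_model_kwargs")

def choice_regex(labels, tokenizer) -> str:
    inner = "|".join(re.escape(label) for label in labels)
    if uses_sentencepiece(tokenizer):
        # Allow an optional leading space so the canonical,
        # '▁'-prefixed tokenization remains reachable.
        return " ?(" + inner + ")"
    return "(" + inner + ")"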
@david-pohl david-pohl added the bug Something isn't working label Feb 27, 2025
@RobinPicard RobinPicard transferred this issue from dottxt-ai/outlines Mar 4, 2025
@RobinPicard
Contributor

Thanks a lot for such a detailed issue @david-pohl! I transferred it to outlines-core as it's where the operations related to token selection happen.
