
Constrained generation does not account for tokenizers prepending special symbols #188

Open
@david-pohl

Description


Describe the issue as clearly as possible:

Issue created together with @mcognetta and @jylee-k

Outlines does not always allow generating the canonical tokenization of text for classes of tokenizers that prepend special symbols to their inputs, which includes SentencePiece-based tokenizers (Llama, Phi, etc.).

We noticed this while using generate.choice, but it likely applies to other downstream constrained generation scenarios as well.

Specifically, for the model microsoft/Phi-3.5-mini-instruct, the label 'Pizza' is tokenized to [(349, '▁P'), (24990, 'izza')], yet the underlying automaton prevents generating (349, '▁P') as the first token while allowing all other ways to begin a word starting with P, e.g., (29925, 'P'), (12197, 'Pi'), or (11868, 'Pa').

We tried several things, such as adding a space before the labels (when constructing the choice generator) and adding/removing spaces from the prompt, but none of them fully resolved the issue (see the sketch below).
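
For reference, a minimal sketch of those attempts (the prompt is illustrative; neither variant changed the outcome):

import outlines

model = outlines.models.transformers("microsoft/Phi-3.5-mini-instruct")
labels = ["Pizza", "Pasta", "Salad", "Dessert"]

# Attempt 1: prepend a space to each label when constructing the choice
# generator, hoping the regex then admits the space-prefixed first tokens.
gen_choice = outlines.generate.choice(model, [" " + label for label in labels])

# Attempt 2: vary the whitespace at the end of the prompt.
answer = gen_choice("Which dish should we order?")  # vs. "Which dish should we order? "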

Our code is based on the example from the README: https://github.com/dottxt-ai/outlines?tab=readme-ov-file#multiple-choices

Steps/code to reproduce the bug:

import outlines

def encode_text(model, raw):
    """Return (token_id, token_string) pairs for each input string."""
    tokenizer = model.tokenizer.tokenizer

    if isinstance(raw, str):
        raw = [raw]

    # Tokenize without BOS/EOS so we see only the label's own tokens.
    token_ids = tokenizer(raw, add_special_tokens=False)["input_ids"]
    tokens = [
        tokenizer.convert_ids_to_tokens(t_ids)
        for t_ids in token_ids
    ]

    return [
        list(zip(t_ids, ts))
        for t_ids, ts in zip(token_ids, tokens)
    ]

model = outlines.models.transformers("microsoft/Phi-3.5-mini-instruct")
labels = ["Pizza", "Pasta", "Salad", "Dessert"]

# How are our labels tokenized?
encoded_labels = encode_text(model, labels)

# > [[(349, '▁P'), (24990, 'izza')], [(349, '▁P'), (5427, 'asta')], [(3956, '▁Sal'), (328, 'ad')], [(360, '▁D'), (404, 'ess'), (814, 'ert')]]

gen_choice = outlines.generate.choice(model, labels)

automaton = gen_choice.logits_processor.guide.get_index_dict()

# Which tokens are "allowed" in the first generation step?
allowed_token_ids = list(automaton[0].keys())
allowed_tokens = model.tokenizer.tokenizer.convert_ids_to_tokens(allowed_token_ids)

list(zip(allowed_token_ids, allowed_tokens))
# > [(17618, 'Sa'), (71, '<0x44>'), (29925, 'P'), (12197, 'Pi'), (11868, 'Pa'), (86, '<0x53>'), (29903, 'S'), (29928, 'D'), (4002, 'Des'), (83, '<0x50>'), (2772, 'De'), (20392, 'Sal')]

# Notice that none of the canonical first tokens of the label encodings are present.
# Looking at the final token probabilities/log-probabilities,
# we observe that the space-prefixed ('▁') first tokens are not represented:

[((29928, 'D'), 0.6051735877990723, -0.5022398829460144),
 ((29925, 'P'), 0.2431625872850418, -1.414025068283081),
 ((29903, 'S'), 0.1414429396390915, -1.9558589458465576),
 ((2772, 'De'), 0.004034874960780144, -5.512779712677002),
 ((4002, 'Des'), 0.003415965009480715, -5.679295063018799),
 ((17618, 'Sa'), 0.002744789468124509, -5.898050785064697),
 ((20392, 'Sal'), 2.248118289571721e-05, -10.702832221984863),
 ((11868, 'Pa'), 2.7291971491649747e-06, -12.811503410339355),
 ((12197, 'Pi'), 2.0345812146160824e-08, -17.710390090942383),
 ((83, '<0x50>'), 5.7030637989896604e-09, -18.982261657714844),
 ((86, '<0x53>'), 5.697236460378008e-09, -18.98328399658203),
 ((71, '<0x44>'), 5.690426796434167e-09, -18.984479904174805)]

Expected result:

A constrained generation library should restrict the generated tokens to those sequences that lead to valid surface forms. In particular, the canonical tokenization should always be generable.
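
Concretely, continuing the reproduction session above, we would expect checks like the following to pass (a hypothetical assertion, not Outlines API):

# (349, '▁P') is the canonical first token of both 'Pizza' and 'Pasta'
# and should be reachable in the first generation step.
assert 349 in automaton[0]

# More generally, every label's canonical first token should be allowed:
for encoding in encoded_labels:
    first_token_id = encoding[0][0]
    assert first_token_id in automaton[0]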

Error message:

Outlines/Python version information:

Version information

Python 3.9.7

outlines==0.1.14

Context for the issue:

At this point, we'd like to mention that we might be missing some parameter(s) that would explain this behavior. Please let us know if that is the case :)

How to possibly fix the issue:

  1. The arguably most common case, SentencePiece-based tokenizers, could be detected (for the Transformers library) by checking whether the tokenizer has an "sp_model_kwargs" property, and the regex could be modified accordingly.
  2. Spaces could be added in front of the labels (lists or enums) internally, before generating the regexes; this would mitigate the difference between tokenizers for generate.choice without being visible to the user. A sketch combining both ideas follows below.
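
A rough sketch of how the two ideas could combine (is_sentencepiece_based and choice_pattern are hypothetical helpers, not part of Outlines; fast tokenizers may not expose sp_model_kwargs, hence the slow tokenizer below):

import re
from transformers import AutoTokenizer

def is_sentencepiece_based(tokenizer):
    # Fix 1: slow SentencePiece-backed tokenizers in Transformers expose
    # an `sp_model_kwargs` attribute (a heuristic, not a guarantee).
    return hasattr(tokenizer, "sp_model_kwargs")

def choice_pattern(labels, tokenizer):
    # Fix 2: prepend a space to each label before building the choice
    # regex so the canonical '▁'-prefixed first tokens remain reachable.
    if is_sentencepiece_based(tokenizer):
        labels = [" " + label for label in labels]
    return "|".join(re.escape(label) for label in labels)

tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/Phi-3.5-mini-instruct", use_fast=False
)
print(choice_pattern(["Pizza", "Pasta", "Salad", "Dessert"], tokenizer))
# >  Pizza| Pasta| Salad| Dessert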
