Describe the issue as clearly as possible:

Issue created together with @mcognetta and @jylee-k
Outlines does not always allow generating the canonical tokenization of text for classes of tokenizers that prepend special symbols to inputs, which includes SentencePiece-based tokenizers (Llama, Phi, etc.).
We noticed this while using generator.choice, but it is likely applicable to other downstream constrained generation scenarios.
Specifically, for the model microsoft/Phi-3.5-mini-instruct, the label 'Pizza' is tokenized to [(349, '▁P'), (24990, 'izza')], but the underlying automaton prevents the generation of (349, '▁P') as the first token while allowing all other ways to generate a word starting with P, e.g., (29925, 'P'), (12197, 'Pi'), or (11868, 'Pa').
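To make the mismatch concrete, here is a minimal, self-contained sketch using the token IDs reported in this issue as fixed data (no model download needed; the IDs are copied verbatim from the Phi-3.5 outputs shown in the reproduction code):

```python
# Token IDs copied from the Phi-3.5-mini-instruct outputs in this issue.
# Canonical ('▁'-prefixed) first tokens of the labels:
canonical_first = {349: "▁P", 3956: "▁Sal", 360: "▁D"}

# Tokens the Outlines automaton allows at the first generation step:
allowed_first = {
    17618: "Sa", 71: "<0x44>", 29925: "P", 12197: "Pi", 11868: "Pa",
    86: "<0x53>", 29903: "S", 29928: "D", 4002: "Des", 83: "<0x50>",
    2772: "De", 20392: "Sal",
}

# Every canonical first token is blocked:
blocked = sorted(set(canonical_first) - set(allowed_first))
print(blocked)  # [349, 360, 3956]
```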
We tried several things, like adding a space before the labels (when constructing the choice generation object) or adding/removing spaces from the prompt, but none fully resolved the issue. Our code is based on the example from the README: https://github.com/dottxt-ai/outlines?tab=readme-ov-file#multiple-choices

Steps/code to reproduce the bug:
import outlines

def encode_text(model, raw):
    tokenizer = model.tokenizer.tokenizer
    if isinstance(raw, str):
        raw = [raw]
    token_ids = tokenizer(raw, add_special_tokens=False)["input_ids"]
    tokens = [
        tokenizer.convert_ids_to_tokens(t_ids)
        for t_ids in token_ids
    ]
    return [
        list(zip(t_ids, ts))
        for t_ids, ts in zip(token_ids, tokens)
    ]
model = outlines.models.transformers("microsoft/Phi-3.5-mini-instruct")
labels = ["Pizza", "Pasta", "Salad", "Dessert"]

# How are our labels tokenized?
encoded_labels = encode_text(model, labels)
# > [[(349, '▁P'), (24990, 'izza')], [(349, '▁P'), (5427, 'asta')], [(3956, '▁Sal'), (328, 'ad')], [(360, '▁D'), (404, 'ess'), (814, 'ert')]]

gen_choice = outlines.generate.choice(model, labels)
automaton = gen_choice.logits_processor.guide.get_index_dict()

# Which tokens are "allowed" in the first generation step?
allowed_token_ids = list(automaton[0].keys())
allowed_tokens = model.tokenizer.tokenizer.convert_ids_to_tokens(allowed_token_ids)
list(zip(allowed_token_ids, allowed_tokens))
# > [(17618, 'Sa'), (71, '<0x44>'), (29925, 'P'), (12197, 'Pi'), (11868, 'Pa'), (86, '<0x53>'), (29903, 'S'), (29928, 'D'), (4002, 'Des'), (83, '<0x50>'), (2772, 'De'), (20392, 'Sal')]
# Notice none of the first tokens of the label encodings are present.

# Looking at the final token probabilities/log-probabilities,
# we observe that the spaced first tokens are not represented:
# [((29928, 'D'), 0.6051735877990723, -0.5022398829460144), ((29925, 'P'), 0.2431625872850418, -1.414025068283081), ((29903, 'S'), 0.1414429396390915, -1.9558589458465576), ((2772, 'De'), 0.004034874960780144, -5.512779712677002), ((4002, 'Des'), 0.003415965009480715, -5.679295063018799), ((17618, 'Sa'), 0.002744789468124509, -5.898050785064697), ((20392, 'Sal'), 2.248118289571721e-05, -10.702832221984863), ((11868, 'Pa'), 2.7291971491649747e-06, -12.811503410339355), ((12197, 'Pi'), 2.0345812146160824e-08, -17.710390090942383), ((83, '<0x50>'), 5.7030637989896604e-09, -18.982261657714844), ((86, '<0x53>'), 5.697236460378008e-09, -18.98328399658203), ((71, '<0x44>'), 5.690426796434167e-09, -18.984479904174805)]
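As a sanity check on the numbers above: the probabilities are consistent with the log-probabilities and sum to one, i.e., they are the softmax renormalized over the allowed token set only. A quick check, using (probability, log-probability) pairs copied from the output above:

```python
import math

# (probability, log-probability) pairs copied from the output above.
rows = [
    (0.6051735877990723, -0.5022398829460144),
    (0.2431625872850418, -1.414025068283081),
    (0.1414429396390915, -1.9558589458465576),
    (0.004034874960780144, -5.512779712677002),
    (0.003415965009480715, -5.679295063018799),
    (0.002744789468124509, -5.898050785064697),
    (2.248118289571721e-05, -10.702832221984863),
    (2.7291971491649747e-06, -12.811503410339355),
    (2.0345812146160824e-08, -17.710390090942383),
    (5.7030637989896604e-09, -18.982261657714844),
    (5.697236460378008e-09, -18.98328399658203),
    (5.690426796434167e-09, -18.984479904174805),
]

# Log-probs and probs agree, and the mass sums to ~1 over the allowed set:
assert all(math.isclose(p, math.exp(lp), rel_tol=1e-4) for p, lp in rows)
assert math.isclose(sum(p for p, _ in rows), 1.0, rel_tol=1e-6)
```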
Expected result:
A constrained-generation library should restrict the generated tokens to all sequences on the way to generating valid surface forms. In particular, the canonical tokenization should itself be generable.
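The expected invariant can be stated as a small check: the canonical token sequence of every choice should be a path through the automaton. A sketch, assuming a simplified view of the index as a dict mapping state -> {token_id: next_state}; the helper name is ours, not part of Outlines:

```python
def canonical_is_generable(encoded_label, index, start_state=0):
    """Return True iff the canonical token sequence is a path in `index`.

    `encoded_label` is a list of (token_id, token_str) pairs;
    `index` maps state -> {token_id: next_state} (a simplified view
    of the automaton's transition table).
    """
    state = start_state
    for token_id, _tok in encoded_label:
        transitions = index.get(state, {})
        if token_id not in transitions:
            return False
        state = transitions[token_id]
    return True

# Toy index that accepts the canonical path for 'Pizza':
toy_index = {0: {349: 1}, 1: {24990: 2}}
canonical_is_generable([(349, "▁P"), (24990, "izza")], toy_index)    # True

# The reported behavior: 349 ('▁P') missing from the first step:
buggy_index = {0: {29925: 1}, 1: {24990: 2}}
canonical_is_generable([(349, "▁P"), (24990, "izza")], buggy_index)  # False
```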
Error message:
Outlines/Python version information:
Version information
Python 3.9.7
outlines==0.1.14
Context for the issue:
At this point, we'd like to mention that we might be missing some parameter(s) resulting in this behavior. Please let us know if that is the case :)
How to possibly fix the issue:
The arguably most common case, SentencePiece-based tokenizers, could be detected (for the Transformers library) by checking whether the tokenizer has an "sp_model_kwargs" property, and the regex could be modified accordingly.
Alternatively, adding spaces in front of the labels (lists or enums) internally, before generating the regexes, could mitigate this difference between tokenizers for generate.choice without being visible to the user.