Description
Describe the issue as clearly as possible:
Issue created together with @mcognetta and @jylee-k
Outlines does not always allow generating the canonical tokenization of a text for tokenizers that prepend special symbols to their inputs, which includes SentencePiece-based tokenizers (Llama, Phi, etc.).
We noticed this while using generate.choice, but it is likely applicable to other downstream constrained generation scenarios.
Specifically, for the model microsoft/Phi-3.5-mini-instruct, the label 'Pizza' is tokenized to [(349, '▁P'), (24990, 'izza')], but the underlying automaton prevents generating (349, '▁P') as the first token while allowing every other way to start a word with P, e.g., (29925, 'P'), (12197, 'Pi'), or (11868, 'Pa').
We tried several things like adding a space before the labels (when constructing the choice generation object) or adding/removing spaces from the prompt, but none fully resolved the issue.
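For illustration, the variations we tried looked roughly like the following (the prompt string is made up, and model / labels are as defined in the reproduction below):

import outlines

# Variation 1: prepend a space to each label when building the choice generator
gen_choice = outlines.generate.choice(model, [" " + label for label in labels])

# Variation 2: add/remove a trailing space in the prompt
print(gen_choice("What should we eat tonight? Answer: "))  # trailing space
print(gen_choice("What should we eat tonight? Answer:"))   # no trailing space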
Our code is based on the example from the README: https://github.com/dottxt-ai/outlines?tab=readme-ov-file#multiple-choices
Steps/code to reproduce the bug:
import outlines
def encode_text(model, raw):
    tokenizer = model.tokenizer.tokenizer
    if isinstance(raw, str):
        raw = [raw]
    token_ids = tokenizer(raw, add_special_tokens=False)["input_ids"]
    tokens = [
        tokenizer.convert_ids_to_tokens(t_ids)
        for t_ids in token_ids
    ]
    return [
        list(zip(t_ids, ts))
        for t_ids, ts in zip(token_ids, tokens)
    ]
model = outlines.models.transformers("microsoft/Phi-3.5-mini-instruct")
labels = ["Pizza", "Pasta", "Salad", "Dessert"]
# How are our labels tokenized?
encoded_labels = encode_text(model, labels)
# > [[(349, '▁P'), (24990, 'izza')], [(349, '▁P'), (5427, 'asta')], [(3956, '▁Sal'), (328, 'ad')], [(360, '▁D'), (404, 'ess'), (814, 'ert')]]
gen_choice = outlines.generate.choice(model, labels)
automaton = gen_choice.logits_processor.guide.get_index_dict()
# Which tokens are "allowed" in the first generation step?
allowed_token_ids = list(automaton[0].keys())
allowed_tokens = model.tokenizer.tokenizer.convert_ids_to_tokens(allowed_token_ids)
list(zip(allowed_token_ids, allowed_tokens))
# > [(17618, 'Sa'), (71, '<0x44>'), (29925, 'P'), (12197, 'Pi'), (11868, 'Pa'), (86, '<0x53>'), (29903, 'S'), (29928, 'D'), (4002, 'Des'), (83, '<0x50>'), (2772, 'De'), (20392, 'Sal')]
# Notice none of the first tokens of the label encodings are present.
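# As a quick sanity check (ours, not part of the README example), intersect the
# canonical first-token ids of the labels with the allowed first-step ids:
canonical_first_ids = {encoding[0][0] for encoding in encoded_labels}
print(canonical_first_ids & set(allowed_token_ids))
# > set()  (the '▁'-prefixed first tokens 349, 3956, and 360 are all absent)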
# Looking at the final token probabilities/log-probabilities,
# we observe that the spaced first tokens are not represented
# > [((29928, 'D'), 0.6051735877990723, -0.5022398829460144),
# >  ((29925, 'P'), 0.2431625872850418, -1.414025068283081),
# >  ((29903, 'S'), 0.1414429396390915, -1.9558589458465576),
# >  ((2772, 'De'), 0.004034874960780144, -5.512779712677002),
# >  ((4002, 'Des'), 0.003415965009480715, -5.679295063018799),
# >  ((17618, 'Sa'), 0.002744789468124509, -5.898050785064697),
# >  ((20392, 'Sal'), 2.248118289571721e-05, -10.702832221984863),
# >  ((11868, 'Pa'), 2.7291971491649747e-06, -12.811503410339355),
# >  ((12197, 'Pi'), 2.0345812146160824e-08, -17.710390090942383),
# >  ((83, '<0x50>'), 5.7030637989896604e-09, -18.982261657714844),
# >  ((86, '<0x53>'), 5.697236460378008e-09, -18.98328399658203),
# >  ((71, '<0x44>'), 5.690426796434167e-09, -18.984479904174805)]
Expected result:
A constrained generation library should restrict generation to the set of token sequences that lead to valid surface forms; in particular, the canonical tokenization of each choice should be generable.
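To make this concrete, here is a rough check we would expect to pass, assuming (as in the reproduction above) that state 0 is the initial state and that get_index_dict() maps each state to a {token_id: next_state} dictionary; the helper name is ours:

def canonical_path_allowed(index_dict, token_ids, start_state=0):
    # Walk the token ids through the guide's transition table and report
    # whether every step is an allowed transition.
    state = start_state
    for token_id in token_ids:
        transitions = index_dict.get(state, {})
        if token_id not in transitions:
            return False
        state = transitions[token_id]
    return True

for label, encoding in zip(labels, encoded_labels):
    ids = [token_id for token_id, _ in encoding]
    print(label, canonical_path_allowed(automaton, ids))
# We would expect True for every label; with the automaton above, all four
# labels fail at their first, '▁'-prefixed token.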
Error message:
Outlines/Python version information:
Version information
outlines==0.1.14
Context for the issue:
At this point, we'd like to mention that we might be missing some parameter(s) resulting in this behavior. Please let us know if that is the case :)
How to possibly fix the issue:
- The arguably most popular case, SentencePiece-based tokenizers, could be detected (for the Transformers library) by checking whether the tokenizer has an "sp_model_kwargs" property, and the regex could be modified accordingly.
- Adding spaces in front of the labels (lists or enums) internally, before the regexes are generated, could mitigate this difference between tokenizers for generate.choice while remaining invisible to the user (see the sketch below).
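As a rough sketch of the second idea (the helper and its placement are hypothetical, not actual Outlines internals):

import re

def build_choice_regex(labels, hf_tokenizer):
    # Hypothetical helper: if the tokenizer looks SentencePiece-based
    # (e.g., it exposes `sp_model_kwargs`), prepend a space to each label so
    # that the '▁'-prefixed canonical tokenization also matches the regex.
    if hasattr(hf_tokenizer, "sp_model_kwargs"):
        labels = [" " + label for label in labels]
    return "(" + "|".join(re.escape(label) for label in labels) + ")"

The leading space could then be stripped from the returned string so the change stays invisible to the user.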