Replies: 3 comments 6 replies
-
All kinds of output-shaping systems (llama.cpp grammars, jsonformer, guidance, ...) basically filter out unwanted tokens before sampling is done.
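That filtering step is easy to picture in code. Below is a minimal sketch of grammar-constrained decoding, assuming a hypothetical `model`/`grammar` interface (`forward`, `allowed_tokens`, `advance`, and `is_complete` are illustrative names, not llama.cpp's actual API); the point is only that illegal tokens are masked out before each sampling step:

```python
import math
import random

def constrained_sample(model, grammar, prompt_ids, max_tokens=256):
    """Sampling loop with a grammar mask applied before each sampling step."""
    ids = list(prompt_ids)
    for _ in range(max_tokens):
        logits = model.forward(ids)         # hypothetical: next-token logits
        allowed = grammar.allowed_tokens()  # hypothetical: token IDs legal now
        # Filter BEFORE sampling: disallowed tokens never get a chance.
        masked = {t: logits[t] for t in allowed}
        # Softmax over the surviving candidates only, then sample one.
        m = max(masked.values())
        weights = {t: math.exp(v - m) for t, v in masked.items()}
        total = sum(weights.values())
        r, acc = random.random() * total, 0.0
        for tok, w in weights.items():
            acc += w
            if acc >= r:
                break
        ids.append(tok)
        grammar.advance(tok)                # hypothetical: consume the token
        if grammar.is_complete() or tok == model.eos_token_id:
            break
    return ids
```

Because the mask is recomputed at every step, the grammar is enforced token by token during generation rather than checked after the fact.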
-
I'm also interested in this. I'm getting some unexpected behavior where my grammar never hits its terminal branch, even though I can get what would be a terminal response from the LLM (dolphin-mistral in this case) without any grammar.
-
@cab938 it must go token by token. See my comment explaining how I fixed my ReAct grammar for my reasoning.
-
I'm teaching a short course on how to use llama.cpp with the Python bindings to run Llama 2 on the CPU. I'm touching on some of the main API points, and I'd like to cover grammars in llama.cpp as well, since I think it's a pretty useful feature. I'm having trouble finding documentation about it, so pointers are welcome if I've just missed a write-up.
Specifically, I'm wondering whether grammars are evaluated on a token-by-token basis (e.g. while streaming), or only after the LLM has finished generating the output sequence. Or are they perhaps applied even before each token is chosen, i.e. used to reduce the set of candidate tokens in a way similar to the top_k/top_p parameters?
Any advice would be great!
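For reference: with the llama-cpp-python bindings, a GBNF grammar is compiled once and passed into the generation call, where it constrains sampling. A minimal sketch, assuming a local GGUF model (the path and the toy yes/no grammar are illustrative):

```python
from llama_cpp import Llama, LlamaGrammar

# Toy GBNF grammar: the model may only answer "yes" or "no".
gbnf = r'''
root ::= "yes" | "no"
'''

llm = Llama(model_path="./llama-2-7b-chat.Q4_K_M.gguf")  # illustrative path
grammar = LlamaGrammar.from_string(gbnf)

out = llm(
    "Is the sky blue? Answer yes or no: ",
    grammar=grammar,
    max_tokens=8,
)
print(out["choices"][0]["text"])  # constrained to "yes" or "no"
```

As the replies above note, the constraint is applied before each token is chosen, which puts it much closer in spirit to top_k/top_p filtering than to post-hoc validation of the finished output.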