Replies: 3 comments 6 replies
-
All kinds of output-shaping systems (llama.cpp grammars, jsonformer, guidance, ...) basically filter out unwanted tokens before sampling is done.
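That filtering step is easy to picture in code. Below is a minimal sketch of grammar-constrained decoding, assuming a hypothetical `model`/`grammar` interface (`forward`, `allowed_tokens`, `advance`, and `is_complete` are illustrative names, not llama.cpp's actual API); the point is only that illegal tokens are masked out before each sampling step:

```python
import math
import random

def constrained_sample(model, grammar, prompt_ids, max_tokens=256):
    """Sampling loop with a grammar mask applied before each sampling step."""
    ids = list(prompt_ids)
    for _ in range(max_tokens):
        logits = model.forward(ids)         # hypothetical: next-token logits
        allowed = grammar.allowed_tokens()  # hypothetical: token IDs legal now
        # Filter BEFORE sampling: disallowed tokens never get a chance.
        masked = {t: logits[t] for t in allowed}
        # Softmax over the surviving candidates only, then sample one.
        m = max(masked.values())
        weights = {t: math.exp(v - m) for t, v in masked.items()}
        total = sum(weights.values())
        r, acc = random.random() * total, 0.0
        for tok, w in weights.items():
            acc += w
            if acc >= r:
                break
        ids.append(tok)
        grammar.advance(tok)                # hypothetical: consume the token
        if grammar.is_complete() or tok == model.eos_token_id:
            break
    return ids
```

Because the mask is recomputed at every step, the grammar is enforced token by token during generation rather than checked after the fact.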
-
I'm also interested in this. I'm getting some unexpected behavior where my grammar never hits its terminal branch, even though I can get what would be a terminal response from the LLM (dolphin-mistral in this case) without any grammar.
-
@cab938 it must go token by token. See my comment explaining how I fixed my ReAct grammar for my reasoning.
-
I'm teaching a short course on how to use llama.cpp with the Python bindings to run Llama 2 on the CPU. I'm touching on some of the main API points, and I'd like to cover grammars in llama.cpp as well, since I think it's a pretty useful feature. I'm having trouble finding documentation about it, so pointers are welcome if I've just missed a write-up.
Specifically, I'm wondering whether grammars are evaluated on a token-by-token basis (e.g. while streaming), or only after the LLM has finished generating the output sequence. Or are they perhaps applied even before each token is chosen, i.e. used to reduce the set of candidate tokens in a way similar to the top_k/top_p parameters?
Any advice would be great!
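For reference: with the llama-cpp-python bindings, a GBNF grammar is compiled once and passed into the generation call, where it constrains sampling. A minimal sketch, assuming a local GGUF model (the path and the toy yes/no grammar are illustrative):

```python
from llama_cpp import Llama, LlamaGrammar

# Toy GBNF grammar: the model may only answer "yes" or "no".
gbnf = r'''
root ::= "yes" | "no"
'''

llm = Llama(model_path="./llama-2-7b-chat.Q4_K_M.gguf")  # illustrative path
grammar = LlamaGrammar.from_string(gbnf)

out = llm(
    "Is the sky blue? Answer yes or no: ",
    grammar=grammar,
    max_tokens=8,
)
print(out["choices"][0]["text"])  # constrained to "yes" or "no"
```

As the replies above note, the constraint is applied before each token is chosen, which puts it much closer in spirit to top_k/top_p filtering than to post-hoc validation of the finished output.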