This is already implemented in the speculative example (#3624).
Paper: https://arxiv.org/abs/2312.12728v2
Instead of drafting only a single sequence, draft multiple sequences.
The target model can then validate several candidate sequences instead of just one.
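A minimal sketch of greedy verification over several drafts, assuming a hypothetical `target_next_token(context)` that returns the target model's greedy next token. It is not the paper's code, and it loops token by token for clarity, whereas a real implementation would score all drafts in one batched forward pass.

```python
from typing import Callable, List, Sequence, Tuple

Token = int
# Hypothetical stand-in for the target model: context -> greedy next token.
NextToken = Callable[[Sequence[Token]], Token]


def verify_drafts(
    prefix: Sequence[Token],
    drafts: Sequence[Sequence[Token]],
    target_next_token: NextToken,
) -> Tuple[int, List[Token]]:
    """Return (index of best draft, tokens to append).

    Draft tokens are accepted while they match the target's greedy choice.
    The target's own token at the first mismatch (or after a fully accepted
    draft) is appended as a bonus, so every step yields at least one token.
    Index -1 means no draft token was accepted.
    """
    best_idx = -1
    best_tokens: List[Token] = [target_next_token(prefix)]
    for idx, draft in enumerate(drafts):
        context = list(prefix)
        accepted: List[Token] = []
        for tok in draft:
            if tok != target_next_token(context):
                break
            accepted.append(tok)
            context.append(tok)
        # Bonus token from the target after the accepted part.
        accepted.append(target_next_token(context))
        if len(accepted) > len(best_tokens):
            best_idx, best_tokens = idx, accepted
    return best_idx, best_tokens


def toy_target(ctx: Sequence[Token]) -> Token:
    # Toy "model" that always predicts last token + 1.
    return ctx[-1] + 1 if ctx else 0


print(verify_drafts([1, 2], [[3, 9], [3, 4, 5, 7], [8]], toy_target))
```

Here the second draft wins: three of its tokens are accepted plus one bonus token, so a single verification step advances by four tokens.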
They also use a trie to make efficient use of the KV cache.
This works because the draft sequences will most likely share prefixes, just like in beam search, so a shared prefix only needs to be stored and evaluated once.
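To illustrate the prefix sharing, here is a sketch that merges drafts into a trie, flattens it into one token list with parent indices, and builds a tree attention mask where each node attends only to itself and its ancestors. This layout is an assumption (a common "tree attention" scheme), not necessarily the exact one used in the paper.

```python
from typing import Dict, List, Sequence, Tuple


class TrieNode:
    def __init__(self, token: int = -1) -> None:
        self.token = token  # -1 marks the (virtual) root
        self.children: Dict[int, "TrieNode"] = {}


def build_trie(drafts: Sequence[Sequence[int]]) -> TrieNode:
    root = TrieNode()
    for draft in drafts:
        node = root
        for tok in draft:
            child = node.children.get(tok)
            if child is None:
                child = node.children[tok] = TrieNode(tok)
            node = child
    return root


def flatten(root: TrieNode) -> Tuple[List[int], List[int]]:
    """Pre-order DFS; parents[i] is the index of node i's parent (-1 = root)."""
    tokens: List[int] = []
    parents: List[int] = []
    stack = [(c, -1) for c in reversed(list(root.children.values()))]
    while stack:
        node, parent = stack.pop()
        idx = len(tokens)
        tokens.append(node.token)
        parents.append(parent)
        # Push in reverse so the first child is visited first.
        for c in reversed(list(node.children.values())):
            stack.append((c, idx))
    return tokens, parents


def tree_attention_mask(parents: Sequence[int]) -> List[List[bool]]:
    """mask[i][j] is True iff node j is node i or one of its ancestors."""
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:
            mask[i][j] = True
            j = parents[j]
    return mask


drafts = [[5, 6, 7], [5, 6, 8], [5, 9]]
tokens, parents = flatten(build_trie(drafts))
print(tokens, parents)
```

For these drafts only 5 tokens (and 5 KV cache entries) are evaluated instead of 8; each node's position id would be the prompt length plus its depth in the tree.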
The actual drafting method is not explicitly described, and it could be generic.
However, it seems to be mostly n-gram lookup, which works well because the output is grounded (i.e. copied from the prompt/data).
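A rough sketch of how n-gram lookup could produce several drafts at once, assuming the simplest variant: match the last `n` tokens against earlier positions in the context and take the tokens that followed each match. The parameters `n`, `draft_len` and `max_drafts` are illustrative choices, not values from the paper.

```python
from typing import List, Sequence, Set, Tuple


def ngram_drafts(
    tokens: Sequence[int],
    n: int = 2,
    draft_len: int = 3,
    max_drafts: int = 4,
) -> List[List[int]]:
    """Propose up to max_drafts distinct continuations of tokens."""
    if n <= 0 or len(tokens) <= n:
        return []
    key = list(tokens[-n:])
    drafts: List[List[int]] = []
    seen: Set[Tuple[int, ...]] = set()
    # Scan earlier positions, most recent first, excluding the key itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if list(tokens[start:start + n]) != key:
            continue
        cont = list(tokens[start + n:start + n + draft_len])
        if cont and tuple(cont) not in seen:
            seen.add(tuple(cont))
            drafts.append(cont)
            if len(drafts) == max_drafts:
                break
    return drafts


print(ngram_drafts([1, 2, 3, 4, 1, 2, 5, 6, 1, 2]))
```

Both earlier occurrences of `[1, 2]` yield a candidate, so the target model gets two drafts to verify instead of one; when candidates share a prefix, the trie above merges them.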
I think it is doable with the current API.