New hierarchical+parallel speculative decoding method, badly named "Lookahead" #5091

bullno1 · 2024-01-23T04:16:13Z

bullno1
Jan 23, 2024

Paper: https://arxiv.org/abs/2312.12728v2

So instead of drafting only a single sequence, draft multiple sequences.
Then the model can validate several candidate sequences in one step instead of just one.
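
To make the idea concrete, here is a minimal sketch of multi-draft verification under greedy decoding. It is not the paper's code or llama.cpp's: `longest_accepted` and the `next_token` callback are hypothetical names, and the target model is called once per position for readability, whereas a real implementation would score all drafts in one batched pass.

```python
from typing import Callable, List, Sequence


def longest_accepted(
    context: List[int],
    drafts: Sequence[Sequence[int]],
    next_token: Callable[[List[int]], int],
) -> List[int]:
    """Return the longest draft prefix that matches the target model's greedy output.

    `next_token` is a stand-in (assumption) for the target model: given the
    tokens so far, it returns the token the model would pick next.
    """
    best: List[int] = []
    for draft in drafts:
        accepted: List[int] = []
        for tok in draft:
            # Stop at the first token the target model disagrees with.
            if next_token(context + accepted) != tok:
                break
            accepted.append(tok)
        if len(accepted) > len(best):
            best = accepted
    return best


def toy_model(ctx: List[int]) -> int:
    # Toy "model" that always predicts the previous token plus one (mod 10).
    return (ctx[-1] + 1) % 10


result = longest_accepted([1, 2], [[3, 4, 9], [3, 4, 5, 6], [7]], toy_model)
```

With a single draft, only `[3, 4]` or nothing would be accepted here; with three drafts, the second one is accepted in full.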

They also use a trie to utilize the KV cache efficiently.
This works because draft sequences will most likely share prefixes, just like in beam search.
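
A rough illustration of why shared prefixes help: merging the drafts into a trie means each shared prefix token is represented once, not once per draft. This is a sketch under assumptions, with hypothetical helper names (`build_trie`, `count_nodes`) and plain nested dicts; it only counts tokens and does not touch a real KV cache.

```python
from typing import Dict, Sequence


def build_trie(drafts: Sequence[Sequence[int]]) -> Dict[int, dict]:
    """Merge draft token sequences into a nested-dict trie keyed by token id."""
    root: Dict[int, dict] = {}
    for draft in drafts:
        node = root
        for tok in draft:
            # Reuse the existing child when drafts share a prefix.
            node = node.setdefault(tok, {})
    return root


def count_nodes(node: Dict[int, dict]) -> int:
    """Count trie nodes, i.e. distinct (prefix, token) positions to evaluate."""
    return sum(1 + count_nodes(child) for child in node.values())


drafts = [[5, 6, 7], [5, 6, 8], [5, 9]]
trie = build_trie(drafts)
flat_tokens = sum(len(d) for d in drafts)  # tokens if each draft is evaluated separately
trie_tokens = count_nodes(trie)            # tokens once shared prefixes are merged
```

In this toy case the three drafts contain 8 tokens in total, but the trie has only 5 nodes, so 3 token positions would not need to be computed twice.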

The actual draft method is not explicitly specified, so it can be generic.
It seems to be mostly just n-gram lookup, though, because the output is grounded (i.e. copied from the prompt/data).
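
For reference, n-gram lookup drafting can be sketched roughly as follows. This is one plausible reading, not the paper's exact method: the function name `ngram_drafts` and the parameters `n`, `draft_len`, and `max_drafts` are assumptions made for illustration.

```python
from typing import List


def ngram_drafts(
    tokens: List[int], n: int = 2, draft_len: int = 4, max_drafts: int = 4
) -> List[List[int]]:
    """Propose drafts by finding earlier occurrences of the last n tokens.

    Each earlier match contributes the tokens that followed it as one draft.
    """
    if len(tokens) < n:
        return []
    key = tokens[-n:]
    drafts: List[List[int]] = []
    # Scan from the most recent earlier position backwards; the trailing
    # occurrence of the key itself is skipped because nothing follows it yet.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == key:
            continuation = tokens[start + n:start + n + draft_len]
            if continuation and continuation not in drafts:
                drafts.append(continuation)
                if len(drafts) == max_drafts:
                    break
    return drafts


history = [1, 2, 3, 4, 1, 2, 5, 6, 1, 2]
proposals = ngram_drafts(history, n=2, draft_len=4)
```

Because the last two tokens `1, 2` appear twice earlier in `history`, two different continuations come back as drafts, which is where multiple candidate sequences with a shared origin come from.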

I think it is doable with the current API.

ggerganov · 2024-01-23T07:03:37Z

ggerganov
Jan 23, 2024
Maintainer

This is already implemented in the speculative example (#3624)

1 reply

bullno1 Jan 23, 2024
Author

I see. Somehow I missed that.

So it didn't improve speed? Is that because of a low acceptance rate?
