Description
I would like to be able to pick a second, super-tiny model as my draft model and enable speculative decoding. Once both are loaded, I would hope that the two models cooperating could produce tokens faster than the big one alone.
If possible, I would also like the output to be colored according to whether each token was an accepted prediction from the draft model or not.
Use Case
When the output is highly predictable given the context (such as when producing an edited copy of text from a previous response, or quoting from the system prompt), speculative decoding could accelerate token generation significantly without adding much overhead to the baseline generation speed (assuming a good choice of draft model).
Background
- Speculative decoding support added to llama.cpp in November 2024: server : add speculative decoding support ggml-org/llama.cpp#10455
- Configuration options for speculative decoding added to llama.rn in December 2024: https://github.com/mybigday/llama.rn/blame/e5d53800af35cb0df1642c4b7dd9a882dd409f51/cpp/common.h#L187
- Speculative decoding support in LM Studio, which offers a nice example of UI design for selecting a draft model and for showing whether tokens were predicted by the draft model or not.
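
For illustration only, here is a rough sketch of what enabling this might look like through llama.rn's existing `initLlama` / `completion` API. The `model_draft`, `draft_max`, and `draft_min` parameters are hypothetical placeholders (the current JS API does not necessarily expose the speculative options from `common.h`); the names simply mirror the corresponding llama.cpp settings.

```typescript
// Hypothetical sketch: `model_draft`, `draft_max`, and `draft_min` are
// placeholder names, not part of the current llama.rn API.
import { initLlama } from 'llama.rn';

async function run() {
  const context = await initLlama({
    model: '/path/to/big-model.gguf',        // main (target) model
    model_draft: '/path/to/tiny-model.gguf', // hypothetical: small draft model
    draft_max: 16,                           // hypothetical: max tokens drafted per step
    draft_min: 4,                            // hypothetical: min tokens drafted per step
    n_ctx: 4096,
    n_gpu_layers: 99,
  });

  const result = await context.completion(
    {
      prompt: 'Rewrite the previous paragraph with the typo fixed:',
      n_predict: 256,
    },
    (tokenData) => {
      // If the backend reported which tokens came from accepted draft
      // predictions, the UI could color them differently here.
      console.log(tokenData.token);
    }
  );

  console.log(result.text);
}

run();
```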