Description
I would like to be able to pick a second, super-tiny model as my draft model and enable speculative decoding. Once both are loaded, I would hope that the two models cooperating could produce tokens faster than the big one alone.
If possible, I would also like the output to be colored according to whether each token was an accepted prediction from the draft model or not.
Use Case
When the output is highly predictable given the context (such as when producing an edited copy of text from a previous response, or quoting from the system prompt), speculative decoding could accelerate token generation significantly without adding much overhead to the baseline generation speed (assuming a good choice of draft model).
Background
- Speculative decoding support added to llama.cpp in November 2024: server : add speculative decoding support ggml-org/llama.cpp#10455
- Configuration options for speculative decoding added to llama.rn in December 2024: https://github.com/mybigday/llama.rn/blame/e5d53800af35cb0df1642c4b7dd9a882dd409f51/cpp/common.h#L187
- Speculative decoding support in LM Studio, which offers a nice example of UI design for selecting a draft model and for showing whether tokens were predicted by the draft model or not.
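
For illustration only, here is a rough sketch of what enabling this might look like through llama.rn's existing `initLlama` / `completion` API. The `model_draft`, `draft_max`, and `draft_min` parameters are hypothetical placeholders (the current JS API does not necessarily expose the speculative options from `common.h`); the names simply mirror the corresponding llama.cpp settings.

```typescript
// Hypothetical sketch: `model_draft`, `draft_max`, and `draft_min` are
// placeholder names, not part of the current llama.rn API.
import { initLlama } from 'llama.rn';

async function run() {
  const context = await initLlama({
    model: '/path/to/big-model.gguf',        // main (target) model
    model_draft: '/path/to/tiny-model.gguf', // hypothetical: small draft model
    draft_max: 16,                           // hypothetical: max tokens drafted per step
    draft_min: 4,                            // hypothetical: min tokens drafted per step
    n_ctx: 4096,
    n_gpu_layers: 99,
  });

  const result = await context.completion(
    {
      prompt: 'Rewrite the previous paragraph with the typo fixed:',
      n_predict: 256,
    },
    (tokenData) => {
      // If the backend reported which tokens came from accepted draft
      // predictions, the UI could color them differently here.
      console.log(tokenData.token);
    }
  );

  console.log(result.text);
}

run();
```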