[Feat]: draft model for speculative decoding #226

@rndmcnlly

Description

I would like to be able to pick a second, super-tiny model as my draft model and enable speculative decoding. Once both are loaded, the two models cooperating should be able to produce tokens faster than the big one alone.

If possible, I would like to be able to see coloring of the output based on whether the token was an accepted prediction from the draft model or not.
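To make the request concrete, here is a minimal sketch of how greedy speculative decoding with colored output might work. It uses toy stand-in "models" (plain next-token functions) rather than a real inference backend, and ANSI escape codes for the coloring; all names (`speculative_decode`, `target`, `draft`) are hypothetical, and a real implementation would verify the draft's proposals with a single batched forward pass instead of this sequential loop.

```python
# Minimal sketch of greedy speculative decoding with colored output.
# Assumption: `target` and `draft` are toy functions mapping a token
# list to the next token, standing in for real models.
GREEN, RED, RESET = "\033[32m", "\033[31m", "\033[0m"

def speculative_decode(target, draft, prompt, k=4, max_tokens=16):
    """Return (generated_tokens, colored_string).

    Accepted draft tokens are colored green; tokens the big (target)
    model had to generate itself are colored red.
    """
    tokens, colored = list(prompt), []
    while len(tokens) - len(prompt) < max_tokens:
        # 1. The cheap draft model proposes k tokens ahead.
        proposal, ctx = [], list(tokens)
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. The target model verifies them; keep the longest matching prefix.
        accepted = 0
        for t in proposal:
            if target(tokens) == t:
                tokens.append(t)
                colored.append(GREEN + t + RESET)  # accepted draft token
                accepted += 1
                if len(tokens) - len(prompt) >= max_tokens:
                    break
            else:
                break
        if len(tokens) - len(prompt) >= max_tokens:
            break
        # 3. On a mismatch, the target supplies the correct next token,
        #    so every round still makes at least one token of progress.
        if accepted < k:
            t = target(tokens)
            tokens.append(t)
            colored.append(RED + t + RESET)  # generated by the big model
    return tokens[len(prompt):], "".join(colored)

# Toy usage: the draft agrees with the target everywhere except one position,
# so most of the output comes out green with a single red correction.
target_seq = list("speculative")

def target(toks):
    return target_seq[len(toks)]

def draft(toks):
    i = len(toks) % len(target_seq)
    return "x" if i == 4 else target_seq[i]

out, colored = speculative_decode(target, draft, [], k=3,
                                  max_tokens=len(target_seq))
print("".join(out))
print(colored)
```

When the draft's acceptance rate is high, most tokens cost only a cheap draft step plus a shared verification, which is where the speedup in the predictable-output use case below would come from.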

Use Case
When the output is highly predictable given the context (such as producing an edited copy of text from a previous response, or quoting from a system prompt), this could accelerate token generation significantly without adding much overhead to baseline generation, assuming a good choice of draft model.

Background

Metadata

    Labels: enhancement (New feature or request)
    Assignees: none
    Projects: none
    Milestone: none