
Speculative decoding support #322

@Lissanro

Description


Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

A while ago, a patch adding speculative decoding support was merged into llama.cpp:
ggml-org/llama.cpp#10455
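
For context, speculative decoding has a small draft model cheaply propose several tokens that the large target model then verifies, keeping the longest accepted prefix. Below is a minimal toy sketch of that control flow (greedy decoding, stand-in models; not code from either project):

```python
def speculative_decode(target, draft, prompt, draft_max=8):
    """Toy greedy propose-then-verify loop (prompt must be non-empty).

    target and draft are callables mapping a token list to the next
    token. A real implementation verifies the whole draft in a single
    batched forward pass of the target model and compares probability
    distributions (cf. --draft-p-min); this sketch only shows the
    control flow that produces the speedup.
    """
    EOS = 0  # toy end-of-sequence token
    tokens = list(prompt)
    while tokens[-1] != EOS:
        # 1. The cheap draft model proposes up to draft_max tokens.
        proposal, ctx = [], list(tokens)
        for _ in range(draft_max):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
            if t == EOS:
                break
        # 2. Accept the longest prefix the target model agrees with.
        n = 0
        while n < len(proposal) and target(tokens + proposal[:n]) == proposal[n]:
            n += 1
        tokens += proposal[:n]
        # 3. The target emits one token itself, so every iteration
        #    makes progress even if the whole draft is rejected.
        if tokens[-1] != EOS:
            tokens.append(target(tokens))
    return tokens


# Toy models: the draft always agrees with the target, so drafts are accepted.
next_tok = lambda ctx: 0 if len(ctx) > 20 else (sum(ctx) % 9) + 1
print(speculative_decode(next_tok, next_tok, [1, 2, 3]))
```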

I noticed that ik_llama.cpp has --model-draft and --gpu-layers-draft, but as far as I can tell they do not do anything: I see no speedup from using a draft model, and nothing in the logs indicates that a draft model is being loaded. ik_llama.cpp also lacks the options that the pull request above introduced for speculative decoding, such as --draft-max, --draft-min, --device-draft, and --draft-p-min, and possibly others.
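
For reference, after that PR a mainline llama.cpp invocation looks roughly like the following (a hypothetical example; model file names and parameter values are placeholders). Supporting the same options in ik_llama.cpp is what this request is asking for:

```bash
./llama-server -m DeepSeek-R1-Q4_K_M.gguf \
    --model-draft DeepSeek-R1-DRAFT-0.5B-v1.0-Q8_0.gguf \
    --gpu-layers-draft 99 \
    --draft-max 16 --draft-min 5 --draft-p-min 0.9
# --device-draft can additionally pin the draft model to specific devices.
```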

Motivation

Recently, a draft model was made specifically for R1: https://huggingface.co/jukofyork/DeepSeek-R1-DRAFT-0.5B-v1.0-GGUF - it would be great if it were possible to use it with ik_llama.cpp. Potentially, it could provide a 1.5-2x inference speedup.

Possible Implementation

No response

Metadata


    Labels

    enhancement (New feature or request), help wanted (Extra attention is needed)
