Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
A while ago, a patch adding support for speculative decoding was merged into llama.cpp:
ggml-org/llama.cpp#10455
I noticed that ik_llama.cpp has --model-draft and --gpu-layers-draft, but as far as I can tell they do not do anything: I see no speed-up from using a draft model, and nothing in the logs indicates that a draft model is being loaded. ik_llama.cpp also lacks the options introduced by the pull request that implements speculative decoding, such as --draft-max, --draft-min, --device-draft, and --draft-p-min, and possibly others.
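For reference, a minimal sketch of how mainline llama.cpp exposes these options after that pull request; the request is for ik_llama.cpp to support the equivalent. File names and parameter values here are placeholders for illustration, not recommendations:

```sh
# Mainline llama.cpp usage sketch (post-#10455); file names are placeholders.
# --draft-max/--draft-min bound how many tokens the draft model proposes per
# step; --draft-p-min sets the minimum draft token probability for speculation.
./llama-server \
  -m DeepSeek-R1-Q4_K_M.gguf \
  --model-draft DeepSeek-R1-DRAFT-0.5B-v1.0-Q4_0.gguf \
  --gpu-layers-draft 99 \
  --draft-max 16 --draft-min 1 --draft-p-min 0.8
```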
Motivation
Recently, a draft model specifically for R1 was released: https://huggingface.co/jukofyork/DeepSeek-R1-DRAFT-0.5B-v1.0-GGUF. It would be great if it were possible to use it with ik_llama.cpp; potentially, it could provide a 1.5-2x inference speed-up.
Possible Implementation
No response