Replies: 1 comment
- YASSSSS! vllm does this easily. Would be awesome to see support in llamafile and llama-cpp!
- I have been playing with tabbyAPI and its support for draft models. In short, the performance benefit is very obvious, and I wonder what it could mean for inference on the CPU, or even a mixed CPU/NPU/GPU setup.
Intuitively (and this may be very wrong, of course), I think this could make 14B models more practically accessible to the GPU-poor; a rough sketch of the draft-and-verify idea is below.
See some ballpark numbers without speculative decoding here:
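To make the idea concrete, here is a minimal, self-contained sketch of a greedy speculative-decoding loop: a cheap draft model proposes a few tokens, the large target model checks them and keeps the longest agreeing prefix. The `draft_next` / `target_next` callables are hypothetical stand-ins, not tabbyAPI's or llama.cpp's actual API, and a real engine would verify all draft positions in a single batched forward pass rather than one call per position:

```python
from typing import Callable, List

def speculative_decode(
    prompt: List[int],
    draft_next: Callable[[List[int]], int],   # cheap draft model: next token id
    target_next: Callable[[List[int]], int],  # big target model: next token id
    k: int = 4,
    max_new_tokens: int = 64,
) -> List[int]:
    """Greedy speculative decoding: accept draft tokens until they disagree
    with what the target model would have produced at the same position."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1) Draft model speculates k tokens autoregressively (cheap).
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))

        # 2) Target model verifies the speculated positions.
        #    (A real engine does this in one batched forward pass.)
        accepted = 0
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            if expected == draft[i]:
                accepted += 1
            else:
                # First mismatch: keep the accepted prefix plus the
                # target model's own token for this position.
                tokens.extend(draft[:accepted])
                tokens.append(expected)
                break
        else:
            # All k draft tokens accepted; take one bonus token from the target.
            tokens.extend(draft)
            tokens.append(target_next(tokens))
    return tokens

# Toy usage: both "models" just count upward, so every draft token is accepted.
out = speculative_decode([1, 2, 3], lambda t: t[-1] + 1, lambda t: t[-1] + 1, k=4, max_new_tokens=8)
```

When the draft model agrees with the target most of the time, each expensive target step yields several accepted tokens, which is where the speedup comes from and why it looks attractive when the target model is the part that barely fits in (V)RAM.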