Will it be possible to use speculative sampling in llama.cpp? #2854
qnixsynapse started this conversation in Ideas
-
You can track the progress here: #2030
-
I was reading Fast Inference from Transformers via Speculative Decoding by Yaniv Leviathan et al., in which a smaller approximation model (with far fewer parameters) aids the decoding of a larger target model (the model actually being inferenced, which has many more parameters).
The paper claims, "inference from large models is often not bottlenecked on arithmetic operations, but rather on memory bandwidth and communication".
This might be true for CPUs and for GPUs that do not have enough memory bandwidth.
It also claims, "This approach works because some language modeling tasks have easier subtasks that can be solved by simpler/smaller models".
So I am wondering whether this is feasible for llama.cpp, and what would need to be done to implement it.
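For reference, here is a minimal sketch of the draft-then-verify loop the paper describes. The callables `draft_probs_fn` and `target_probs_fn` are hypothetical stand-ins (not llama.cpp API); each is assumed to return a `{token: probability}` dict for the next token given a context. A real llama.cpp implementation would instead run batched evaluations of the two models, but the accept/reject logic would be the same.

```python
import random

def sample(probs):
    """Sample a token from a {token: weight} dict (weights need not sum to 1)."""
    total = sum(probs.values())
    r = random.random() * total
    for tok, w in probs.items():
        r -= w
        if r <= 0:
            return tok
    return next(iter(probs))

def speculative_step(ctx, draft_probs_fn, target_probs_fn, k=4):
    """Draft k tokens with the small model, then verify them with the target model."""
    # 1. Let the small draft model propose k tokens autoregressively.
    drafted = []
    draft_ctx = list(ctx)
    for _ in range(k):
        p = draft_probs_fn(draft_ctx)
        tok = sample(p)
        drafted.append((tok, p))
        draft_ctx.append(tok)

    # 2. Verify the drafted tokens with the large target model (one batched
    #    forward pass in practice; queried position by position here).
    accepted = []
    verify_ctx = list(ctx)
    for tok, p_draft in drafted:
        p_target = target_probs_fn(verify_ctx)
        # Accept with probability min(1, p_target(tok) / p_draft(tok)).
        if random.random() < min(1.0, p_target.get(tok, 0.0) / max(p_draft[tok], 1e-12)):
            accepted.append(tok)
            verify_ctx.append(tok)
        else:
            # Rejected: resample once from the residual distribution
            # max(0, p_target - p_draft) and stop accepting further drafts.
            residual = {t: max(0.0, p_target.get(t, 0.0) - p_draft.get(t, 0.0))
                        for t in p_target}
            accepted.append(sample(residual))
            break
    # (The paper also samples one extra "bonus" token from the target model
    # when all k drafts are accepted; omitted here for brevity.)
    return accepted
```

The speedup comes from the fact that verifying k drafted tokens costs roughly one forward pass of the large model, so whenever the draft model guesses several tokens correctly, the target model effectively emits multiple tokens per pass while the output distribution stays identical to sampling from the target model alone.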