Will it be possible to use speculative sampling in llama.cpp? #2854
qnixsynapse started this conversation in Ideas
-
You can track the progress here: #2030
-
I was reading Fast Inference from Transformers via Speculative Decoding by Yaniv Leviathan et al., in which a smaller approximation model (with far fewer parameters) aids the decoding of a larger target model (the model actually being inferenced, which has many more parameters).
The paper claims, "inference from large models is often not bottlenecked on arithmetic operations, but rather on memory bandwidth and communication".
This might be true for CPUs and for GPUs that do not have enough memory bandwidth.
It also claims, "This approach works because some language modeling tasks have easier subtasks that can be solved by simpler/smaller models".
So I am wondering whether this is feasible for llama.cpp, and what would need to be done to implement it.
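For reference, here is a minimal sketch of the draft-then-verify loop the paper describes. The callables `draft_probs_fn` and `target_probs_fn` are hypothetical stand-ins (not llama.cpp API); each is assumed to return a `{token: probability}` dict for the next token given a context. A real llama.cpp implementation would instead run batched evaluations of the two models, but the accept/reject logic would be the same.

```python
import random

def sample(probs):
    """Sample a token from a {token: weight} dict (weights need not sum to 1)."""
    total = sum(probs.values())
    r = random.random() * total
    for tok, w in probs.items():
        r -= w
        if r <= 0:
            return tok
    return next(iter(probs))

def speculative_step(ctx, draft_probs_fn, target_probs_fn, k=4):
    """Draft k tokens with the small model, then verify them with the target model."""
    # 1. Let the small draft model propose k tokens autoregressively.
    drafted = []
    draft_ctx = list(ctx)
    for _ in range(k):
        p = draft_probs_fn(draft_ctx)
        tok = sample(p)
        drafted.append((tok, p))
        draft_ctx.append(tok)

    # 2. Verify the drafted tokens with the large target model (one batched
    #    forward pass in practice; queried position by position here).
    accepted = []
    verify_ctx = list(ctx)
    for tok, p_draft in drafted:
        p_target = target_probs_fn(verify_ctx)
        # Accept with probability min(1, p_target(tok) / p_draft(tok)).
        if random.random() < min(1.0, p_target.get(tok, 0.0) / max(p_draft[tok], 1e-12)):
            accepted.append(tok)
            verify_ctx.append(tok)
        else:
            # Rejected: resample once from the residual distribution
            # max(0, p_target - p_draft) and stop accepting further drafts.
            residual = {t: max(0.0, p_target.get(t, 0.0) - p_draft.get(t, 0.0))
                        for t in p_target}
            accepted.append(sample(residual))
            break
    # (The paper also samples one extra "bonus" token from the target model
    # when all k drafts are accepted; omitted here for brevity.)
    return accepted
```

The speedup comes from the fact that verifying k drafted tokens costs roughly one forward pass of the large model, so whenever the draft model guesses several tokens correctly, the target model effectively emits multiple tokens per pass while the output distribution stays identical to sampling from the target model alone.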