Look in
So I guess to actually use this, there would need to be a draft model in addition to the full transformer, and both would have to be loaded into memory at the same time; then a server parameter would be needed to select the number of speculative samples, if I understood the example correctly. I don't completely understand where the draft model is supposed to come from, though.
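A minimal sketch of how that loop might look, passing the two models in as callables; `draft_sample`, `target_batch_probs`, and `n_spec` are hypothetical placeholders I made up for illustration, not anything that exists in the codebase:

```python
# Sketch of one speculative decoding step: a cheap draft model proposes
# n_spec tokens, then the full model scores them all in a single batched
# forward pass. All model APIs here are hypothetical placeholders.

def speculative_step(draft_sample, target_batch_probs, context, n_spec):
    """draft_sample(ctx) -> next token id from the draft model.
    target_batch_probs(ctx, draft) -> one probability row per draft
    position, produced by ONE call to the full model."""
    # 1. Draft model proposes n_spec tokens autoregressively (cheap).
    draft, ctx = [], list(context)
    for _ in range(n_spec):
        tok = draft_sample(ctx)
        draft.append(tok)
        ctx.append(tok)

    # 2. Full model scores all positions at once: one expensive call
    #    can yield up to n_spec accepted tokens instead of just one.
    probs = target_batch_probs(context, draft)

    # 3. Greedy acceptance for simplicity: keep draft tokens while they
    #    match the full model's argmax, substitute its choice at the
    #    first mismatch. (The paper uses a stochastic accept/reject
    #    rule instead; see the sketch further down.)
    out = []
    for i, tok in enumerate(draft):
        best = max(range(len(probs[i])), key=probs[i].__getitem__)
        out.append(tok if best == tok else best)
        if best != tok:
            break
    return out
```

The draft model would presumably be a much smaller model with the same tokenizer, so its per-token cost is negligible next to the full transformer's.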
The paper claims a 2-2.5x speedup by getting multiple tokens per call to the full transformer, which could help a lot with large models living mostly in RAM rather than VRAM: https://arxiv.org/abs/2302.01318. It seems textsynth already has this feature.
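The trick in the paper that makes this exact rather than an approximation is a modified rejection-sampling rule: a draft token `x` is accepted with probability `min(1, p[x]/q[x])`, where `p` and `q` are the full-model and draft-model distributions; on rejection, a replacement is drawn from the normalized residual `max(0, p - q)`. A rough sketch of just that rule, assuming plain probability lists:

```python
import random

def accept_or_resample(p, q, x):
    """Modified rejection sampling from arXiv:2302.01318.

    p: full-model probabilities over the vocabulary
    q: draft-model probabilities over the vocabulary
    x: token index proposed by the draft model (so q[x] > 0)
    Returns (token, accepted_flag). The output token is distributed
    exactly according to p, which is why there is no quality loss.
    """
    if random.random() < min(1.0, p[x] / q[x]):
        return x, True  # draft token accepted
    # Rejected: sample from the normalized residual max(0, p - q).
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    r = random.random() * sum(residual)
    acc = 0.0
    for i, w in enumerate(residual):
        acc += w
        if r <= acc:
            return i, False
    return len(p) - 1, False  # numerical fallback
```

Because accepted tokens follow the full model's distribution exactly, the speedup depends only on how often the draft model agrees with the full model, not on any quality trade-off.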