Look in
So I guess to actually use this, there would need to be a draft model in addition to the full transformer, and both would have to be loaded into memory at the same time; then a server parameter would be needed to select the number of speculative samples, if I understood the example correctly. I don't completely understand where the draft model is supposed to come from, though.
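A minimal sketch of how that loop might look, passing the two models in as callables; `draft_sample`, `target_batch_probs`, and `n_spec` are hypothetical placeholders I made up for illustration, not anything that exists in the codebase:

```python
# Sketch of one speculative decoding step: a cheap draft model proposes
# n_spec tokens, then the full model scores them all in a single batched
# forward pass. All model APIs here are hypothetical placeholders.

def speculative_step(draft_sample, target_batch_probs, context, n_spec):
    """draft_sample(ctx) -> next token id from the draft model.
    target_batch_probs(ctx, draft) -> one probability row per draft
    position, produced by ONE call to the full model."""
    # 1. Draft model proposes n_spec tokens autoregressively (cheap).
    draft, ctx = [], list(context)
    for _ in range(n_spec):
        tok = draft_sample(ctx)
        draft.append(tok)
        ctx.append(tok)

    # 2. Full model scores all positions at once: one expensive call
    #    can yield up to n_spec accepted tokens instead of just one.
    probs = target_batch_probs(context, draft)

    # 3. Greedy acceptance for simplicity: keep draft tokens while they
    #    match the full model's argmax, substitute its choice at the
    #    first mismatch. (The paper uses a stochastic accept/reject
    #    rule instead; see the sketch further down.)
    out = []
    for i, tok in enumerate(draft):
        best = max(range(len(probs[i])), key=probs[i].__getitem__)
        out.append(tok if best == tok else best)
        if best != tok:
            break
    return out
```

The draft model would presumably be a much smaller model with the same tokenizer, so its per-token cost is negligible next to the full transformer's.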
The paper claims a 2-2.5x speedup by getting multiple tokens per call to the full transformer, which could help a lot with large models living mostly in RAM rather than VRAM: https://arxiv.org/abs/2302.01318. It seems textsynth already has this feature.
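The trick in the paper that makes this exact rather than an approximation is a modified rejection-sampling rule: a draft token `x` is accepted with probability `min(1, p[x]/q[x])`, where `p` and `q` are the full-model and draft-model distributions; on rejection, a replacement is drawn from the normalized residual `max(0, p - q)`. A rough sketch of just that rule, assuming plain probability lists:

```python
import random

def accept_or_resample(p, q, x):
    """Modified rejection sampling from arXiv:2302.01318.

    p: full-model probabilities over the vocabulary
    q: draft-model probabilities over the vocabulary
    x: token index proposed by the draft model (so q[x] > 0)
    Returns (token, accepted_flag). The output token is distributed
    exactly according to p, which is why there is no quality loss.
    """
    if random.random() < min(1.0, p[x] / q[x]):
        return x, True  # draft token accepted
    # Rejected: sample from the normalized residual max(0, p - q).
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    r = random.random() * sum(residual)
    acc = 0.0
    for i, w in enumerate(residual):
        acc += w
        if r <= acc:
            return i, False
    return len(p) - 1, False  # numerical fallback
```

Because accepted tokens follow the full model's distribution exactly, the speedup depends only on how often the draft model agrees with the full model, not on any quality trade-off.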