Replies: 2 comments
-
update: I ran a test and measured a sampling time of 1.6 ms/token at top-k 32000 and 0.1 ms/token at top-k 32. Give it a try with this PR, and use a top-k of 10k for the test (at full vocab size it won't help).
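The gap between those two timings makes sense if the sampler pays for ordering the candidate set each token. As a rough sketch (not the actual llama.cpp implementation, which is C++), selecting the top-k candidates with a partial selection is much cheaper than sorting the full vocabulary; the vocab size of 32000 here is just an illustrative stand-in:

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.standard_normal(32000)  # vocab-sized logits, ~32k like Mistral

def top_k_candidates(logits, k):
    # Partial selection: O(n) to find the k largest, then O(k log k)
    # to order just those k. A full sort of all 32000 entries every
    # token would be O(n log n) instead.
    idx = np.argpartition(logits, -k)[-k:]
    return idx[np.argsort(logits[idx])[::-1]]  # descending by logit

small = top_k_candidates(logits, 32)      # cheap: orders only 32 entries
large = top_k_candidates(logits, 32000)   # degenerates to a full sort
```

With k equal to the vocab size the partial selection degenerates into sorting everything, which is consistent with the ~16x slowdown reported above.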
-
What is the purpose of using a top-k of anything more than, say, ~40?
-
It looks like there's still some low-hanging fruit in the unoptimized sampler logic in llama.cpp.
30% of the time is spent sampling for q8_0 Mistral 7B when generating 1024 tokens! (That is, if you use top-k=0 or top-k=32000 to avoid restricting the initial candidate set.)
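To see why an unrestricted candidate set is costly: with top-k disabled (k=0) or set to the vocab size, every generated token still requires a softmax and a draw over all ~32000 candidates. A minimal sketch of that per-token sampling step (illustrative Python, not the llama.cpp code; the temperature value is an arbitrary example):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = 32000  # Mistral 7B vocabulary size

def sample_token(logits, temperature=0.8):
    # With no top-k truncation, every step does a full softmax and a
    # categorical draw over the whole vocabulary, so sampler cost
    # scales with vocab size rather than with a small candidate set.
    z = logits / temperature
    z -= z.max()            # subtract max for numerical stability
    p = np.exp(z)
    p /= p.sum()
    return rng.choice(len(logits), p=p)

tok = sample_token(rng.standard_normal(vocab))
```

Repeated 1024 times per generation, this full-vocab pass is where the reported 30% of wall time plausibly goes.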