Replies: 2 comments
-
update: I ran a test and measured a sampling time of 1.6 ms/token at top-k 32000 and 0.1 ms/token at top-k 32. Give it a try with this PR, and use a top-k of 10k for the test (at full vocab size it won't help).
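The gap between those two timings makes sense if the sampler pays for ordering the candidate set each token. As a rough sketch (not the actual llama.cpp implementation, which is C++), selecting the top-k candidates with a partial selection is much cheaper than sorting the full vocabulary; the vocab size of 32000 here is just an illustrative stand-in:

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.standard_normal(32000)  # vocab-sized logits, ~32k like Mistral

def top_k_candidates(logits, k):
    # Partial selection: O(n) to find the k largest, then O(k log k)
    # to order just those k. A full sort of all 32000 entries every
    # token would be O(n log n) instead.
    idx = np.argpartition(logits, -k)[-k:]
    return idx[np.argsort(logits[idx])[::-1]]  # descending by logit

small = top_k_candidates(logits, 32)      # cheap: orders only 32 entries
large = top_k_candidates(logits, 32000)   # degenerates to a full sort
```

With k equal to the vocab size the partial selection degenerates into sorting everything, which is consistent with the ~16x slowdown reported above.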
-
What is the purpose of using a top-k of anything more than, say, ~40?
-
It looks like there's still some low-hanging fruit in the unoptimized sampler logic in llama.cpp.
30% of the time is spent sampling for q8_0 Mistral 7B when generating 1024 tokens! (That is, if you use top-k=0 or top-k=32000 to avoid restricting the initial candidate set.)
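To see why an unrestricted candidate set is costly: with top-k disabled (k=0) or set to the vocab size, every generated token still requires a softmax and a draw over all ~32000 candidates. A minimal sketch of that per-token sampling step (illustrative Python, not the llama.cpp code; the temperature value is an arbitrary example):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = 32000  # Mistral 7B vocabulary size

def sample_token(logits, temperature=0.8):
    # With no top-k truncation, every step does a full softmax and a
    # categorical draw over the whole vocabulary, so sampler cost
    # scales with vocab size rather than with a small candidate set.
    z = logits / temperature
    z -= z.max()            # subtract max for numerical stability
    p = np.exp(z)
    p /= p.sum()
    return rng.choice(len(logits), p=p)

tok = sample_token(rng.standard_normal(vocab))
```

Repeated 1024 times per generation, this full-vocab pass is where the reported 30% of wall time plausibly goes.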