Replies: 1 comment 1 reply
- There is already the argument …
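If the truncated flag is llama.cpp's `--threads-batch`, the two phases can already be tuned independently. A minimal sketch, assuming the `-t`/`-tb` options from llama.cpp's common CLI arguments, the `llama-cli` binary name (formerly `main`), and a placeholder model path:

```sh
# -t  / --threads        : threads used for token generation
# -tb / --threads-batch  : threads used for prompt/batch processing
#                          (falls back to --threads when not set)
./llama-cli -m models/model.gguf -p "Hello" -t 4 -tb 2
```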
- It seems that, depending on the system, the optimal thread count might not be the same for prompt eval and token prediction. For example, here are the `llama-bench` results for my old 4-core laptop:

  *(benchmark results omitted)*

  As we can see, the fastest prompt eval seems to be achieved with only 2 threads, whereas the fastest token generation is with 4 threads. So I was wondering if it would be feasible to have two variants of the `--threads` argument, to optimize for speed on each system?

  I haven't looked into the details of the implementation, so maybe it is required to use the same thread count for eval and generation, but if that's not the case, I think it would be a nice improvement.
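For measuring the two phases separately, `llama-bench` accepts a comma-separated list of values for `-t` and reports prompt-processing (pp) and token-generation (tg) throughput for each, so a sweep along these lines (placeholder model path) should surface the per-phase optimum:

```sh
# Run the pp and tg tests at 1, 2, and 4 threads;
# compare the pp rows to choose -tb and the tg rows to choose -t.
./llama-bench -m models/model.gguf -t 1,2,4
```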