Replies: 1 comment 1 reply
- There is already the argument …
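If the truncated flag is llama.cpp's `--threads-batch`, the two phases can already be tuned independently. A minimal sketch, assuming the `-t`/`-tb` options from llama.cpp's common CLI arguments, the `llama-cli` binary name (formerly `main`), and a placeholder model path:

```sh
# -t  / --threads        : threads used for token generation
# -tb / --threads-batch  : threads used for prompt/batch processing
#                          (falls back to --threads when not set)
./llama-cli -m models/model.gguf -p "Hello" -t 4 -tb 2
```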
- It seems that, depending on the system, the optimal thread count might not be the same for prompt eval and token prediction. For example, here are the `llama-bench` results for my old 4-core laptop:

  *(benchmark results omitted)*

  As we can see, the fastest prompt eval seems to be achieved with only 2 threads, whereas the fastest token generation is with 4 threads. So I was wondering if it would be feasible to have two variants of the `--threads` argument, to optimize for speed on each system?

  I haven't looked into the details of the implementation, so maybe it is required to use the same thread count for eval and generation, but if that's not the case, I think it would be a nice improvement.
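For measuring the two phases separately, `llama-bench` accepts a comma-separated list of values for `-t` and reports prompt-processing (pp) and token-generation (tg) throughput for each, so a sweep along these lines (placeholder model path) should surface the per-phase optimum:

```sh
# Run the pp and tg tests at 1, 2, and 4 threads;
# compare the pp rows to choose -tb and the tg rows to choose -t.
./llama-bench -m models/model.gguf -t 1,2,4
```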