Replies: 1 comment
I think this was discussed in the past, but I can't recall what the conclusion was. I suppose that more logits require more memory to be copied from the GPU, so it's normal for this to cause a slowdown. Though I am not sure the current implementation is optimal. I will likely revisit this logic soon within the context of #11213.
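To illustrate the point about copying: in llama-cpp-python (a common way to pass `logits=True`-style options to a GGUF model), the `logits_all` constructor flag controls whether logits are kept for every position or only the last one. The sketch below is a hedged example, not a definitive recipe — the model path is a placeholder, and it assumes you only need next-token logits, which avoids the extra per-position GPU-to-host copies described above.

```python
# Minimal sketch using llama-cpp-python (pip install llama-cpp-python).
# "./model.gguf" is a placeholder path; any local GGUF model should work.
from llama_cpp import Llama

# logits_all=False (the default) keeps logits only for the last token of
# each eval, so far less data is copied back from the GPU. Setting
# logits_all=True retains logits for every position, which is what makes
# generation roughly twice as slow in the scenario described.
llm = Llama(model_path="./model.gguf", logits_all=False, verbose=False)

tokens = llm.tokenize(b"Hello, world")
llm.eval(tokens)

# llm.scores holds the retained logits; with logits_all=False only the
# row for the final evaluated position is meaningful.
last_logits = llm.scores[llm.n_tokens - 1]
print(last_logits.shape)  # one row of size n_vocab
```

If you genuinely need logits for every prompt token (e.g. for per-token log-probabilities), the extra copy is unavoidable, but you can keep the penalty confined to the prompt pass and still sample generation normally.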
0 replies
Hi everyone,
I'm experimenting with llama.cpp for a project, and I'd like to get the logits from a GGUF model. However, when I pass logits=True, generation takes almost double the time compared to generating only the tokens. How can I optimize this?
If anyone could provide suggestions or guidance to optimize this process or retrieve logits efficiently, I’d greatly appreciate it!