I know generation speed should slow down as the context fills up, since LLMs are autoregressive and each new token has to attend over the growing KV cache. However, should the drop in speed be as severe as what I am experiencing? I can't imagine running models at 32k or longer context sizes if the slowdown is already this substantial below 8k.
I am running llama.cpp (pulled 3 days ago) on my 7900 XTX with the following command:
llama-server --port 8999 -m /models/Qwen2.5-Coder-32B-Instruct-Q4_K_S.gguf -ngl 999 --ctx-size 8192 -fa
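
One way to quantify the slowdown (a sketch, not from the original post; the exact flags are assumed from a recent llama.cpp build, so check llama-bench --help on yours) is to let llama-bench generate after prompts of increasing length and compare the reported tokens/s:

# Hypothetical benchmark: each -pg pair runs a prompt pass of the given
# length followed by a 128-token generation pass, so the generation t/s
# shows how much speed drops as the KV cache fills.
llama-bench -m /models/Qwen2.5-Coder-32B-Instruct-Q4_K_S.gguf \
  -ngl 999 -fa 1 \
  -pg 512,128 -pg 4096,128 -pg 7680,128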
Replies: 1 comment

What backend? And yes, generation speed should drop significantly.
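
On the backend question: for a 7900 XTX it matters whether the binary was built for ROCm/HIP, Vulkan, or CPU only. As a rough sketch (output wording varies between builds, and --list-devices is only present in newer ones), the version output and the device list printed by the binary show which backend is compiled in:

# Print build info (commit, compiler/target) for the binary in use.
llama-server --version
# Newer builds can also list the GPU devices/backends they detect.
llama-server --list-devices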