I know generation speed should slow down as the context fills up, since LLMs are autoregressive and each new token has to attend over the growing KV cache. However, should the drop in speed be as severe as what I am experiencing? I can't imagine running models at 32k or longer context sizes if the slowdown is already this substantial below 8k.
I am running llama.cpp (pulled 3 days ago) on my 7900 XTX with the following command:
llama-server --port 8999 -m /models/Qwen2.5-Coder-32B-Instruct-Q4_K_S.gguf -ngl 999 --ctx-size 8192 -fa
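
One way to quantify the slowdown (a sketch, not from the original post; the exact flags are assumed from a recent llama.cpp build, so check llama-bench --help on yours) is to let llama-bench generate after prompts of increasing length and compare the reported tokens/s:

# Hypothetical benchmark: each -pg pair runs a prompt pass of the given
# length followed by a 128-token generation pass, so the generation t/s
# shows how much speed drops as the KV cache fills.
llama-bench -m /models/Qwen2.5-Coder-32B-Instruct-Q4_K_S.gguf \
  -ngl 999 -fa 1 \
  -pg 512,128 -pg 4096,128 -pg 7680,128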
Replies: 1 comment

What backend? And yes, generation speed should drop significantly.
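
On the backend question: for a 7900 XTX it matters whether the binary was built for ROCm/HIP, Vulkan, or CPU only. As a rough sketch (output wording varies between builds, and --list-devices is only present in newer ones), the version output and the device list printed by the binary show which backend is compiled in:

# Print build info (commit, compiler/target) for the binary in use.
llama-server --version
# Newer builds can also list the GPU devices/backends they detect.
llama-server --list-devices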