Hello! I think I am missing something. I am trying to test out llamacpp but I am getting very long delays before getting a response.
I have tried a few different things and I read through the docs, but I think I might be missing something.
I am currently using podman with an NVIDIA Quadro P6000, and I am using:
ghcr.io/ggerganov/llama.cpp:server-cuda
I have tried with and without flash attn, but it doesn't seem to affect anything. As a comparison, I tried koboldcpp, which seems to be pretty close to the llamacpp implementation.
Using koboldcpp I have a fairly similar setup... (I have tried within a container and directly on my host to make sure that wasn't a variable...)
Generally with llamacpp, I am experiencing about 30-40 seconds before I start getting a streaming reply. (I am using chat completion FWIW...)
With kobold I am seeing about 1-6 seconds before getting a response. For llamacpp I have tried several different versions, and I pulled the latest llama.cpp:server-cuda to make sure the container was at least up to date.
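In case it matters, here is roughly how I am measuring the delay before the first streamed token (a minimal sketch against the server's OpenAI-compatible /v1/chat/completions endpoint; the host, port, model name, and prompt are placeholders for my local setup):

```python
# Sketch: time how long it takes for the first streamed token to arrive from
# the llama.cpp server's OpenAI-compatible chat completions endpoint.
import json
import time

import requests

URL = "http://localhost:8080/v1/chat/completions"  # placeholder host/port

payload = {
    "model": "local-model",  # placeholder name for the loaded model
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "stream": True,
}

start = time.time()
with requests.post(URL, json=payload, stream=True, timeout=300) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        # First chunk with actual content marks the end of the initial delay
        if chunk["choices"][0]["delta"].get("content"):
            print(f"time to first token: {time.time() - start:.1f}s")
            break
```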
So I am just curious if there is something stupid I am missing. Apologies if this is a dumb question, I know just enough to know that I do not know :D