Hello! I think I am missing something. I am trying to test out llamacpp but I am getting very long delays before getting a response.
I have tried a few different things and I read through the docs, but I think I might be missing something.
I am currently using podman with an NVIDIA Quadro P6000, and I am using:
ghcr.io/ggerganov/llama.cpp:server-cuda
I have tried with and without flash attn, but it doesn't seem to affect anything. As a comparison, I tried koboldcpp, which seems to be pretty close to the llamacpp implementation.
Using koboldcpp I have a fairly similar setup... (I have tried within a container and directly on my host to make sure that wasn't a variable...)
Generally with llamacpp, I am experiencing about 30-40 seconds before I start getting a streaming reply. (I am using chat completion FWIW...)
With kobold I am seeing about 1-6 seconds before getting a response. For llamacpp I have tried several different versions, and I pulled the latest llama.cpp:server-cuda to make sure the container was at least up to date.
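In case it matters, here is roughly how I am measuring the delay before the first streamed token (a minimal sketch against the server's OpenAI-compatible /v1/chat/completions endpoint; the host, port, model name, and prompt are placeholders for my local setup):

```python
# Sketch: time how long it takes for the first streamed token to arrive from
# the llama.cpp server's OpenAI-compatible chat completions endpoint.
import json
import time

import requests

URL = "http://localhost:8080/v1/chat/completions"  # placeholder host/port

payload = {
    "model": "local-model",  # placeholder name for the loaded model
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "stream": True,
}

start = time.time()
with requests.post(URL, json=payload, stream=True, timeout=300) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        # First chunk with actual content marks the end of the initial delay
        if chunk["choices"][0]["delta"].get("content"):
            print(f"time to first token: {time.time() - start:.1f}s")
            break
```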
So I am just curious if there is something stupid I am missing. Apologies if this is a dumb question, I know just enough to know that I do not know :D