-
You could try running it with Nsight Systems to see where it is spending all this time. Does this also happen with …
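For reference, a minimal Nsight Systems invocation might look like the following (the binary and model path are illustrative, not from the original post):

    nsys profile --stats=true -o llama-report ./server -m models/model.gguf
    nsys stats llama-report.nsys-rep

The stats summary breaks the run down into CUDA kernel time versus host-side time, which should show where the slowdown lives.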
-
llm_load_tensors: offloading 0 repeating layers to GPU
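That line points at the likely cause: with 0 layers offloaded, inference runs entirely on the CPU even in a cuBLAS build, since llama.cpp defaults to offloading no layers. If that is what is happening here, passing --n-gpu-layers (-ngl) at launch should fix it. A sketch, with a hypothetical model path:

    ./server -m models/model.gguf -ngl 99

Requesting more layers than the model has simply offloads them all, and an A100 with 80 GB has room for every layer of most models.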
-
Thanks for the help. Is CLIP encoding accelerated on Metal systems?
-
Hi,
I've built llama.cpp with make LLAMA_CUBLAS=1. I'm using server and seeing incredibly slow performance that makes me suspect something is amiss. I'm running on an A100 with 80 GB of RAM (Runpod.io). The logs seem to indicate that the GPU is being utilized, yet prompt eval time is 0.02 tokens/second.
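A plausible reconstruction of that setup (the model path and network options are hypothetical):

    make LLAMA_CUBLAS=1
    ./server -m models/model.gguf --host 0.0.0.0 --port 8080

Note that this launch passes no --n-gpu-layers flag, which matches the log line quoted in the replies above.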
Any help debugging this or understanding the timing breakdown and why this system isn't performing would be very helpful. On my M1 MacBook w/ 32GB RAM, it absolutely screams, but it's using an entirely different backend (Metal).
Thank you,
Bart