Here's the prompt:
Here's the command:
Here's the result:

Question: why is llama.cpp so slow at parsing a (not so) large prompt? I have tried using --mlock, but it makes no difference whatsoever. Is there anything I did wrong?

My system: Mac mini M2 Pro, 16 GB

TIA

Edit: I think the slowness is in the prompt eval time. I just found out that even with a simple prompt, maybe 100-ish tokens, there is a delay, e.g. it will sometimes wait before `User:` appears.
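For reference, that pause can be measured directly: `main` prints a timing summary on exit, and its `prompt eval time` line covers exactly this initial delay before the first generated token. A minimal sketch of such a run, with placeholder model and prompt paths:

```sh
# Placeholder paths; -c sets the context size, -n the tokens to generate,
# -t the thread count. --mlock can be added as in the original run.
./main -m ./models/7B/ggml-model-q4_0.bin -f prompt.txt -c 512 -n 128 -t 8

# On exit, main prints "llama_print_timings: ..." lines; the
# "prompt eval time" entry is the time spent ingesting the prompt.
```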
Replies: 2 comments 1 reply

-
Those numbers seem normal for CPU. But I can't tell whether you're using OpenBLAS/Accelerate, which would speed up prompt evaluation. A smaller context will also be faster. I also think you may be a little too close to running out of memory.
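A rough sketch of that advice, assuming a 2023-era llama.cpp Makefile build (the model path is a placeholder):

```sh
# On macOS the Makefile links Apple's Accelerate framework by default
# (LLAMA_NO_ACCELERATE=1 disables it), so a clean rebuild picks it up:
make clean && make

# On other platforms, OpenBLAS can be enabled explicitly:
# make clean && LLAMA_OPENBLAS=1 make

# A smaller context window (-c) also reduces prompt-eval work:
./main -m ./models/7B/ggml-model-q4_0.bin -p "Hello" -c 512 -n 64
```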
-
@SlyEcho, I read that the problem is in the 4-bit quantization. Memory-wise it's working fine, only about 10 GB used. Do you use the GPU for llama.cpp?
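For the GPU side on Apple Silicon, a hedged sketch, assuming a llama.cpp checkout recent enough to include the Metal backend (model path again a placeholder):

```sh
# Build with the Metal backend for Apple Silicon GPUs:
make clean && LLAMA_METAL=1 make

# Offload the model to the GPU at run time with -ngl (--n-gpu-layers):
./main -m ./models/7B/ggml-model-q4_0.bin -p "Hello" -ngl 1 -c 512 -n 64
```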