Using Llama.cpp on Mac is such a pain because the prompt evaluation time is incredibly slow.
Look at these three outputs:
CPU (Ryzen 5600) only on Linux with 13b q6k:
llama_print_timings: prompt eval time = 1146.71 ms / 111 tokens ( 10.33 ms per token, 96.80 tokens per second)
llama_print_timings: eval time = 35998.84 ms / 127 runs ( 283.46 ms per token, 3.53 tokens per second)
GPU (4090) on Linux with 13b q6k:
llama_print_timings: prompt eval time = 225.93 ms / 111 tokens ( 2.04 ms per token, 491.30 tokens per second)
llama_print_timings: eval time = 1895.24 ms / 127 runs ( 14.92 ms per token, 67.01 tokens per second)
M1 with 13b q4km (GPU or CPU speed is almost the same on M1):
llama_print_timings: prompt eval time = 13767.84 ms / 111 tokens ( 124.03 ms per token, 8.06 tokens per second)
llama_print_timings: eval time = 22384.10 ms / 127 runs ( 176.25 ms per token, 5.67 tokens per second)
On Linux, prompt evaluation runs 27.4 times faster than token generation on the CPU (96.80 vs. 3.53 tokens per second) and 7.3 times faster on the GPU (491.30 vs. 67.01). On the Mac, however, it is only 1.4 times faster (8.06 vs. 5.67). This discrepancy doesn't make sense.
Additionally, Ryzen 5600 on Linux is reported to be 12 times faster than the CPU of M1 in terms of prompt evaluation, which is also puzzling. From what I understand, these two CPUs have similar performance in various benchmarks.
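For anyone who wants to double-check those ratios, here is a rough Python sketch that pulls the "tokens per second" figure out of the llama_print_timings lines quoted above and divides prompt speed by generation speed. The helper name `tokens_per_second` and the `RUNS` dictionary layout are my own, not anything from llama.cpp, and the regex assumes the log format shown in this post.

```python
import re

# Timing lines copied from the three runs above: (prompt eval line, eval line).
RUNS = {
    "Ryzen 5600 CPU": (
        "llama_print_timings: prompt eval time = 1146.71 ms / 111 tokens ( 10.33 ms per token, 96.80 tokens per second)",
        "llama_print_timings: eval time = 35998.84 ms / 127 runs ( 283.46 ms per token, 3.53 tokens per second)",
    ),
    "4090 GPU": (
        "llama_print_timings: prompt eval time = 225.93 ms / 111 tokens ( 2.04 ms per token, 491.30 tokens per second)",
        "llama_print_timings: eval time = 1895.24 ms / 127 runs ( 14.92 ms per token, 67.01 tokens per second)",
    ),
    "M1": (
        "llama_print_timings: prompt eval time = 13767.84 ms / 111 tokens ( 124.03 ms per token, 8.06 tokens per second)",
        "llama_print_timings: eval time = 22384.10 ms / 127 runs ( 176.25 ms per token, 5.67 tokens per second)",
    ),
}


def tokens_per_second(line: str) -> float:
    """Extract the 'N tokens per second' figure from one timing line."""
    match = re.search(r"([\d.]+)\s+tokens per second", line)
    if match is None:
        raise ValueError(f"no tokens-per-second figure in: {line!r}")
    return float(match.group(1))


# Parse each run into (prompt speed, generation speed) in tokens per second.
speeds = {
    name: (tokens_per_second(prompt), tokens_per_second(generation))
    for name, (prompt, generation) in RUNS.items()
}

# How many times faster prompt evaluation is than generation on each machine.
ratios = {name: prompt / generation for name, (prompt, generation) in speeds.items()}

# Prompt evaluation speed of the Ryzen relative to the M1.
ryzen_vs_m1 = speeds["Ryzen 5600 CPU"][0] / speeds["M1"][0]

for name, ratio in ratios.items():
    print(f"{name}: prompt eval is {ratio:.1f}x faster than generation")
print(f"Ryzen 5600 vs M1 prompt eval: {ryzen_vs_m1:.1f}x")
```

Keep in mind the M1 run uses a different quantization (q4km) than the two Linux runs (q6k), so the cross-machine number is not a strictly like-for-like comparison.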
Why?