Token latency decreases from 6 s/token to 130 ms/token with -ngl set to 1 on Mac M1 Metal GPU #10638
Solved: the magic is the shared (unified) memory managed by Metal.
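For context: on Apple silicon the CPU and GPU share the same physical memory, so a Metal buffer allocated in shared storage mode is visible to both without any host-to-device copy. Below is a minimal, self-contained Swift sketch (not llama.cpp code; the buffer size and values are purely illustrative) of that zero-copy behaviour:

```swift
import Metal

// Minimal sketch (illustrative, not llama.cpp code): on Apple silicon,
// a buffer created with .storageModeShared lives in unified memory that
// both the CPU and the GPU can access without an explicit copy.
guard let device = MTLCreateSystemDefaultDevice() else {
    fatalError("No Metal device available")
}

let count = 1024
let byteLength = count * MemoryLayout<Float>.stride

// Allocate the buffer directly in shared (unified) memory.
guard let buffer = device.makeBuffer(length: byteLength, options: .storageModeShared) else {
    fatalError("Buffer allocation failed")
}

// The CPU writes straight into the same physical pages a GPU kernel would read;
// there is no equivalent of the host-to-device copy a discrete GPU requires.
let floats = buffer.contents().bindMemory(to: Float.self, capacity: count)
for i in 0..<count {
    floats[i] = Float(i)
}

print("CPU view of buffer[42] =", floats[42],
      "- a GPU kernel bound to this buffer sees the same bytes, zero copies")
```

A discrete CUDA GPU, by contrast, has its own VRAM and must receive data over PCIe before it can compute on it, which is presumably why the -ngl behaviour looks so different between the two backends.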
Dear all,
I am facing a problem that I don't understand; thank you in advance for your kind help.
My laptop is a Mac M1 with 8 GB of physical memory (4 E-cores, 4 P-cores). The llama.cpp commit I used is 6374743. I am running Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf on llama.cpp compiled with the Metal GPU backend.
It really seems like magic that offloading only 1 layer from the CPU to the GPU causes a roughly 98% decrease in token latency on the Metal GPU.
However, things go differently on a CUDA GPU, where the results look reasonable. On the Mac M1 with the Metal GPU, by contrast, setting -ngl to 1 is like pressing a "SPEED MODE" button that suddenly makes the CPUs perform extremely well.
I use `powermetrics --samplers cpu_power -i 1000` to monitor CPU usage. The P-cores show a heavy load, so the P-cores are being used by default.
I am now very confused about what this "SPEED MODE" button actually is: why does setting -ngl 1 speed up inference from 6 s/token to 130 ms/token?
Thank you!