Can we use both the CPU and GPU(and may the NE) on unified memory systems(Mac)? #3083

JRZS · 2023-09-08T15:57:29Z

JRZS
Sep 8, 2023

I'm running 8-bit quantized Llama 2 and have a 99% utilized GPU, 12 performance cores idle, as well as an idle neural engine. Could we use the existing code for dividing work up on the CPU and GPU concurrently? Could the links below allow use of the Neural Engine too?
https://developer.apple.com/library/archive/documentation/Performance/Conceptual/vDSP_Programming_Guide/Introduction/Introduction.html
https://developer.apple.com/documentation/accelerate/veclib

I'd like to squeeze all the fixed-point operations per second I can out of my M2, and it seems we have the ability to run on the CPU and the GPU, but no code path to handle both, or with the Neural Engine, all. How challenging is this to do?

JRZS · 2023-09-08T20:00:55Z

JRZS
Sep 8, 2023
Author

Here's what the utilization graphs look like:

If it's feasible to hand chunks of the compute to not only GPU but also CPU workers, we could potentially get better utilization. I'm not familiar with where in the code this change would take place, but am happy to help there.

4 replies

KerfuffleV2 Sep 8, 2023
Collaborator

I might be off base here, but isn't that what -ngl is for? You can control how many layers are handled by the CPU vs how many are handled by the GPU.

Since LLM evaluation tends to be memory bandwidth limited and Mac uses unified memory there might not really be an advantage though. Once you're saturating your memory bandwidth that's probably just going to set the limit for performance.

JRZS Sep 8, 2023
Author

Thanks. :) I don't know that I'm hitting memory bandwidth limitations yet. So if I have 38 GPU cores and 12 Performant CPU cores, besides setting n_gpu_layers to 38, what setting tells it to also run on 12 CPU cores?

KerfuffleV2 Sep 8, 2023
Collaborator

Generally you'd use as high a -ngl as you have available VRAM for (or whatever the Mac equivalent is). It seem like you might be thinking assign a core to each layer and they can run in parallel? It doesn't really work like that.

Processing the prompt can happen in parallel (but not structured like a core per layer). However when actually generating tokens each layer depends on the one before it so you can only process layer 2 after you have the result from layer 1 and so on. So a different approach is required to allow parallelizing the calculation, like splitting up the tensors, etc. (I don't know the exact details.)

Anyway, -t sets the number of CPU threads, -ngl sets how many layers to offload to the GPU and the "threading" part there gets handled automatically. If you're already offloading everything to the GPU (you didn't mention which model you're using so I'm not sure how much of it 38 layers accounts for) then setting the threads to a high value is probably actually going to hurt performance.

On my system (AMD x86) I have 8 real cores but setting threads higher than 5-6 even when not offloading hurts performance. That's because it's more of a memory bandwidth bottleneck rather than computing stuff.

JRZS Sep 9, 2023
Author

@KerfuffleV2 Thanks, I'm not saying the cores should each get a layer (dependent calculations wouldn't allow a speed up), I'm asking if there's a path to have both the CPU and the GPU (plus the NE if possible) cores all used when doing the tensor math for a layer. Unlike other processor architectures, the apple silicon has unified memory with the GPU cores on the same chip as the CPU cores, so they should be able to read/write to the same RAM, and CPU's aren't terrible at 8bit math.

I'm running 8-bit quantized 70B param Llama 2 and have an M2 Max (4 efficiency cores, 12 performance cores, and 38 GPU cores) with 96GB. I'd like to use both the GPU and CPU cores, together, but when I use -ngl &-t, I only see the GPUs in use (see the image above, light blue is full GPU utilization).

ianscrivener · 2023-09-08T22:41:51Z

ianscrivener
Sep 8, 2023

README.md#metal-build documentation indicates that for MacOS that -ngl is just a True/False.. NOT a count of layers...

"When built with Metal support, you can explicitly disable GPU inference with the --gpu-layers|-ngl 0 command-line argument."

4 replies

KerfuffleV2 Sep 8, 2023
Collaborator

So this would be effectively like always offloading all layers and one would want to set threads to 1 or some low value. Correct?

ianscrivener Sep 9, 2023

Unsure - I'm just echoing the official documentation.

Personally, I get best performance with 7B & 13B models on 16Gb M2 using -ngl 1 -t 8... which uses the 4 economy CPUs,8 CPUs and Metal GPU.

JRZS Sep 9, 2023
Author

@ianscrivener when you run the activity monitor, and look at GPU and CPU utilization while running the 7B or 13B are you seeing the CPU running with the GPU?

See my image above, I only ever get GPU usage, and that's with ngl=1 t=16 and ngl=38 t=16 on an M2Max with 4 efficiency cores, 12 performance cores, and 38 GPU cores.

ianscrivener Sep 9, 2023

@JRZS, yes both CPU and GPU are used..
almost seems like CPU at beginning and end, and GPU in the middle... though it all happens so quick (<2 second)

okpatil4u · 2023-09-09T09:58:05Z

okpatil4u
Sep 9, 2023

In the case of speculative sampling, would it be possible to offload the larger model to GPU while the smaller model(s) are utilising CPUs ?

…

— Reply to this email directly, view it on GitHub <#3083 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAXGU4BACEHGRAKP2OXKF6LXZQD27ANCNFSM6AAAAAA4QSVWPA> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.*** com>

0 replies

ggerganov · 2023-09-09T10:04:37Z

ggerganov
Sep 9, 2023
Maintainer

This is not supported on Mac and it is very unlikely to bring any benefit even if it was supported.

Best thing you can try is to run a large LLM on the GPU (i.e. -ngl 1) and in parallel a small LLM on the CPU (i.e. -ngl 0). I haven't tried this and it is very likely again to not bring significant speedup, but it's something that has to be tested. (see @okpatil4u comment above)

The main reason is that the memory bandwidth of the chip is shared between the CPU and GPU (AFAIK), so if you already saturated it with the GPU, then the CPU won't help

2 replies

JRZS Sep 9, 2023
Author

Thanks, and agree if memory bandwidth is topped out we're done! :) Out of curiosity, do you know how to get perf info on the memory bandwidth utilization?

PhilippeFerreiraDeSousa Sep 16, 2023

For training we could use CPU and GPU though

Can we use both the CPU and GPU(and may the NE) on unified memory systems(Mac)? #3083

Uh oh!

JRZS Sep 8, 2023

Replies: 4 comments · 10 replies

Uh oh!

JRZS Sep 8, 2023 Author

Uh oh!

KerfuffleV2 Sep 8, 2023 Collaborator

Uh oh!

JRZS Sep 8, 2023 Author

Uh oh!

KerfuffleV2 Sep 8, 2023 Collaborator

Uh oh!

Uh oh!

JRZS Sep 9, 2023 Author

Uh oh!

Uh oh!

ianscrivener Sep 8, 2023

Uh oh!

KerfuffleV2 Sep 8, 2023 Collaborator

Uh oh!

ianscrivener Sep 9, 2023

Uh oh!

Uh oh!

JRZS Sep 9, 2023 Author

Uh oh!

ianscrivener Sep 9, 2023

Uh oh!

okpatil4u Sep 9, 2023

Uh oh!

ggerganov Sep 9, 2023 Maintainer

Uh oh!

JRZS Sep 9, 2023 Author

Uh oh!

PhilippeFerreiraDeSousa Sep 16, 2023

JRZS
Sep 8, 2023

Replies: 4 comments 10 replies

JRZS
Sep 8, 2023
Author

KerfuffleV2 Sep 8, 2023
Collaborator

JRZS Sep 8, 2023
Author

KerfuffleV2 Sep 8, 2023
Collaborator

JRZS Sep 9, 2023
Author

ianscrivener
Sep 8, 2023

KerfuffleV2 Sep 8, 2023
Collaborator

JRZS Sep 9, 2023
Author

okpatil4u
Sep 9, 2023

ggerganov
Sep 9, 2023
Maintainer

JRZS Sep 9, 2023
Author