Memory bandwidth utilization #3909

artmoskvin · 2023-11-02T12:48:44Z

artmoskvin
Nov 2, 2023

Hi all! I'm trying to understand the current memory bandwidth utilization (MBU) for llama.cpp running on an M2 Max. When running a 7B q4 model, I get around 60 tok/s, which, based on this blog post, corresponds to ~50% MBU. Here's my math for reference:

60 tok/s => 16.7 ms per output token (i.e. TPOT)
7B q4 model => 3.5 GB
3.5 GB per 16.7 ms => 210 GB per sec
210 GB/s out of 400 GB/s (M2 Max memory bandwidth) => 52.5%

Is my math wrong? Or are there any limitations on unified memory usage?
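For anyone who wants to replay the estimate above, here is a minimal sketch in Python. It assumes every weight byte is read exactly once per generated token and uses the rough 3.5 GB model size and the 400 GB/s peak figure from this post; the function name `mbu` is just for illustration.

```python
def mbu(tokens_per_s: float, bytes_per_token: float, peak_bytes_per_s: float) -> float:
    """Fraction of peak bandwidth used, assuming each byte is read once per token."""
    achieved_bytes_per_s = tokens_per_s * bytes_per_token
    return achieved_bytes_per_s / peak_bytes_per_s


tok_s = 60.0          # measured generation speed from the post
model_bytes = 3.5e9   # rough estimate: 7B params at ~4 bits each
peak_bw = 400e9       # M2 Max memory bandwidth in bytes/s, as cited in the post

tpot_ms = 1000.0 / tok_s                    # time per output token
achieved_gb_s = tok_s * model_bytes / 1e9   # bandwidth implied by the token rate
utilization = mbu(tok_s, model_bytes, peak_bw)

print(f"TPOT: {tpot_ms:.1f} ms, achieved: {achieved_gb_s:.0f} GB/s, MBU: {utilization:.1%}")
```

Multiplying tok/s by bytes per token gives the same result as dividing bytes by TPOT, and it avoids the rounding of the per-token time.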

Green-Sky · 2023-11-02T13:49:47Z

Green-Sky
Nov 2, 2023
Collaborator

You also have to take the KV-cache into account.
For the 7B model with a context size of 4096 in f16, we have kv self size = 2048.00 MB.
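The 2048.00 MB figure can be reproduced with a short sketch. It assumes the usual LLaMA 7B shape (32 layers, embedding size 4096, no grouped-query attention) and that the "MB" in the log means 1024² bytes; the helper name `kv_cache_bytes` is hypothetical, not a llama.cpp function.

```python
def kv_cache_bytes(n_layer: int, n_ctx: int, n_embd: int, bytes_per_elem: int = 2) -> int:
    """Size of the K and V caches: two tensors per layer, each n_ctx x n_embd elements."""
    return 2 * n_layer * n_ctx * n_embd * bytes_per_elem


MIB = 1024 ** 2

# Assumed LLaMA 7B shape: 32 layers, embedding size 4096, f16 (2 bytes per element).
size_ctx_4096 = kv_cache_bytes(n_layer=32, n_ctx=4096, n_embd=4096)
size_ctx_512 = kv_cache_bytes(n_layer=32, n_ctx=512, n_embd=4096)

print(f"n_ctx=4096: {size_ctx_4096 / MIB:.2f} MiB")
print(f"n_ctx=512:  {size_ctx_512 / MIB:.2f} MiB")
```

The size scales linearly with the context length, so a shorter context shrinks the cache proportionally.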

6 replies

ggerganov Nov 2, 2023
Maintainer

In addition to that, Q4_0 7B is 3825806912 bytes, which is 3.82 GB (1 GB = 10^9 bytes).

Also, we have some non-negligible overhead in the Metal implementation that we haven't figured out how to overcome. If you make the kernels NOP (i.e. return; on the first line), you will still see some significant time needed to run the computation. I've forgotten the numbers, but last time I looked into this, it seemed like an overhead that cannot be eliminated.

And on top of all this, it is not clear that the GPU alone can utilize the theoretical 400 GB/s bandwidth. AFAIK it is shared with the CPU in some way, where the CPU has some portion of it. But this is just a hypothesis.
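One way to put a rough number on the gap is to compare the measured time per token with the time a pure weight read would take at peak bandwidth. The sketch below is only an illustration: it takes the 60 tok/s and 400 GB/s figures from the original post, and the residual it prints lumps together everything else (kernel overhead, KV-cache reads, bandwidth the GPU cannot reach), so it is not a measurement of the Metal overhead itself.

```python
def ideal_ms_per_token(bytes_per_token: float, peak_bytes_per_s: float) -> float:
    """Time to stream the given bytes once at peak bandwidth, in milliseconds."""
    return bytes_per_token / peak_bytes_per_s * 1000.0


weights_bytes = 3825806912   # Q4_0 7B file size quoted above
peak_bw = 400e9              # theoretical M2 Max bandwidth, bytes/s
measured_ms = 1000.0 / 60.0  # 60 tok/s reported in the original post

ideal_ms = ideal_ms_per_token(weights_bytes, peak_bw)
residual_ms = measured_ms - ideal_ms  # time not explained by reading the weights

print(f"ideal: {ideal_ms:.2f} ms, measured: {measured_ms:.2f} ms, residual: {residual_ms:.2f} ms")
```

If the GPU can only reach some fraction of the 400 GB/s, lowering `peak_bw` shows how much of the residual that alone would account for.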

artmoskvin Nov 2, 2023
Author

Thank you both for replying! The KV-cache is definitely something that I missed, but in my case it was smaller (kv self size = 256.00 MB) because I used the default context n_ctx = 512. Adding this to 3.82 GB gives ~4 GB, which corresponds to 240 GB/s, i.e. 60% MBU versus 52.5% before.
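Without the rounding, the revised estimate looks like this. It is a sketch under two assumptions: the "MB" in the kv self size log line means 1024² bytes, and the whole KV allocation is read for every token. The second assumption makes this an upper bound, since only the positions filled so far actually need to be read.

```python
MIB = 1024 ** 2

weights_bytes = 3825806912   # Q4_0 7B size quoted by ggerganov
kv_bytes = 256 * MIB         # kv self size for n_ctx = 512 (assuming MB = 1024^2 bytes)
tok_s = 60.0                 # measured generation speed
peak_bw = 400e9              # M2 Max memory bandwidth, bytes/s

bytes_per_token = weights_bytes + kv_bytes
achieved_bytes_per_s = tok_s * bytes_per_token
revised_mbu = achieved_bytes_per_s / peak_bw

print(f"{bytes_per_token / 1e9:.2f} GB per token, "
      f"{achieved_bytes_per_s / 1e9:.0f} GB/s, MBU: {revised_mbu:.1%}")
```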

I think there's something else. Some articles (1, 2) mention that each token generation requires 2 * num_model_params FLOPs because each matmul step is (1) a multiply and (2) an add. But I don't believe this results in loading the model parameters twice into the GPU cache/registers, thanks to kernel fusion.
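To separate the compute count from the memory traffic, here is a small sketch. It assumes a round 7e9 parameters and that each weight is read once per token and used for one multiply-add, which is the case for single-sequence generation where the matmuls reduce to matrix-vector products.

```python
n_params = 7e9               # assumed round parameter count for a "7B" model
weights_bytes = 3825806912   # Q4_0 7B size quoted above
tok_s = 60.0                 # measured generation speed

flops_per_token = 2 * n_params               # one multiply and one add per weight
intensity = flops_per_token / weights_bytes  # FLOPs performed per byte read
flops_per_s = flops_per_token * tok_s        # compute rate implied by 60 tok/s

print(f"{flops_per_token / 1e9:.0f} GFLOP per token, "
      f"{intensity:.2f} FLOPs per byte, {flops_per_s / 1e12:.2f} TFLOP/s at {tok_s:.0f} tok/s")
```

The factor of 2 changes the FLOP count only; the bytes moved per token stay the same, so it does not enter the MBU estimate.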

I'm leaning toward the overhead in the Metal implementation or limitations from the OS. Or maybe there's a flaw in my estimates :)

shouyiwang Nov 5, 2023

@ggerganov Which one would you recommend between the M1 Ultra with 48 GPU cores and 128GB RAM, and the M3 Max with 40 GPU cores and 128GB RAM? It appears that the M1 Ultra has double the bandwidth of the M3 Max, but will it actually be faster?

ggerganov Nov 5, 2023
Maintainer

Prefer not to give recommendations. Probably it is best to wait a week or two so we see the M3 results and you can decide based on the numbers.

cduk Mar 30, 2024

Is there now a definitive conclusion on this?
