All About Transformer Inference | How To Scale Your Model #9

jacobaustin123 · 2025-02-03T02:23:25Z

jacobaustin123
Feb 3, 2025
Maintainer

Serving big models!

clu0 · 2025-02-11T05:07:43Z

clu0
Feb 11, 2025 — with giscus

In the first figure for sharding KV comparing MHA and MQA, I think the locations of the reduce-scatter and all-gather are reversed.

0 replies

rohanmahajan1993 · 2025-02-26T03:37:10Z

rohanmahajan1993
Feb 26, 2025 — with giscus

Had question and was trying to understand more.

"For prefill, since we’re doing self-attention, so this simplifies to
This is great because it means the arithmetic intensity of attention during prefill is
Θ(T). That means it’s quite easy to be compute-bound for attention. As long as our batch size and sequence length are both fairly large, we’ll be fine!"

why do we need our batch size also to be fairly large, it seems like the equation is defined by token length; so isn't that sufficient

"Why is this, conceptually? Mainly, we’re compute-bound in linear portions of the model because the parameters (the memory bandwidth-heavy components) are reused for many batch items. However, every batch item has its own KV cache, so a bigger batch size means more KV caches. We will almost always be memory bound here unless the architecture is adjusted aggressively."

another question why the batch size effects this being memory bound; doesn't batch size cancel out here in the equations you're showing; I get the point that t=1 for generation, and since we have to fetch the kv cache stuff for s, we're still doing kind of half the memory work for that but we lose the s*t multiplication for the flopers inference.

2 replies

jacobaustin123 Mar 3, 2025
Maintainer Author

I think this should basically just say "as long as our sequence length is fairly large". I think before this was support to say "if our batch size is large our MLPs are good, and if our sequence length is long, our attention is good".
Could you restate this slightly? The idea is, for our MLPs, we load one matrix for B entries, so the FLOPs / byte loaded increases with batch size (since our activations are small). For attention, as we increase B, we load B times more KV caches, so our intensity doesn't improve.

rohanmahajan1993 Mar 3, 2025 — with giscus

On 2, it was totally my bad. I think initially I misread and thought the text was implying that increasing batch size decreases the intensity of the attention but in reality the point it was that it doesn't change the attention part.

mmcclean-aws · 2025-03-06T19:46:03Z

mmcclean-aws
Mar 6, 2025 — with giscus

I have a question on the Generation section and the calculations around time for HBM loading. It seems to only consider the weight loading time and not also the KV cache loading time as well. Does this need to be added to the calculation ?

3 replies

jacobaustin123 Mar 6, 2025
Maintainer Author

Do you mean under "Distributing Inference Over Multiple Accelerators"? I think in general here we're looking at the matmul vs. ICI comms overlap, which is unrelated to the KV caches. Those will always be memory-bound and more model parallelism will always improve things

mmcclean-aws Mar 6, 2025

Am referring to this formula. The time for HBM comms is only considering model weights and not KV cache loading time. For long sequences the KV cache can be larger than the model weights so should also be taken into consideration

jacobaustin123 Mar 6, 2025
Maintainer Author

Think about what we're trying to calculate here. We want to know how much model parallelism we can do before ici communication during our matmuls takes longer than weight loading. At that point, we hit a limit and more model parallelism won't reduce the time our matmuls take. It will reduce the time KV loading takes, so if all you care about is latency, you can do even more.

So I guess the point is, this is the rule for how much model parallelism you can do before you start hitting diminishing returns on throughput at a fixed batch size. But you can go beyond it if you're willing to take a throughpout hit.

Lhongpei · 2025-03-07T04:08:30Z

Lhongpei
Mar 7, 2025 — with giscus

About the part of "exact number will differ on the kind of quantization and hardware".If we use int8 to store parameters but bf16 for computation, why is the result of critical 120？ According to the following formulas:

$$ \frac{2BDF}{2BD + 2DF + 2BF} \geq \frac{\text{TPU FLOPs/s}}{\text{HBM Bandwidth}} = \frac{1.97E+14}{8.20E+11} = 240 $$

For int8, the communication will decrease to half, then the critical B will be 480?

1 reply

jacobaustin123 Mar 7, 2025
Maintainer Author

Not quite. If we simplify, we end up with 2B > 240, or B > 120.

aasthavar · 2025-03-24T13:22:14Z

aasthavar
Mar 24, 2025 — with giscus

For the formula:

FFW params = d_model^2 * ffw_multiplier * 3 * n_layers

I am unsure where the *3* came from? Shouldn't it be just *2*?

For one layer, given:

Input Projection:
- W_in.shape = [d_model, d_mlp]
- b_in.shape = [d_mlp]
GeLU
- no weights
Output Projection:
- W_out.shape = [d_mlp, d_model]
- b_out.shape = [d_model]

Total params for one layer:

params = (d_model * d_mlp) + (d_mlp * d_model) 
       = (d_model^2) * ffw_multiplier * 2

Total params across all layers:

total_params = params * n_layers 
             = (d_model^2) * ffw_multiplier * 2 * n_layers

1 reply

jacobaustin123 Mar 24, 2025
Maintainer Author

So this is a little confusing in this book because we assume what you describe in Section 5 for simplicity.

However, GeGeLU, the gated variant described in https://arxiv.org/abs/2002.05202v1 has two up-projection matrices whose output is pointwise multiplied. There’s a note on this in Section 4.

aasthavar · 2025-03-24T13:32:08Z

aasthavar
Mar 24, 2025 — with giscus

Shouldn't there be positional embedding matrix (of shape = [max_seq_len, d_model]) as well, along with ffw+vocab+attn params?

1 reply

jacobaustin123 Mar 24, 2025
Maintainer Author

Sometimes. Many models today use some kind of parameter free positional embedding, or where the number of params is very small, eg ROPE, T5-style relative positional embedding, etc.

Either way, the number is so small it essentially doesn’t matter.

VijayGKR · 2025-05-07T20:25:39Z

VijayGKR
May 7, 2025 — with giscus

For the newly defined model we're using in the worked problems, should the number of heads be 16 instead of 32, or qk dim be 128 instead of 256? Right now, d_qkv * n_heads is 8192, instead of matching d_model at 4096.

1 reply

jacobaustin123 May 7, 2025
Maintainer Author

Probably yes, maybe worth updating. Still, it's not a strict rule that d_qkv * n_heads = d_model. The main thing that matters is the number n_kv heads

All About Transformer Inference | How To Scale Your Model #9

Uh oh!

jacobaustin123 Feb 3, 2025 Maintainer

Replies: 7 comments · 9 replies

Uh oh!

clu0 Feb 11, 2025 — with giscus

Uh oh!

rohanmahajan1993 Feb 26, 2025 — with giscus

Uh oh!

jacobaustin123 Mar 3, 2025 Maintainer Author

Uh oh!

rohanmahajan1993 Mar 3, 2025 — with giscus

Uh oh!

mmcclean-aws Mar 6, 2025 — with giscus

Uh oh!

jacobaustin123 Mar 6, 2025 Maintainer Author

Uh oh!

mmcclean-aws Mar 6, 2025

Uh oh!

jacobaustin123 Mar 6, 2025 Maintainer Author

Uh oh!

Lhongpei Mar 7, 2025 — with giscus

Uh oh!

jacobaustin123 Mar 7, 2025 Maintainer Author

Uh oh!

aasthavar Mar 24, 2025 — with giscus

Uh oh!

jacobaustin123 Mar 24, 2025 Maintainer Author

Uh oh!

aasthavar Mar 24, 2025 — with giscus

Uh oh!

jacobaustin123 Mar 24, 2025 Maintainer Author

Uh oh!

VijayGKR May 7, 2025 — with giscus

Uh oh!

jacobaustin123 May 7, 2025 Maintainer Author

jacobaustin123
Feb 3, 2025
Maintainer

Replies: 7 comments 9 replies

clu0
Feb 11, 2025 — with giscus

rohanmahajan1993
Feb 26, 2025 — with giscus

jacobaustin123 Mar 3, 2025
Maintainer Author

mmcclean-aws
Mar 6, 2025 — with giscus

jacobaustin123 Mar 6, 2025
Maintainer Author

jacobaustin123 Mar 6, 2025
Maintainer Author

Lhongpei
Mar 7, 2025 — with giscus

jacobaustin123 Mar 7, 2025
Maintainer Author

aasthavar
Mar 24, 2025 — with giscus

jacobaustin123 Mar 24, 2025
Maintainer Author

aasthavar
Mar 24, 2025 — with giscus

jacobaustin123 Mar 24, 2025
Maintainer Author

VijayGKR
May 7, 2025 — with giscus

jacobaustin123 May 7, 2025
Maintainer Author