All the Transformer Math You Need to Know | How To Scale Your Model #6
Replies: 9 comments · 11 replies
-
A scant few minutes ago, the MathJax for this chapter's introduction was not rendering, but it appears it was fixed live 😅
-
Two remarks:
-
In the transformer decoder architecture figure, the term K is overloaded (K = XW_k and also K = number of KV heads).
-
In the transformer decoder architecture plot, G is query heads per KV head. I wonder, should it instead be reversed, i.e. KV heads per query head? Given grouped-query attention, we are grouping the queries, so shouldn't the number of query heads always be smaller than the number of KV heads?
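For context, here is a minimal sketch of the grouped-query attention bookkeeping, with hypothetical shapes (not the figure's actual numbers): the query heads outnumber the KV heads, and G = N / K query heads share each KV head, so G ≥ 1.

```python
import jax
import jax.numpy as jnp

B, T, D = 2, 16, 256        # batch, sequence length, model dim (hypothetical)
N, K_heads, H = 8, 2, 32    # query heads, KV heads, head dim; D = N * H
G = N // K_heads            # query heads per KV head, so G >= 1

kx, kq, kk, kv = jax.random.split(jax.random.PRNGKey(0), 4)
x  = jax.random.normal(kx, (B, T, D))
wq = jax.random.normal(kq, (D, N, H))
wk = jax.random.normal(kk, (D, K_heads, H))
wv = jax.random.normal(kv, (D, K_heads, H))

q = jnp.einsum('btd,dnh->btnh', x, wq)      # [B, T, N, H]  -- N query heads
k = jnp.einsum('btd,dkh->btkh', x, wk)      # [B, T, K, H]  -- K KV heads
v = jnp.einsum('btd,dkh->btkh', x, wv)      # [B, T, K, H]

# Split the N query heads into K groups of G, so each group of G query
# heads attends against the same (shared) KV head.
q = q.reshape(B, T, K_heads, G, H)
scores = jnp.einsum('btkgh,bskh->bkgts', q, k) / jnp.sqrt(H)
probs = jax.nn.softmax(scores, axis=-1)
out = jnp.einsum('bkgts,bskh->btkgh', probs, v).reshape(B, T, N * H)
print(out.shape)  # (2, 16, 256)
```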
-
I might be incorrect here, but I was under the impression Flash Attention increased arithmetic intensity by reducing the number of memory accesses?
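A rough back-of-the-envelope model of why that helps (hypothetical sizes, ignoring the softmax and any recomputation): keeping the T × T score matrix out of HBM leaves the FLOP count essentially unchanged while dividing it by far fewer bytes, so FLOPs per byte goes up.

```python
# Rough HBM-traffic model for one attention head, bf16 activations (hypothetical).
T, H = 8192, 128            # sequence length, head dim
BYTES = 2                   # bytes per bf16 element

flops = 2 * T * T * H + 2 * T * T * H          # Q @ K^T plus probs @ V

# Unfused: Q, K, V read from HBM; the T x T scores written then re-read; output written.
unfused_bytes = BYTES * (3 * T * H + 2 * T * T + T * H)
# Fused (Flash-Attention-style): only Q, K, V in and the output out;
# the T x T matrix stays in on-chip memory and never touches HBM.
fused_bytes = BYTES * (3 * T * H + T * H)

print(f"unfused intensity ≈ {flops / unfused_bytes:.0f} FLOPs/byte")
print(f"fused intensity   ≈ {flops / fused_bytes:.0f} FLOPs/byte")
```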
-
Typo in appendix?
-
Great resource, thank you. The second reference on the Transformer Math page looks like it should have the author "Shazeer, N." rather than "Noam, S.".
-
Thanks for sharing! A great resource for learning about LLMs.
-
In the section "What Should You Take Away from this Section?", the Training FLOPs per layer for Vocab is shown as 12BTDV. I think it should be 6BTDV, and it should be a total rather than per layer.
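For reference, a quick sanity check of that count with hypothetical sizes: the unembedding is a single [B·T, D] × [D, V] matmul, so roughly 2BTDV FLOPs forward and about twice that in the backward pass, i.e. roughly 6BTDV in total for training, counted once rather than per layer.

```python
B, T, D, V = 8, 2048, 4096, 32_000   # hypothetical batch, seq len, model dim, vocab

forward = 2 * B * T * D * V          # one [B*T, D] x [D, V] matmul
backward = 2 * forward               # grads w.r.t. both inputs and weights
total = forward + backward
assert total == 6 * B * T * D * V
print(f"vocab (unembedding) training FLOPs ≈ {total:.3e}")
```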
-
Discussing the Transformer architecture!