All the Transformer Math You Need to Know | How To Scale Your Model #6
Replies: 9 comments · 11 replies
-
A scant few minutes ago, the MathJax for this chapter's introduction was not rendering, but it appears it was fixed live 😅
-
Two remarks:
-
In the transformer decoder architecture figure, the term K is overloaded (K = XW_k and also K = number of KV heads).
-
In the transformer decoder architecture plot, G is query heads per KV head. I wonder, should it instead be reversed, i.e. KV heads per query head? Given grouped-query attention, we are grouping the queries, so shouldn't the number of query heads always be smaller than the number of KV heads?
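For context, here is a minimal sketch of the grouped-query attention bookkeeping, with hypothetical shapes (not the figure's actual numbers): the query heads outnumber the KV heads, and G = N / K query heads share each KV head, so G ≥ 1.

```python
import jax
import jax.numpy as jnp

B, T, D = 2, 16, 256        # batch, sequence length, model dim (hypothetical)
N, K_heads, H = 8, 2, 32    # query heads, KV heads, head dim; D = N * H
G = N // K_heads            # query heads per KV head, so G >= 1

kx, kq, kk, kv = jax.random.split(jax.random.PRNGKey(0), 4)
x  = jax.random.normal(kx, (B, T, D))
wq = jax.random.normal(kq, (D, N, H))
wk = jax.random.normal(kk, (D, K_heads, H))
wv = jax.random.normal(kv, (D, K_heads, H))

q = jnp.einsum('btd,dnh->btnh', x, wq)      # [B, T, N, H]  -- N query heads
k = jnp.einsum('btd,dkh->btkh', x, wk)      # [B, T, K, H]  -- K KV heads
v = jnp.einsum('btd,dkh->btkh', x, wv)      # [B, T, K, H]

# Split the N query heads into K groups of G, so each group of G query
# heads attends against the same (shared) KV head.
q = q.reshape(B, T, K_heads, G, H)
scores = jnp.einsum('btkgh,bskh->bkgts', q, k) / jnp.sqrt(H)
probs = jax.nn.softmax(scores, axis=-1)
out = jnp.einsum('bkgts,bskh->btkgh', probs, v).reshape(B, T, N * H)
print(out.shape)  # (2, 16, 256)
```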
-
I might be incorrect here, but I was under the impression Flash Attention increased arithmetic intensity by reducing the number of memory accesses?
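A rough back-of-the-envelope model of why that helps (hypothetical sizes, ignoring the softmax and any recomputation): keeping the T × T score matrix out of HBM leaves the FLOP count essentially unchanged while dividing it by far fewer bytes, so FLOPs per byte goes up.

```python
# Rough HBM-traffic model for one attention head, bf16 activations (hypothetical).
T, H = 8192, 128            # sequence length, head dim
BYTES = 2                   # bytes per bf16 element

flops = 2 * T * T * H + 2 * T * T * H          # Q @ K^T plus probs @ V

# Unfused: Q, K, V read from HBM; the T x T scores written then re-read; output written.
unfused_bytes = BYTES * (3 * T * H + 2 * T * T + T * H)
# Fused (Flash-Attention-style): only Q, K, V in and the output out;
# the T x T matrix stays in on-chip memory and never touches HBM.
fused_bytes = BYTES * (3 * T * H + T * H)

print(f"unfused intensity ≈ {flops / unfused_bytes:.0f} FLOPs/byte")
print(f"fused intensity   ≈ {flops / fused_bytes:.0f} FLOPs/byte")
```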
-
Typo in appendix?
-
Great resource, thank you. The second reference on the Transformer Math page looks like it should have the author "Shazeer, N." rather than "Noam, S.".
-
Thanks for sharing! A great resource for learning about LLMs.
-
In the section "What Should You Take Away from this Section?", the Training FLOPs per layer for Vocab is shown as 12BTDV. I think it should be 6BTDV, and it should be a total rather than per layer.
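For reference, a quick sanity check of that count with hypothetical sizes: the unembedding is a single [B·T, D] × [D, V] matmul, so roughly 2BTDV FLOPs forward and about twice that in the backward pass, i.e. roughly 6BTDV in total for training, counted once rather than per layer.

```python
B, T, D, V = 8, 2048, 4096, 32_000   # hypothetical batch, seq len, model dim, vocab

forward = 2 * B * T * D * V          # one [B*T, D] x [D, V] matmul
backward = 2 * forward               # grads w.r.t. both inputs and weights
total = forward + backward
assert total == 6 * B * T * D * V
print(f"vocab (unembedding) training FLOPs ≈ {total:.3e}")
```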
-
Discussing the Transformer architecture!