All About Transformer Inference | How To Scale Your Model #9
-
In the first figure for sharding the KV cache, comparing MHA and MQA, I think the locations of the reduce-scatter and all-gather are reversed.
-
I had a question and was trying to understand this more.
Why do we need our batch size to also be fairly large? The equation seems to be defined by token length, so isn't that sufficient? The passage in question: "Why is this, conceptually? Mainly, we’re compute-bound in linear portions of the model because the parameters (the memory bandwidth-heavy components) are reused for many batch items. However, every batch item has its own KV cache, so a bigger batch size means more KV caches. We will almost always be memory bound here unless the architecture is adjusted aggressively."
Another question: why does batch size affect whether this is memory bound? Doesn't batch size cancel out in the equations you're showing? I get the point that T = 1 for generation, and since we still have to fetch the KV cache over S we're doing roughly half the memory work, but we lose the S*T multiplication on the FLOPs side during inference.
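To make my question concrete, here is a rough roofline sketch of how I'm reading it (the dimensions, sequence length, and bf16 assumption below are my own placeholders, not the chapter's numbers):

```python
# Rough roofline sketch: per-step generate FLOPs vs. HBM bytes for the MLP block
# and for attention over the KV cache, as batch size B grows. All dimensions,
# the bf16 assumption, and the sequence length are illustrative placeholders.
D_MODEL, D_FF = 8192, 32768        # assumed model dims
N_HEADS = 64                       # MHA: one KV head per query head
N_KV_HEADS = N_HEADS
D_HEAD = 128
S = 4096                           # tokens already sitting in the KV cache
BYTES = 2                          # bf16

def mlp_intensity(B):
    params = 3 * D_MODEL * D_FF            # gated MLP weights (illustrative)
    flops = 2 * B * params                 # T = 1 token per sequence during generation
    hbm_bytes = BYTES * params             # weights loaded once, reused across the batch
    return flops / hbm_bytes               # grows linearly with B

def attention_intensity(B):
    flops = 4 * B * S * N_HEADS * D_HEAD                 # QK^T plus attention @ V
    kv_bytes = 2 * BYTES * B * S * N_KV_HEADS * D_HEAD   # K and V caches, one per batch item
    return flops / kv_bytes                              # B and S cancel: stays ~1 for MHA

for B in (1, 32, 256, 1024):
    print(B, mlp_intensity(B), attention_intensity(B))
```

In this sketch the MLP intensity is just B (so it crosses a ~240 FLOPs/byte roofline once the batch is large enough), while the attention intensity stays near 1 no matter what B is, because every batch item brings its own KV cache. Is that the right way to read the quoted paragraph?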
-
I have a question about the Generation section and the calculations around HBM loading time. They seem to consider only the weight loading time and not the KV cache loading time as well. Does that need to be added to the calculation?
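Concretely, in the memory-bound regime I would have expected something like the sketch below, where the KV cache bytes add to the weight bytes (the model size, cache dimensions, and bandwidth are placeholders I picked, not the chapter's numbers):

```python
# Rough per-step latency sketch for the memory-bound regime; the model size,
# cache dimensions, and bandwidth are placeholders, not the chapter's numbers.
PARAM_COUNT = 70e9                  # assumed parameter count
BYTES_PER_PARAM = 2                 # bf16
B, S = 32, 4096                     # batch size and tokens already in the cache
N_LAYERS, N_KV_HEADS, D_HEAD = 80, 8, 128
HBM_BW = 8.2e11                     # assumed ~0.8 TB/s of HBM bandwidth per chip

weight_bytes = PARAM_COUNT * BYTES_PER_PARAM
kv_bytes = 2 * BYTES_PER_PARAM * B * S * N_LAYERS * N_KV_HEADS * D_HEAD  # K and V

print("weights only:", weight_bytes / HBM_BW, "s/step")
print("weights + KV:", (weight_bytes + kv_bytes) / HBM_BW, "s/step")
```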
-
About the part that says the "exact number will differ on the kind of quantization and hardware": if we use int8 to store parameters but bf16 for computation, why is the critical number 120? According to the formulas, the communication (memory traffic) for int8 drops to half, so wouldn't the critical B be 480?
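Working through the crossover arithmetic with rough TPU-v5e-like numbers (these are assumed values, not the chapter's exact specs): the FLOPs term is unchanged because the math is still done in bf16, but the bytes-per-parameter term halves, so compute catches up with memory at half the batch size rather than four times it.

```python
# Critical batch size where per-step matmul time equals weight-loading time.
# The accelerator numbers are rough TPU v5e-like placeholders, not exact specs.
FLOPS = 1.97e14            # assumed bf16 FLOPs/s
HBM_BW = 8.2e11            # assumed HBM bytes/s

def critical_batch(bytes_per_param):
    # T_math = 2 * B * P / FLOPS and T_mem = bytes_per_param * P / HBM_BW;
    # setting them equal and cancelling P gives the crossover batch size.
    return bytes_per_param * FLOPS / (2 * HBM_BW)

print(critical_batch(2))   # bf16 weights            -> ~240
print(critical_batch(1))   # int8 weights, bf16 math -> ~120 (half of 240, not 480)
```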
-
For the parameter-count formula, I am unsure where one of the terms comes from. For one layer, given the model dimensions, how do you get the total params for one layer, and then the total params across all layers?
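Here is a rough sketch of how I would count the pieces (this assumes MHA attention, a gated three-matrix MLP, and untied embeddings; the chapter's exact model may differ, and the dimensions below are placeholders):

```python
# Rough parameter-count sketch; assumes MHA attention, a gated (three-matrix) MLP,
# and separate input/output embeddings. Adjust to the chapter's exact model.
D_MODEL, D_FF = 4096, 16384
N_HEADS, D_QKV = 32, 128
N_LAYERS, VOCAB = 32, 128_000

attn_per_layer = 4 * D_MODEL * N_HEADS * D_QKV   # W_Q, W_K, W_V, W_O
mlp_per_layer = 3 * D_MODEL * D_FF               # gate, up, and down projections
per_layer = attn_per_layer + mlp_per_layer       # total params for one layer
all_layers = N_LAYERS * per_layer                # total params across all layers
embeddings = 2 * VOCAB * D_MODEL                 # input + output embedding tables
print(per_layer, all_layers, all_layers + embeddings)
```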
-
Shouldn't there be a positional-embedding matrix (of shape [max_seq_len, d_model]) as well, along with the ffw + vocab + attn params?
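For scale, a learned matrix of that shape would add max_seq_len * d_model parameters (roughly 34M if max_seq_len were 8192 and d_model 4096, both numbers assumed here), which is small next to the FFW and vocab terms; and if the model uses rotary position embeddings there are no learned positional parameters to count at all.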
-
For the newly defined model we're using in the worked problems, should the number of heads be 16 instead of 32, or the qk dim be 128 instead of 256? Right now, d_qkv * n_heads is 8192 instead of matching d_model at 4096.
-
Serving big models!