All About Rooflines | How To Scale Your Model #3
Replies: 26 comments 31 replies
-
For the |
-
Nice work, thank you. But I am confused about why the roofline doesn't pass through the origin. Considering that |
-
When deriving the 240 (or ~500 when using a GPU) threshold for batch size B, under the assumption B << D, does this threshold vary significantly between a consumer-grade GPU (say, an RTX 3090) and an enterprise-grade GPU (say, an H100)? |
-
To help me remember, I took a personal note for Part 1, as below. I hope my understanding is correct.
High Arithmetic Intensity = Compute Bound
Low Arithmetic Intensity = Bandwidth Bound |
-
Unless I am mistaken, the reported FLOPs/s number of "1.98e15 bfloat16" for the H100 corresponds to operations with sparsity. The corresponding FLOPs/s for dense operations should be half of the reported one (see e.g. https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/). |
-
In the matrix multiplication section you're using B both for a matrix and as a shape parameter of the matrix A. Probably want to change one of them :). |
-
Is this a typo? |
-
In the roofline figure above, the boundary between the compute-bound (green) and bandwidth-bound (pink) regions should start at the point where the accelerator FLOPs/s curve flattens, right? Why is it not that way? Kindly explain. |
-
When reading the example of a matmul partitioned over two TPUs: X is split horizontally into [X0, X1], so [X0, X1] x [Y0, Y1]^T is simply X0 x Y0 + X1 x Y1. |
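The identity in this comment can be checked numerically. Below is a minimal NumPy sketch, assuming the split is along the contracting dimension D (column halves of X, row halves of Y); the helper name `split_matmul` and the shapes are hypothetical, chosen only for illustration.

```python
import numpy as np

def split_matmul(X, Y):
    """Compute X @ Y by splitting the contracting dimension D in half."""
    half = X.shape[1] // 2
    X0, X1 = X[:, :half], X[:, half:]   # column halves of X
    Y0, Y1 = Y[:half, :], Y[half:, :]   # matching row halves of Y
    # Each half could live on a different device; the partial products
    # have the full output shape and only need to be summed.
    return X0 @ Y0 + X1 @ Y1

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))   # [B, D]
Y = rng.standard_normal((8, 3))   # [D, F]
assert np.allclose(split_matmul(X, Y), X @ Y)
```

Each partial product is a full [B, F] array, which is why the two devices have to exchange and add their results at the end.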
-
It might be worth pointing out an intuition that can be derived from equation 8: communication is dominated by loading model parameters. With this insight, the quantization question can easily be answered by intuition. |
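To make that intuition concrete, here is a rough sketch that breaks down the bytes moved by a [B, D] x [D, F] matmul. The helper `matmul_bytes` and the example sizes are hypothetical, and the byte widths (2 for bfloat16, 1 for int8) are the only hardware-independent assumptions.

```python
def matmul_bytes(B, D, F, act_bytes=2, weight_bytes=2, out_bytes=2):
    """Bytes loaded and written for X[B, D] @ W[D, F] -> Y[B, F]."""
    return {
        "activations": act_bytes * B * D,
        "weights": weight_bytes * D * F,
        "outputs": out_bytes * B * F,
    }

B, D, F = 64, 8192, 8192                          # small batch, large weights
bf16 = matmul_bytes(B, D, F)                      # bfloat16 everywhere
int8_w = matmul_bytes(B, D, F, weight_bytes=1)    # int8 weights only

weight_fraction = bf16["weights"] / sum(bf16.values())
flops = 2 * B * D * F                             # unchanged by quantization
intensity_gain = (flops / sum(int8_w.values())) / (flops / sum(bf16.values()))

print(f"weights are {weight_fraction:.1%} of bytes moved in bf16")
print(f"int8 weights raise intensity by about {intensity_gain:.2f}x")
```

With B much smaller than D and F, nearly all traffic is weights, so halving the weight width nearly doubles the arithmetic intensity while the FLOP count stays the same.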
-
In the roofline plot, BW1 and BW2 are depicted with the same (or very similar) slope. Wouldn't we expect to see a higher slope for the BW2 plot compared to the BW1 plot? 🤔 |
-
Eq. 7 says that Intensity = FLOPs / BYTES_LOAD_N_WRITE, before it gets simplified in Eq. 8. It is a bit counter-intuitive to me that the intensity is reduced when we have more bytes to load and write (I thought lower intensity meant the workload is easier to accommodate given fixed compute, but maybe I'm wrong). Can someone help me understand this equation? |
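One way to read the ratio: intensity counts how many FLOPs you get per byte moved, so moving more bytes for the same work lowers it and pushes the workload toward being bandwidth-bound. A small sketch, assuming a bfloat16 matmul X[B, D] @ W[D, F] with 2BDF FLOPs and 2 bytes per element; the function names are hypothetical.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte loaded or written."""
    return flops / bytes_moved

def bf16_matmul_intensity(B, D, F):
    flops = 2 * B * D * F                        # one multiply + one add per term
    bytes_moved = 2 * (B * D + D * F + B * F)    # load X and W, write Y
    return arithmetic_intensity(flops, bytes_moved)

# Same work, twice the bytes: intensity halves.
print(arithmetic_intensity(100, 10), arithmetic_intensity(100, 20))

# When B << D and F, the intensity is close to B itself.
print(bf16_matmul_intensity(128, 8192, 8192))
```

Low intensity is therefore not "easier": it means the compute units spend more of their time waiting on memory.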
-
Regarding the answer to Q4, shouldn't we have B * BDF as the number of FLOPs, since we perform a BD * DF matrix multiplication B times? |
-
The denominator of formula (5) is GB. I believe "Bytes" or B is more accurate. |
-
I'm confused about the roofline figure. I would expect the "bend" in the roof to appear after the critical hardware intensity, not before. Here is how I am thinking about it, please let me know what I'm missing. Let
Then
There are two cases:
Therefore, the full (linear-linear) plot of C against I(A) is a line starting at the origin, going up linearly with slope B until I(A) = I(H) B_max / B, then staying constant at C_max. |
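The piecewise shape described here can be sketched directly. This is a minimal model, not the book's plotting code: the peak of 1.97e14 FLOPs/s is the TPU v5e figure quoted elsewhere in this thread, while the bandwidth of 8.2e11 bytes/s is an assumption chosen so that the critical intensity comes out near 240 FLOPs/byte.

```python
import math

PEAK_FLOPS_PER_S = 1.97e14   # TPU v5e peak quoted in this thread
HBM_BYTES_PER_S = 8.2e11     # ASSUMPTION: gives a critical intensity near 240

def attainable_flops_per_s(intensity, peak=PEAK_FLOPS_PER_S,
                           bandwidth=HBM_BYTES_PER_S):
    """Roofline: linear in intensity (slope = bandwidth) until it hits peak."""
    return min(peak, bandwidth * intensity)

def critical_intensity(peak=PEAK_FLOPS_PER_S, bandwidth=HBM_BYTES_PER_S):
    """Intensity at which the sloped line meets the flat roof."""
    return peak / bandwidth

crit = critical_intensity()
print(f"bend at about {crit:.0f} FLOPs/byte")
# Halving the bandwidth halves the slope and moves the bend to twice the intensity.
print(f"bend at about {critical_intensity(bandwidth=HBM_BYTES_PER_S / 2):.0f} FLOPs/byte")
```

On linear axes this is a straight line from the origin followed by a flat segment; roofline plots are often drawn on log-log axes, which changes how the bend looks but not where it is.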
-
Thanks for the great book! Just a minor nitpick, regarding the choice of conflicting comparatives:
I understand that the maximum of |
-
Thanks for your kind explanation. [ [ |
-
For Q2: bfloat16[B, D] * int8[D, F] -> bfloat16[B, F]. I think the answer is 4BDF FLOPs and not 2BDF, because my assumption is that multiplying int8 with bfloat is still a 2-FLOP operation, hence:
|
-
I am trying to fully grasp the fundamental basis of the Roofline model, specifically the comparison between an algorithm's Arithmetic Intensity and the hardware's Intensity(Accelerator). My confusion comes from this point:

The Arithmetic Intensity of an Algorithm is defined as Total FLOPs / Total Bytes Communicated. This ratio represents the computational density of the workload itself: for every byte of data my algorithm moves, it inherently performs 'X' FLOPs. This seems like a direct relationship within the algorithm's execution.

However, Intensity(Accelerator) for hardware, like the TPU v5e's 240 FLOPs/byte, is defined as Peak FLOPs/s / Peak Memory Bandwidth (Bytes/s). This seems to be a ratio of maximum hardware rates, not a direct statement that 'to perform 240 FLOPs, one byte is necessarily required by the hardware.'

Given these distinct definitions, why is a direct comparison between I_algo and I_accel a valid or meaningful way to determine if a computation is compute-bound or memory-bound? How do these two distinct ratios, one from the algorithm's inherent properties and the other from the hardware's peak capabilities, mathematically and physically connect to justify their direct comparison in the Roofline model? Could you please elaborate on the mathematical and physical justification for this comparison and why it is the correct analytical approach for understanding bottlenecks? |
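One way to see the connection is to compare times rather than ratios: T_math = FLOPs / peak FLOPs/s and T_comms = bytes / bandwidth. Rearranging T_math > T_comms gives FLOPs / bytes > peak / bandwidth, which is exactly I_algo > I_accel. The sketch below checks this on two made-up workloads; the function name is hypothetical, and the bandwidth of 8.2e11 bytes/s is an assumption chosen to reproduce the 240 FLOPs/byte figure.

```python
PEAK_FLOPS_PER_S = 1.97e14   # TPU v5e peak quoted in this thread
HBM_BYTES_PER_S = 8.2e11     # ASSUMPTION: gives I_accel near 240 FLOPs/byte

def compare_views(flops, bytes_moved,
                  peak=PEAK_FLOPS_PER_S, bandwidth=HBM_BYTES_PER_S):
    """Return the verdict from the time view and from the intensity view."""
    t_math = flops / peak                # seconds if only compute mattered
    t_comms = bytes_moved / bandwidth    # seconds if only data movement mattered
    by_time = "compute" if t_math > t_comms else "bandwidth"
    by_intensity = ("compute" if flops / bytes_moved > peak / bandwidth
                    else "bandwidth")
    return by_time, by_intensity

print(compare_views(flops=1e12, bytes_moved=1e9))   # 1000 FLOPs/byte
print(compare_views(flops=1e10, bytes_moved=1e9))   # 10 FLOPs/byte
```

So I_accel is not a claim that the hardware needs one byte per 240 FLOPs; it is the break-even density at which the two times are equal.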
-
Mathematically, B is defined as the number of rows in X[B,D]. But the conclusion ("compute-bound when our local batch size is greater than 240 tokens") uses B to mean "local batch size in tokens." How does B (number of rows) directly translate to "local batch size in tokens" in this context? |
-
The roofline plot in the example is FLOPs/s vs. FLOPs/s. In the answer to the question, we can derive FLOPs/s from B on the x axis, but the y axis should probably still be FLOPs/s? |
-
For the network roofline analysis: "Say we can copy 4.5e10 bytes in each direction and perform 1.97e14 FLOPs/s on each chip." The 1.97e14 FLOPs/s is the v5e peak. Given the v5e's 2D torus with 1600 Gbps ICI bandwidth, how do we obtain 4.5e10 bytes/s? |
-
Would love an answer to Q3 if anyone has it. 🙏 |
-
What does the dot-D in equations like |
-
T_math is described as comp_FLOPS/acceleratorFLOPS/S. What does FLOPS/S mean? Shouldn't we just divide the total FLOPs we need to compute by the accelerator FLOPs? |
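A small sketch of the units involved may help: the numerator is a count of operations (FLOPs) and the denominator is a rate (FLOPs per second), so the quotient is a time in seconds. The function name and the matmul sizes below are hypothetical; the 1.97e14 FLOPs/s peak is the TPU v5e figure quoted elsewhere in this thread.

```python
ACCELERATOR_FLOPS_PER_S = 1.97e14   # a rate: operations per second

def t_math_seconds(total_flops, flops_per_s=ACCELERATOR_FLOPS_PER_S):
    """FLOPs divided by FLOPs/s leaves seconds."""
    return total_flops / flops_per_s

# Example workload: a [B, D] x [D, F] matmul costs 2*B*D*F FLOPs.
B, D, F = 512, 8192, 8192
total_flops = 2 * B * D * F          # a count, no time unit
print(f"{t_math_seconds(total_flops) * 1e3:.2f} ms")   # roughly 0.35 ms
```

Dividing a count by another count would give a dimensionless number, which could not be compared against a communication time.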
-
Can someone clarify what Q3 means? I am not quite sure what this question is looking for. If we are using
This shows that the arithmetic intensity is If I guess maybe this question just want us to plot |
-
Discussions about rooflines!