All About Rooflines | How To Scale Your Model #3

jacobaustin123 · 2025-02-03T02:21:33Z

jacobaustin123
Feb 3, 2025
Maintainer

Discussions about rooflines!

manishravula · 2025-02-05T00:51:12Z

manishravula
Feb 5, 2025 — with giscus

For the T_math in the distributed gemm case, if the flops are (BDF/2) and the compute remains the same, where do we get the extra 2s in the numerator and the denominator? (if that was representing the aggregate compute across two TPUs, then that should be made clear somewhere as well?)

1 reply

jacobaustin123 Feb 5, 2025 — with giscus
Maintainer Author

On a single TPU the compute would be 2BDF FLOPs (technically BDF multiplies and BF(D-1) adds). Split across two TPUs, each chip does half of this, so it's 2BDF/2 FLOPs per chip, with 2BF/2 bytes transferred from each chip.

xmfbit · 2025-02-05T08:40:21Z

xmfbit
Feb 5, 2025 — with giscus

Nice work, thank you. But I am confused about why the roofline doesn't pass through the origin, considering that Real FLOPs/s = min(Hardware FLOPs/s, BW * AI).

1 reply

fedelebron Feb 5, 2025 — with giscus
Collaborator

Roofline plots are traditionally done in log-log, which is why there's no "zero". We'll make an edit to clarify that, thanks!
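To make the log-log point concrete, here is a minimal sketch of the formula from the question. It assumes the TPU v5e figures quoted later in this thread (1.97e14 FLOPs/s and 8.1e11 bytes/s); the function name `attainable_flops` is illustrative and not from the book.

```python
import math

PEAK_FLOPS = 1.97e14  # assumed peak FLOPs/s (TPU v5e figure quoted in this thread)
HBM_BW = 8.1e11       # assumed memory bandwidth in bytes/s (same source)

def attainable_flops(intensity, peak=PEAK_FLOPS, bandwidth=HBM_BW):
    """Roofline: min(hardware FLOPs/s, bandwidth * arithmetic intensity)."""
    return min(peak, bandwidth * intensity)

# Sample intensities one decade apart, as they would be spaced on a log axis.
for intensity in (1, 10, 100, 1000):
    flops = attainable_flops(intensity)
    print(f"AI={intensity:>5} FLOPs/byte  log10(AI)={math.log10(intensity):.0f}  "
          f"attainable={flops:.3g} FLOPs/s")

# On a linear axis the line does pass through (0, 0). On a log axis that point
# would sit at log10(0), which is undefined, so it can never be drawn.
assert attainable_flops(0) == 0
```

On linear axes the same function starts at the origin, so the formula and the plot agree; only the axis scaling hides the zero.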

kishorepv · 2025-02-06T02:53:06Z

kishorepv
Feb 6, 2025 — with giscus

When deriving the 240 (or ~ 500 when using GPU) threshold for batch size B, under the assumption B << D, does this threshold vary significantly when using a consumer-grade GPU (say RTX 3090 etc.) versus enterprise-grade GPU (say H100) ?

1 reply

jacobaustin123 Feb 6, 2025 — with giscus
Maintainer Author

I'm not an expert on GPUs, but e.g. the RTX 3090 claims to support 268e12 FP16 FLOPs/s and to have a memory bandwidth of 936e9 bytes/s, which would give us a critical batch size of roughly 286 (source), so close to half that of the A100. Each generation will likely have a slightly different value depending on what workloads NVIDIA is trying to make efficient.
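The arithmetic behind that estimate can be sketched as below. It treats the ratio of peak FLOPs/s to memory bandwidth as the batch threshold, as the reply does. The hardware numbers are the ones quoted in this thread and are not independently verified here.

```python
def critical_intensity(peak_flops_per_s, bandwidth_bytes_per_s):
    """Peak compute divided by memory bandwidth, in FLOPs per byte."""
    return peak_flops_per_s / bandwidth_bytes_per_s

# Figures as quoted in this thread (assumptions, not verified specs).
rtx_3090 = critical_intensity(268e12, 936e9)
tpu_v5e = critical_intensity(1.97e14, 8.1e11)

print(f"RTX 3090: ~{rtx_3090:.0f} FLOPs/byte")
print(f"TPU v5e:  ~{tpu_v5e:.0f} FLOPs/byte")
```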

meetrais · 2025-02-06T16:17:20Z

meetrais
Feb 6, 2025 — with giscus

To help me remember, I took the personal notes below for Part 1. I hope my understanding is correct.

High Arithmetic Intensity = Compute Bound
If an operation has high arithmetic intensity, it keeps the compute units busy for longer relative to the data it moves, so it is compute-bound. It doesn't require much data transfer.

Low Arithmetic Intensity = Bandwidth Bound
If an operation has low arithmetic intensity, the compute units finish quickly and then sit idle waiting for more data. This makes it bandwidth-bound, because a higher GB/s will give a better result.

0 replies

sanagno · 2025-02-07T12:20:42Z

sanagno
Feb 7, 2025 — with giscus

Unless I am mistaken, the reported FLOPs/s number of "1.98e15 bfloat16" for the H100, corresponds to operations with sparsity. The corresponding FLOPs/s with dense operations, should be half of the reported one (see e.g. https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/).

1 reply

jacobaustin123 Feb 7, 2025
Maintainer Author

Thanks for noting this, updated.

damek · 2025-02-07T14:59:22Z

damek
Feb 7, 2025 — with giscus

In the matrix multiplication section you're using B both for a matrix and for a shape parameter of the matrix A. Probably want to change one of them :).

1 reply

jacobaustin123 Feb 7, 2025
Maintainer Author

Good call, fixed. Will update in a moment.

zhipengzhaocmu · 2025-02-07T19:29:54Z

zhipengzhaocmu
Feb 7, 2025 — with giscus

Is this a typo? 1e12 / 9.89e14 = 1.01us and 1e12 / 9.1e14 = 1.1ms. The first "us" should be "ms".

1 reply

jacobaustin123 Feb 7, 2025
Maintainer Author

Yes, good catch. Just updated this and it slipped through. Fixed now!
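A quick check of the two divisions from the comment above, as a minimal sketch (the helper name `compute_time_s` is illustrative):

```python
def compute_time_s(total_flops, flops_per_s):
    """Time to run a workload: FLOPs divided by FLOPs/s gives seconds."""
    return total_flops / flops_per_s

for rate in (9.89e14, 9.1e14):
    t = compute_time_s(1e12, rate)
    # Print the same duration in both units to make the scale obvious.
    print(f"1e12 FLOPs at {rate:.3g} FLOPs/s: {t * 1e3:.2f} ms = {t * 1e6:.0f} us")
```

Both results are on the order of a millisecond, which confirms that "ms" is the right unit.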

kirachy · 2025-02-08T08:23:27Z

kirachy
Feb 8, 2025 — with giscus

In the roofline figure above, the boundary between the compute-bound (green) and bandwidth-bound (pink) regions should start at the point where the accelerator FLOPs/s flattens, right? Why is it not drawn that way? Kindly explain.

5 replies

jacobaustin123 Feb 8, 2025
Maintainer Author

This is a mistake. The correct figure is something like

I'll update the website.

sajjad2881 Feb 21, 2025 — with giscus

This is really a small (and embarrassing) detail but it took me a couple of minutes to realize that the vertical blue dotted lines were referring to the 2 algorithms. Making that text blue might make it easier for future readers :)

iankur Mar 4, 2025 — with giscus

@JacobAustin can you please explain how the pink line achieves peak FLOPs/s at a lower arithmetic intensity than the critical hardware intensity, i.e. why the point where it becomes flat comes before the green point?

iankur Mar 5, 2025 — with giscus

@jacobaustin123 sorry I just realized I tagged wrong person

@JacobAustin can you please explain how the pink line achieves peak FLOPs/s at a lower arithmetic intensity than the critical hardware intensity, i.e. why the point where it becomes flat comes before the green point?

jacobaustin123 Mar 5, 2025
Maintainer Author

The "critical hardware intensity" depends on the bandwidth, so you should talk about the "critical hardware intensity with respect to HBM". For example, if your weights are stored in CPU RAM, the bandwidth is way lower, so you have a much higher critical intensity. TPUs also have a memory called VMEM with much higher bandwidth than HBM, and thus a much lower critical intensity.

Shua1 · 2025-02-10T00:28:19Z

Shua1
Feb 10, 2025 — with giscus

When reading the example of partitioned matmul over two TPUs,
I was confused about why we don't need to compute X1 x Y0 and X0 x Y1.
It turns out I just needed to refresh my algebra knowledge.

X is split along its columns into
X0 = X[:, :D // 2]
X1 = X[:, D // 2:]
Y is split along its rows into
Y0 = Y[:D // 2, :]
Y1 = Y[D // 2:, :]

[X0, X1] x [Y0, Y1]^t is simply X0 x Y0 + X1 x Y1.
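The identity can be checked numerically. This is a small pure-Python sketch with arbitrary integer matrices; the helpers `matmul` and `add` are written here for illustration and are not from the book.

```python
def matmul(a, b):
    """Naive matrix product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def add(a, b):
    """Elementwise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

B, D, F = 2, 4, 3
X = [[i * D + k + 1 for k in range(D)] for i in range(B)]  # shape [B, D]
Y = [[k * F + j + 1 for j in range(F)] for k in range(D)]  # shape [D, F]

# Split the contracting dimension D in half, as in the comment above.
X0 = [row[:D // 2] for row in X]   # X[:, :D // 2]
X1 = [row[D // 2:] for row in X]   # X[:, D // 2:]
Y0 = Y[:D // 2]                    # Y[:D // 2, :]
Y1 = Y[D // 2:]                    # Y[D // 2:, :]

full = matmul(X, Y)
partial_sum = add(matmul(X0, Y0), matmul(X1, Y1))
assert full == partial_sum  # no cross terms X0 x Y1 or X1 x Y0 are needed
```

Each chip can therefore compute one partial product locally, and only the two [B, F] partial results need to be exchanged and summed.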

1 reply

findmyway Mar 13, 2025 — with giscus

Yeah, usually it is easier to think about the tensor partition in the "vector" form.

Shua1 · 2025-02-10T04:50:23Z

Shua1
Feb 10, 2025 — with giscus

It might be worth pointing out that an intuition can be derived from equation 8: communication is dominated by loading model parameters. With this insight, the quantization question can be answered easily by intuition.

0 replies

sumit-2020 · 2025-02-10T04:50:32Z

sumit-2020
Feb 10, 2025 — with giscus

In the roofline plot, BW1 and BW2 are depicted with the same (or very similar) slope. Wouldn't we expect to see a higher slope for the BW2 plot compared to the BW1 plot? 🤔

2 replies

Shua1 Feb 10, 2025 — with giscus

Why is that?

jacobaustin123 Feb 11, 2025 — with giscus
Maintainer Author

They would be in linear scale. In log scale they have the same slope but different intercepts.
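A small sketch of why this holds: on the bandwidth-bound segment, FLOPs/s = BW * I, so log(FLOPs/s) = log(BW) + log(I). The two bandwidth values below are hypothetical and chosen only for illustration.

```python
import math

def log_log_slope(bandwidth, i_lo=1.0, i_hi=10.0):
    """Slope of log10(bandwidth * I) against log10(I) on the bandwidth-bound segment."""
    rise = math.log10(bandwidth * i_hi) - math.log10(bandwidth * i_lo)
    run = math.log10(i_hi) - math.log10(i_lo)
    return rise / run

BW1, BW2 = 4.0e11, 8.0e11  # hypothetical bandwidths in bytes/s

for name, bw in (("BW1", BW1), ("BW2", BW2)):
    # The intercept is the value of log10(FLOPs/s) at I = 1, i.e. log10(bandwidth).
    print(f"{name}: linear slope={bw:.1e}  log-log slope={log_log_slope(bw):.2f}  "
          f"log-log intercept={math.log10(bw):.2f}")
```

In linear scale the slopes differ (they equal the bandwidths). In log scale both slopes are 1 and only the intercept shifts.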

Byungsooo · 2025-02-11T22:41:28Z

Byungsooo
Feb 11, 2025 — with giscus

Eq. 7 says that Intensity = FLOPs / BYTES_LOAD_N_WRITE, before it gets simplified in Eq. 8. It is a bit counter-intuitive to me that the intensity is reduced when we have more bytes to load and write (I thought lower intensity meant the workload is easier to accommodate given a fixed compute, but maybe I'm wrong). Can someone help me understand this equation?

2 replies

jacobaustin123 Feb 12, 2025
Maintainer Author

When you say "the intensity is reduced", what are you referencing? When we have more bytes to read/write, intensity is decreased which makes it harder to run on hardware since it has fewer FLOPs to overlap

Byungsooo Feb 12, 2025 — with giscus

Thanks. Yes, I was referring to having more bytes to read/write. After reading the paragraph again, it is clear what it meant: when intensity(computation) drops below intensity(accelerator), we are dominated by communication and waste compute.

mohit-shrma · 2025-02-22T20:27:29Z

mohit-shrma
Feb 22, 2025 — with giscus

Regarding Answer to Q4, shouldn't we have B * BDF as number of flops as we perform BD * DF matrix multiplication B times?

1 reply

jacobaustin123 Feb 24, 2025
Maintainer Author

Yes, here I mean per-device.

pattrsn · 2025-03-01T21:48:25Z

pattrsn
Mar 1, 2025 — with giscus

Formula (5) denominator is GB. I believe "Bytes" or B is more accurate

1 reply

jacobaustin123 Mar 10, 2025 — with giscus
Maintainer Author

Fixed on GitHub. Will update the website today. Thanks!

tuananhle7 · 2025-03-13T02:36:49Z

tuananhle7
Mar 13, 2025 — with giscus

I'm confused about the roofline figure. I would expect the "bend" in the roof to appear after the critical hardware intensity, not before.

Here is how I am thinking about it, please let me know what I'm missing.

Let

Algorithm's computation [FLOPS] ... F
Algorithm's communication [GB] ... G
Accelerator's max computation speed [FLOPS/s] ... C_max
Accelerator's max bandwidth [GB/s] ... B_max
Actual bandwidth [GB/s] ... B (this would be either BW1 or BW2 in the plot).

Then

Accelerator's intensity [FLOPS / GB] ... I(H) = C_max / B_max (critical hardware intensity in the figure)
Algorithm's intensity [FLOPS / GB] ... I(A) = F / G
Computation time [s] ... T_comp = F / C_max
Communication time [s] ... T_comm = G / B.

There are two cases:

Compute-bound case.

This happens when T_comp > T_comm <=> F / C_max > G / B <=> I(A) = F / G > C_max / B = C_max / B_max * B_max / B = I(H) B_max / B.
Here, the realized compute speed C = F / T_comp = F / (F / C_max) = C_max.
That is, if we plot C against I(A), it would be a constant (C_max) and it would start when I(A) > I(H) B_max / B, i.e. to the right of I(H) since B_max / B >= 1.

Communication-bound case.

This is when T_comp < T_comm, i.e. when I(A) < I(H) B_max / B.
Here, the realized compute speed C = F / T_comm = F / (G / B) = (F / G) B = I(A) B.
That is, if we plot C against I(A), it's a linear plot with slope B.

Therefore, the full (linear-linear) plot of C against I(A) is a line starting at the origin, going up linearly with slope B until I(A) = I(H) B_max / B, then staying constant at C_max.

1 reply

jacobaustin123 Mar 13, 2025 — with giscus
Maintainer Author

The bend appears exactly at the critical intensity. The critical intensity is with respect to a particular bandwidth, so here the label should say "Critical intensity with respect to BW1".
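The dependence of the bend on the bandwidth can be sketched as follows, using the notation from the comment above (C_max, B, I(A)). The peak value is the TPU v5e figure quoted elsewhere in this thread, and the two bandwidths are hypothetical.

```python
import math

C_MAX = 1.97e14                              # assumed peak FLOPs/s
BANDWIDTHS = {"BW1": 4.0e11, "BW2": 8.0e11}  # hypothetical bandwidths in bytes/s

def realized_flops(intensity, bandwidth, c_max=C_MAX):
    """Realized compute speed: C = min(C_max, I(A) * B)."""
    return min(c_max, intensity * bandwidth)

def critical_intensity(bandwidth, c_max=C_MAX):
    """Intensity at which the roofline bends for this particular bandwidth."""
    return c_max / bandwidth

for name, bw in BANDWIDTHS.items():
    i_crit = critical_intensity(bw)
    below = realized_flops(0.5 * i_crit, bw)  # still on the sloped segment
    above = realized_flops(2.0 * i_crit, bw)  # on the flat segment
    print(f"{name}: bend at I={i_crit:.1f} FLOPs/byte, "
          f"below={below:.3g}, above={above:.3g} FLOPs/s")
    assert below < C_MAX and above == C_MAX
```

Each bandwidth has its own bend, and a higher bandwidth moves the bend to a lower intensity.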

sshkhr · 2025-04-02T04:03:36Z

sshkhr
Apr 2, 2025 — with giscus

Thanks for the great book! Just a minor nitpick, regarding the choice of conflicting comparatives:

In practice, we optimize against the maximum as the algebra is simpler and we can usually come close to this bound by overlapping our communication and computation.

I understand that the maximum of $T_{math}$ and $T_{comms}$ is the "maximum" being referred to in this statement. But it is technically the lower bound on time, so for a bit I was confused about why the word maximum was used.

1 reply

tb5874 Apr 2, 2025 — with giscus

Here is how I thought about it.

Suppose we need to move 2 bytes:
Outside to compute core (bandwidth): 1 byte/s
Compute core (speed): ∞ FLOPs/s
Outside to compute core: T_comms = 2 sec
Compute core: T_math ≈ 0 sec

T_comms is the moment when the outside delivers the last byte to the compute core.
The core then still needs T_math time after receiving it.
So the total must be at least T_comms (+ alpha).
Therefore the maximum, T_comms, is the lower bound on time.

If you find other good interpretations, please let me know.
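The two bounds can be written down directly. This is a minimal sketch: the lower bound assumes communication and computation overlap perfectly, and the upper bound assumes they do not overlap at all. The function name `time_bounds` is illustrative.

```python
def time_bounds(t_math, t_comms):
    """Return (lower, upper) bounds on total time in seconds."""
    lower = max(t_math, t_comms)  # perfect overlap: the slower of the two dominates
    upper = t_math + t_comms      # no overlap: the two run back to back
    return lower, upper

# The example above: 2 bytes at 1 byte/s, with effectively instant compute.
t_comms = 2 / 1.0
t_math = 0.0
print(time_bounds(t_math, t_comms))

# A case where both terms matter: the bounds differ by at most a factor of 2.
print(time_bounds(3.0, 2.0))
```

The "maximum" in the quoted sentence is the max over the two times, and that max is the smallest total time the hardware could possibly achieve.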

tb5874 · 2025-04-02T04:53:18Z

tb5874
Apr 2, 2025 — with giscus

Thanks for your kind explanation.
I'm confused about an example.
Please let me know what I'm missing.

[ Visualizing rooflines part ]
...
We can generally improve the performance of an algorithm either by increasing its arithmetic intensity or by increasing the memory bandwidth available (moving from BW1 to BW2).
...
I understand "by increasing its arithmetic intensity", but the other option confuses me.
BW2 means increased bandwidth, which delivers more bytes per second and therefore increases FLOPs/s.
I am just confused about why and how BW1 and BW2 have the same accelerator peak FLOPs/s.
Does this mean that when I select a TPU, I should check its bandwidth against the arithmetic intensity of my algorithm?

[ Network communication rooflines part ]
I thought the total communication time would be computed as in the matrix multiplication part above (which loads from HBM and writes back).
But the network communication rooflines part starts with the data already in place, does not write back to HBM, and silently does an all-gather.
I thought T_math would need to include the all-gather FLOPs, and T_comms would need to include the load and write-back.

0 replies

jonassvedas · 2025-04-15T10:01:30Z

jonassvedas
Apr 15, 2025 — with giscus

For Q2:

bfloat16[B, D] * int8[D, F] -> bfloat16[B, F]

I think the answer is 4BDF FLOPs and not 2BDF

Because my assumption is that multiplying int8 with bfloat is still a 2FLOPS operation, hence:

FLOPs = 2  *  ( (  D + ( D-1) )  *  BF )    =>   2*2DBF = 4BDF
      BF16       mult     add     cells

2 replies

jacobaustin123 Apr 15, 2025
Maintainer Author

Firstly, the 2 doesn't come from bf16, it comes from the multiply + add (D multiplies + (D - 1) adds per output element). Total FLOPs is still 2BDF, there's no extra factor of 2. The only place we'd get an extra factor of 2 is in the number of bytes being loaded in bf16.
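The count can be made explicit with a small sketch that tallies multiplies and adds per output element. The shapes are arbitrary and the function name is illustrative.

```python
def count_matmul_flops(B, D, F):
    """Count multiplies and adds for X[B, D] @ Y[D, F], independent of dtype."""
    multiplies = 0
    adds = 0
    for _ in range(B):          # each output row
        for _ in range(F):      # each output column
            multiplies += D     # one multiply per term of the dot product
            adds += D - 1       # summing D terms takes D - 1 additions
    return multiplies, adds

B, D, F = 4, 8, 16
multiplies, adds = count_matmul_flops(B, D, F)
total = multiplies + adds

assert multiplies == B * D * F
assert adds == B * F * (D - 1)
print(f"exact FLOPs = {total}, 2BDF approximation = {2 * B * D * F}")
```

The exact count is BF(2D - 1), which approaches 2BDF as D grows. The input dtype never enters the count; it only changes the bytes moved.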

jonassvedas Apr 16, 2025

Ah right, so regardless of the data type, each operation counts as one FLOP. Thanks for the explanation.

parambole · 2025-05-31T18:08:07Z

parambole
May 31, 2025 — with giscus

I am trying to fully grasp the fundamental basis of the Roofline model, specifically the comparison between an algorithm's Arithmetic Intensity and the hardware's Intensity(Accelerator).

My confusion comes from this point:

The Arithmetic Intensity of an Algorithm is defined as Total FLOPs / Total Bytes Communicated. This ratio represents the computational density of the workload itself – for every byte of data my algorithm moves, it inherently performs 'X' FLOPs. This seems like a direct relationship within the algorithm's execution.

However, Intensity(Accelerator) for hardware, like the TPU v5e's 240 FLOPs/byte, is defined as Peak FLOPs/s / Peak Memory Bandwidth (Bytes/s). This seems to be a ratio of maximum hardware rates, not a direct statement that 'to perform 240 FLOPs, one byte is necessarily required by the hardware.'

Given these distinct definitions, why is a direct comparison between I_algo and I_accel a valid or meaningful way to determine if a computation is compute-bound or memory-bound? How do these two distinct ratios, one from the algorithm's inherent properties and the other from the hardware's peak capabilities, mathematically and physically connect to justify their direct comparison in the Roofline model?

Could you please elaborate on the mathematical and physical justification for this comparison and why it is the correct analytical approach for understanding bottlenecks?
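One way to see the connection, following the time-based argument used elsewhere in this thread, is to compare the time spent on math with the time spent on communication. This is a sketch of the standard roofline reasoning, not a quote from the book:

```latex
T_\text{math} = \frac{\text{FLOPs}}{\text{Peak FLOPs/s}},
\qquad
T_\text{comms} = \frac{\text{Bytes}}{\text{Bandwidth (bytes/s)}}

T_\text{math} > T_\text{comms}
\iff
\frac{\text{FLOPs}}{\text{Peak FLOPs/s}} > \frac{\text{Bytes}}{\text{Bandwidth}}
\iff
\underbrace{\frac{\text{FLOPs}}{\text{Bytes}}}_{I_\text{algo}}
>
\underbrace{\frac{\text{Peak FLOPs/s}}{\text{Bandwidth}}}_{I_\text{accel}}
```

So comparing the two intensities is just a rearrangement of comparing the two times. The hardware ratio does not say that 240 FLOPs need one byte; it says that during the time the memory system delivers one byte, the compute units could perform about 240 FLOPs.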

0 replies

parambole · 2025-05-31T18:34:20Z

parambole
May 31, 2025 — with giscus

Thus we become compute-bound when our local batch size is greater than 240 tokens, a very simple rule!

Mathematically, B is defined as the number of rows in X[B,D]. But the conclusion ("compute-bound when our local batch size is greater than 240 tokens") uses B to mean "local batch size in tokens."

How does B (number of rows) directly translate to "local batch size in tokens" in the context?

3 replies

jacobaustin123 Jun 1, 2025
Maintainer Author

Generally Transformers are sequence models, so we think of a batch of tokens having shape [B, T] where B is the "sequence batch size" and T is the "sequence length". However, from the perspective of a matmul, the B/T distinction doesn't matter, so the "local batch size in tokens" is B * T, or more precisely whatever part of that lives on a given TPU (in the context of sharding discussed in section 5). But generally you can think of it as just the number of tokens.

kiankyars Jul 26, 2025 — with giscus

Please explain the rationale behind assuming b is small relative to T so you can eliminate it

kiankyars Jul 26, 2025 — with giscus

Nevermind, you explain it after. Wondering why embedding dimension isn't being considered here.

macleginn · 2025-06-04T08:49:55Z

macleginn
Jun 4, 2025 — with giscus

Question 3: For the problem above, make a roofline plot of peak FLOPs vs. B for several values of D and F.

The roofline plot in the example is FLOPs/s vs. FLOPs/s. In the answer to the question, we can derive FLOPs/s from B on the x axis, but the y axis should probably still be FLOPs/s?

0 replies

yejingxin · 2025-06-11T21:00:15Z

yejingxin
Jun 11, 2025 — with giscus

For the network roofline analysis, Say we can copy 4.5e10 bytes in each direction and perform 1.97e14 FLOPs/s on each chip, the 1.97e14 FLOPs/s is v5e peak flops, consider v5e 2d torus, 1600 Gbps ici bw, how do we obtain 4.5e10 bytes/s?

0 replies

davidxia · 2025-07-03T20:23:47Z

davidxia
Jul 3, 2025 — with giscus

Would love an answer to Q3 if anyone has it. 🙏

0 replies

davidxia · 2025-07-03T20:29:04Z

davidxia
Jul 3, 2025 — with giscus

What does the dot-D in equations like X[B,D]⋅ D Y[D,F]→Z[B,F] mean?

2 replies

davidxia Jul 3, 2025 — with giscus

Pretty sure it means matmul, but I'm unfamiliar with this notation.

jacobaustin123 Jul 7, 2025 — with giscus
Maintainer Author

Footnote 10 describes this. Yes, it means matmul, and yes, it's a kind of einsum notation.

kiankyars · 2025-07-26T12:51:34Z

kiankyars
Jul 26, 2025 — with giscus

T_math is described as comp_FLOPS/acceleratorFLOPS/S. What does FLOPS/S mean? Shouldn't we just divide the total FLOPs we need to compute by the accelerator FLOPs?

1 reply

jacobaustin123 Jul 26, 2025
Maintainer Author

FLOPs ("Floating point OPerations") is a unit of total floating point operations performed. FLOPs/s is the number of FLOPs the hardware can perform every second. The ratio of the two is time.

jimlinntu · 2025-07-27T23:44:58Z

jimlinntu
Jul 27, 2025 — with giscus

Question 3: For the problem above, make a roofline plot of peak FLOPs vs. B for several values of D and F.

Can someone clarify what Q3 means? I am not quite sure what this question is looking for.

If we are using Question 2's setting:

FLOPs = 2BDF (for bfloat16 FLOPs)
Load+Write = 2BD + DF + 2BF ~= DF (Assuming DF >> BF and DF >> BD)

This shows that the arithmetic intensity is 2BDF / DF = 2B which is independent of D and F.

If D and F do not affect the arithmetic intensity, I don't see why we would plot for "several values of D and F" when they do not affect the plot. In other words, my understanding is that the peak FLOPs/s is always 1.97e14 FLOPs/s no matter what D and F we pick, as long as 2B >= 1.97e14 / 8.1e11.

I guess this question may just want us to plot peak FLOPs/s on the Y axis and batch size on the X axis as an exercise, assuming we can pick some arbitrary D and F since they do not matter (the question mentions plotting only "a" single plot), if I am reading it correctly.

3 replies

jimlinntu Jul 28, 2025

Replying to myself. One thing I notice from arithmetic_intensity = 2BDF / (2BD + DF + 2BF):

If I pick a random D and F to be D = F = 128, the arithmetic intensity will be 256B / (4B + 128) which means that as B goes to infinity, the arithmetic intensity will converge to 256/4 which is 64 and is always smaller than 1.97e14/8.1e11 = 243.20.

This means that for some D and F, no matter how you pick the batch size, the compute will always be memory-bound.

Maybe this is the insight this question is looking for.
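The saturation described above can be checked with a short sketch. It uses the exact byte count for the bfloat16[B, D] x int8[D, F] -> bfloat16[B, F] setup and the 1.97e14 / 8.1e11 figures quoted in this comment; the function name is illustrative.

```python
def arithmetic_intensity(B, D, F):
    """Exact intensity: 2BDF FLOPs over 2BD + DF + 2BF bytes moved."""
    flops = 2 * B * D * F
    bytes_moved = 2 * B * D + D * F + 2 * B * F
    return flops / bytes_moved

CRITICAL = 1.97e14 / 8.1e11  # hardware intensity, about 243 FLOPs/byte

D = F = 128
for B in (1, 64, 1024, 1_000_000):
    ai = arithmetic_intensity(B, D, F)
    print(f"B={B:>9}: intensity={ai:.2f}  compute-bound={ai > CRITICAL}")
```

For D = F = 128 the intensity approaches 64 from below as B grows, so it never reaches the critical value.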

kiankyars Jul 28, 2025 — with giscus

it is using questions 2's setting, it says "For the problem above"

jacobaustin123 Jul 28, 2025
Maintainer Author

I agree this question is unclear in what it's asking. I've updated it to use the interpretation you've given, namely: "Taking the setup from Question 2, make a roofline plot of peak FLOPs vs. $B$ for $F = D = 4096$ and $F = D = 1024$. Use the exact number of bytes loaded, not an approximation."

I think this is a more interesting question, since it calls out where this approximation fails, and gives you a nicer plot.
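As a starting point for the revised question, here is a hedged sketch that tabulates attainable FLOPs/s against B for the two sizes, using the exact byte count. It assumes the TPU v5e figures quoted earlier in this thread, and the function names are illustrative. Plotting is left out so the sketch has no dependencies.

```python
PEAK_FLOPS = 1.97e14  # assumed peak FLOPs/s (figure quoted in this thread)
HBM_BW = 8.1e11       # assumed HBM bandwidth in bytes/s (figure quoted in this thread)

def exact_intensity(B, D, F):
    """bfloat16[B, D] x int8[D, F] -> bfloat16[B, F], with exact bytes moved."""
    return (2 * B * D * F) / (2 * B * D + D * F + 2 * B * F)

def attainable_flops(B, D, F):
    """Roofline value: min(peak, bandwidth * intensity)."""
    return min(PEAK_FLOPS, HBM_BW * exact_intensity(B, D, F))

for size in (4096, 1024):
    print(f"D = F = {size}")
    for B in (1, 16, 64, 128, 256, 1024, 4096):
        flops = attainable_flops(B, size, size)
        bound = "compute" if flops == PEAK_FLOPS else "memory"
        print(f"  B={B:>5}: {flops:.3g} FLOPs/s ({bound}-bound)")
```

With the exact byte count, the smaller matrices need a larger B to reach the flat part of the roofline, which is the effect the 2B approximation hides.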
