How to Think About TPUs | How To Scale Your Model #4
Replies: 21 comments 35 replies
-
In the solution for question 5, it looks like bytes transferred should be 1.7e7 rather than 1.7e10, and the transfer time should be 170 µs rather than 170 ms.
-
Multiplication is more expensive than addition, right? When we say the number of operations in a matrix multiplication is ~2 × B × D × F (B·D·F multiplications and B·(D−1)·F additions), do we consider multiplication and addition equally expensive?
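As a sanity check on that 2·B·D·F count, a minimal sketch (the shapes here are arbitrary illustrations, not values from the article):

```python
def matmul_op_count(B, D, F):
    """Count scalar ops in a (B, D) @ (D, F) matrix multiply."""
    mults = B * D * F        # one multiply per (b, d, f) triple
    adds = B * (D - 1) * F   # summing D products takes D - 1 additions
    return mults, adds

mults, adds = matmul_op_count(B=32, D=1024, F=4096)
total = mults + adds
print(total)                            # just under 2 * B * D * F
print(total / (2 * 32 * 1024 * 4096))   # ratio approaches 1 as D grows
```

As for the cost question: on most matrix hardware, including TPU MXUs, the multiply and add are issued together as a single fused multiply-accumulate, which is why FLOP counts conventionally weight the two equally.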
-
In question 5, the answer mentions v4e; I think it should be v5e. Also, I don't understand why the total transfer time is multiplied by the number of hops. Streaming from one core to another should take num_bytes / bandwidth + latency_of_first_byte, which is approximately num_bytes / bandwidth when num_bytes is large. This means it should take about num_bytes / bandwidth = 170 µs for the whole transfer.
-
There is a typo in the line "In each frame above, we multiply all the overlapped green and blue units, sum the result with any residual passed in from above, and then pass the result in turn down one unit...". It should be "In each frame below" instead of "In each frame above".
-
H100 spec says:
But the text says:
int8 is about 8x faster than FP16. Any idea why it's 2x on H100 but 8x on TPU v5e? Thanks!
-
I am still grappling with when I should use a TPU (say, v6e) versus an NVIDIA B200. The compute FLOPs/s is higher for the B200 (4500 TFLOPs vs 920 TFLOPs, bf16), HBM bandwidth is higher for the B200 (8 TB/s vs 1.6 TB/s), HBM size is larger for the B200 (192 GB vs 32 GB), and the interconnect is faster for the B200 (NVLink at 1800 GB/s vs ICI at 180 GB/s)... Hence, I am a bit confused about the tradeoff. Isn't the B200 outperforming on all fronts (except perhaps cost, due to switching costs)?
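Taking the spec figures exactly as quoted in this comment (not verified against vendor datasheets), the per-metric ratios work out to:

```python
# Figures as quoted in the comment above (not verified against datasheets).
specs = {
    "bf16 TFLOP/s":       (4500, 920),   # (B200, TPU v6e)
    "HBM bandwidth GB/s": (8000, 1600),
    "HBM capacity GB":    (192, 32),
    "interconnect GB/s":  (1800, 180),   # NVLink vs one ICI link
}
for name, (b200, v6e) in specs.items():
    print(f"{name}: B200/v6e = {b200 / v6e:.1f}x")
```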
-
Answer 5: "the first byte will arrive in about 6us and the total transfer will take 188us." Wouldn't the total transfer time be 188 µs + 6 µs = 194 µs?
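Using the numbers quoted in this sub-thread (1.7e7 bytes over a 9e10 byte/s link with ~6 µs first-byte latency; all taken from the comments, not independently verified), the two accounting conventions differ only by the latency term:

```python
num_bytes = 1.7e7    # bytes to transfer, from the Q5 discussion above
bandwidth = 9e10     # bytes/s per link, assumed from the thread
latency = 6e-6       # seconds until the first byte arrives, as quoted

streaming_time = num_bytes / bandwidth   # ~189 us: bandwidth-only model
total_time = latency + streaming_time    # ~195 us if latency is added on top

print(f"{streaming_time * 1e6:.1f} us, {total_time * 1e6:.1f} us")
```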
-
What's the difference between a "slice" and a "pod"? The article simultaneously states:
-
Although ICI connects the TPUs within a pod (e.g., 8,076 TPUs), it would not be sufficient to directly connect a much larger number of TPUs—say 30,000 or even 100,000—as required for extremely large-scale training. In such scenarios, how are these TPUs interconnected? Are all the TPUs connected individually to the DCN (e.g., via Ethernet) using NICs, similar to how NVIDIA connects GPUs to a scale-out network despite having an NVLink network? Or do the OCS switches connecting the 8,076-TPU pod also interface with DCN switches, forming a hierarchical structure? This hierarchical approach would mean there isn't a completely separate scale-up (intra-pod) and scale-out (inter-pod) network, as in NVIDIA's systems, but rather a unified network with a layered design. Could you clarify?
-
I have several questions trying to connect a high-level picture (FLOPs, Bandwidths, etc) with the details of how systolic arrays work.
Where does the "8" come from? I like to think that systolic arrays perform a single matrix-vector multiplication per clock cycle, is that a bad mental model?
P.S. Thanks for the fantastic guide!
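On the mental model: one sketch that sometimes helps is deriving peak throughput from the systolic array dimensions. Every parameter below is an illustrative assumption (128×128 is a typical MXU size on recent TPU generations; clock rate and MXU count vary by part), not a figure from the article:

```python
# All parameters are illustrative assumptions, not figures from the article.
mxu_dim = 128      # systolic array of mxu_dim x mxu_dim multiply-accumulate cells
num_mxus = 4       # MXUs per core (varies by TPU generation)
clock_hz = 940e6   # illustrative clock rate

# Once the pipeline is full, every cell performs one multiply-accumulate
# (2 FLOPs) per cycle, i.e. each MXU completes one (128 x 128) @ (128,)
# matrix-vector product per cycle in steady state.
flops_per_cycle = 2 * mxu_dim * mxu_dim * num_mxus
peak_flops = flops_per_cycle * clock_hz
print(f"{peak_flops / 1e12:.1f} TFLOP/s")
```

Under this model, "one matrix-vector product per MXU per cycle" is a reasonable steady-state picture; extra constant factors in a spec-sheet derivation usually come from the number of MXUs, cores, or cycles per tile, so it is worth checking which multiplicand a quoted "8" attaches to.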
-
I might be missing something obvious here, forgive me. The link you provided for the v5p lists the Interchip Interconnect (ICI) BW as 4800 Gbps (600 GB/s): https://cloud.google.com/tpu/docs/v5p#system_architecture If it has a 6-way interconnect, wouldn't that be 100 GB/s per link? Where does 90 GB/s come from?
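The unit conversion at issue, spelled out (figures as quoted; whether the 600 GB/s is an aggregate over all six links, or something per-link or per-direction, is exactly the ambiguity being asked about):

```python
ici_gbps = 4800                # v5p ICI in Gbit/s, as quoted from the docs page
ici_gbytes = ici_gbps / 8      # 600 GB/s if read as an aggregate
links = 6                      # 3D torus: 6 neighbors per chip
per_link = ici_gbytes / links  # 100 GB/s per link under that reading

print(per_link)  # 100.0
```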
-
This is really insightful. Minor nit: the equation in the answer to Q4 has LHS max{T_math, T_comm} but RHS max{T_comm, T_math}.
-
Could you give a brief description of how the Optical Switches (OCS) compare and contrast against the Electrical Switches that are the standard in datacenters today? Is it mostly savings for Google, or are there advantages that ML practitioners can benefit from?
-
I see above that the ICI bidirectional bandwidth per link for TPU v6e (Trillium) is 180 GB/s. For the newly introduced Ironwood, the ICI bidirectional bandwidth per link is 1.2 Tbps, or 1200/8 = 150 GB/s. But the Ironwood announcement says its ICI is 1.5x that of Trillium. How should I reconcile this, or am I misunderstanding something?
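The arithmetic as quoted in the comment (figures not verified against the Ironwood announcement):

```python
ironwood_tbps = 1.2                      # Tbit/s per link, as quoted
ironwood_gbs = ironwood_tbps * 1000 / 8  # 150 GB/s
trillium_gbs = 180                       # GB/s per link, as quoted upthread

print(ironwood_gbs, ironwood_gbs / trillium_gbs)  # 150.0, ~0.83 -- not 1.5x
```

A common source of this kind of mismatch is comparing a per-direction figure against a bidirectional one (or per-link against aggregate), but that is only a guess here.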
-
I'm a bit confused by the ICI calculation in Q6. Does it assume the 15 GB is on adjacent TPUs, so 7.5 GB can travel over each link in one hop? The way I reasoned about this was that the upper bound would be transferring 1 GB from one corner to the other, which would be 6 hops: 1e9 / 9e10 ≈ 11 ms per hop, × 6 ≈ 66 ms to get the final GB to the destination TPU. This doesn't factor in latency.
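Writing out this upper-bound arithmetic (values as quoted above; whether the ×6 applies is the same store-and-forward vs. streaming question raised earlier in the thread):

```python
bytes_last_gb = 1e9
link_bw = 9e10                       # bytes/s per ICI link, as in the thread
hops = 6                             # corner-to-corner in the assumed topology

per_hop = bytes_last_gb / link_bw    # ~11.1 ms to push 1 GB across one link
store_and_forward = per_hop * hops   # ~66.7 ms if each hop waits for the full GB
pipelined = per_hop                  # ~11.1 ms if hops stream byte-by-byte

print(per_hop * 1e3, store_and_forward * 1e3)
```

If each hop waits for the full gigabyte before forwarding, the 6 hops multiply the time; if the links stream, the hops contribute only per-hop latency and the bandwidth term is paid once.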
-
This is phenomenal reading, thank you so much for writing this!!
-
CS336 (Spring 2025) has a great video about modern GPU architecture (a nice addendum to Appendix A): https://www.youtube.com/watch?v=6OBtO9niT00
-
Thanks for the great doc! I am curious: if we want to avoid the padding overhead, as JAX users, how should we update our code accordingly? Thank you!
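On the padding question, a minimal pure-Python sketch of the usual remedy, which is choosing (or rounding) trailing dimensions up to the tile size so the compiler pads nothing. The 128 here is an assumed tile dimension, and with JAX arrays the same widths would be passed to `jnp.pad`:

```python
def round_up(size, multiple=128):
    """Round a dimension up to the next multiple of the tile size."""
    return -(-size // multiple) * multiple

def pad_widths(shape, multiple=128):
    """Per-axis (before, after) padding needed to reach tile-aligned dims."""
    return [(0, round_up(s, multiple) - s) for s in shape]

print(round_up(300))           # 384
print(pad_widths((300, 512)))  # [(0, 84), (0, 0)]
```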
-
In exercise 3, how come we don't additionally have to account for the latency of going from HBM to VMEM? Is this because it's so much smaller than the HBM transfer time that it doesn't matter?
-
How do we have bidirectional bandwidth without wraparound in question 6? The article says that the latter is necessary for the former.
-
Hi! What a great post! I really enjoyed reading it! I believe it would be a great resource for the AI community here in Korea. With your team's permission, I would love to translate it and share it with them.
Of course, I would ensure that full credit and a link back to your original article are prominently featured. Thank you for your consideration. Best regards,
-
Discussion about TPUs!