Training LLaMA 3 on TPUs | How To Scale Your Model #8
-
Thank you for this section! Thanks for this wonderful book again!
-
In the answer to the 4th question: the denominator should be 225.
-
What's the meaning of "number of gradient checkpoints" -- is it the ratio of the total size of the saved residuals to the size of a single layer's input?
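A minimal sketch of one plausible reading, not necessarily the chapter's exact definition: if the "number of gradient checkpoints" counts how many residual-stream-sized (`[batch, seq, d_model]`) activations are saved per layer for the backward pass, then checkpointed activation memory scales linearly with that count. The LLaMA 3-70B shapes below (d_model = 8192, 80 layers, 1024 sequences of 4096 tokens) come from the chapter and the Llama 3 paper; `checkpoints_per_layer` is a hypothetical knob.

```python
# Rough activation-checkpoint memory estimate (a sketch, not the book's exact formula).
# Assumption: "number of gradient checkpoints" counts how many [batch, seq, d_model]-shaped
# activations are kept per transformer layer between the forward and backward pass.

def checkpoint_memory_gib(
    batch_sequences: int = 1024,      # sequences per batch (1024 x 4096 tokens)
    seq_len: int = 4096,
    d_model: int = 8192,              # LLaMA 3-70B hidden size
    n_layers: int = 80,               # LLaMA 3-70B layer count
    checkpoints_per_layer: int = 1,   # hypothetical knob: 1 = save only each layer's input
    bytes_per_value: int = 2,         # bf16
) -> float:
    tokens = batch_sequences * seq_len
    per_layer_bytes = checkpoints_per_layer * tokens * d_model * bytes_per_value
    return n_layers * per_layer_bytes / 2**30

print(checkpoint_memory_gib(checkpoints_per_layer=1))  # 5120.0 GiB total (sharded across the pod)
print(checkpoint_memory_gib(checkpoints_per_layer=4))  # 20480.0 GiB
```

Under this reading, the ratio asked about above (total size of the saved residuals vs. a single layer input) is the same quantity as `checkpoints_per_layer` in the sketch.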
-
When computing T_math, should we also take TPU utilization into account in practice? E.g., in the LLaMA 3-70B example we assumed 40% FLOPs utilization; should we instead compute T_math = total FLOPs / (peak FLOPs/s × 40%)?
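One way to see the distinction, sketched under explicit assumptions: T_math as a roofline quantity is total FLOPs divided by peak hardware FLOPs/s, and the 40% MFU is then applied on top of it to get a wall-clock estimate. The chip count and peak FLOPs/s below are illustrative assumptions, not values from the comment.

```python
# Back-of-the-envelope training-time estimate. The chip count, peak FLOPs/s,
# and MFU below are assumptions for illustration only.

PARAMS = 70e9                      # LLaMA 3-70B parameters
TOKENS = 15e12                     # ~15T training tokens
TOTAL_FLOPS = 6 * PARAMS * TOKENS  # standard 6 * N * D estimate

PEAK_FLOPS_PER_CHIP = 4.59e14      # approximate TPU v5p bf16 peak
N_CHIPS = 8960                     # assumed: one full v5p pod
MFU = 0.40                         # the 40% utilization from the comment

t_math = TOTAL_FLOPS / (N_CHIPS * PEAK_FLOPS_PER_CHIP)  # ideal roofline time (100% utilization)
t_wall = t_math / MFU                                    # rough wall-clock estimate

print(f"T_math (ideal): {t_math / 86400:.1f} days")      # ~17.7 days
print(f"At 40% MFU:     {t_wall / 86400:.1f} days")      # ~44 days
```

Either convention works -- folding the 40% directly into T_math gives the same wall-clock answer -- as long as the utilization factor isn't applied twice.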
-
Thanks for making an amazing resource! I just noticed two small typos:
- should be "4.8 years"
- should be "1024 sequences of length 4096 per batch" (per the Llama 3 paper)
-
Training LLaMA on TPUs!