Training LLaMA 3 on TPUs | How To Scale Your Model #8
-
Thank you for this section! Thanks for this wonderful book again!
-
In the answer to the 4th question: the denominator should be 225.
-
What's the meaning of "number of gradient checkpoints" -- is it the ratio of the total size of the saved residuals to the size of a single layer's input?
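A minimal sketch of one plausible reading, not necessarily the chapter's exact definition: if the "number of gradient checkpoints" counts how many residual-stream-sized (`[batch, seq, d_model]`) activations are saved per layer for the backward pass, then checkpointed activation memory scales linearly with that count. The LLaMA 3-70B shapes below (d_model = 8192, 80 layers, 1024 sequences of 4096 tokens) come from the chapter and the Llama 3 paper; `checkpoints_per_layer` is a hypothetical knob.

```python
# Rough activation-checkpoint memory estimate (a sketch, not the book's exact formula).
# Assumption: "number of gradient checkpoints" counts how many [batch, seq, d_model]-shaped
# activations are kept per transformer layer between the forward and backward pass.

def checkpoint_memory_gib(
    batch_sequences: int = 1024,      # sequences per batch (1024 x 4096 tokens)
    seq_len: int = 4096,
    d_model: int = 8192,              # LLaMA 3-70B hidden size
    n_layers: int = 80,               # LLaMA 3-70B layer count
    checkpoints_per_layer: int = 1,   # hypothetical knob: 1 = save only each layer's input
    bytes_per_value: int = 2,         # bf16
) -> float:
    tokens = batch_sequences * seq_len
    per_layer_bytes = checkpoints_per_layer * tokens * d_model * bytes_per_value
    return n_layers * per_layer_bytes / 2**30

print(checkpoint_memory_gib(checkpoints_per_layer=1))  # 5120.0 GiB total (sharded across the pod)
print(checkpoint_memory_gib(checkpoints_per_layer=4))  # 20480.0 GiB
```

Under this reading, the ratio asked about above (total size of the saved residuals vs. a single layer input) is the same quantity as `checkpoints_per_layer` in the sketch.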
-
When computing T_math, should we also take TPU utilization into account in practice? E.g., in the LLaMA 3-70B example we assumed 40% FLOPs utilization; should we instead compute T_math = total FLOPs / (peak FLOPs/s × 40%)?
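One way to see the distinction, sketched under explicit assumptions: T_math as a roofline quantity is total FLOPs divided by peak hardware FLOPs/s, and the 40% MFU is then applied on top of it to get a wall-clock estimate. The chip count and peak FLOPs/s below are illustrative assumptions, not values from the comment.

```python
# Back-of-the-envelope training-time estimate. The chip count, peak FLOPs/s,
# and MFU below are assumptions for illustration only.

PARAMS = 70e9                      # LLaMA 3-70B parameters
TOKENS = 15e12                     # ~15T training tokens
TOTAL_FLOPS = 6 * PARAMS * TOKENS  # standard 6 * N * D estimate

PEAK_FLOPS_PER_CHIP = 4.59e14      # approximate TPU v5p bf16 peak
N_CHIPS = 8960                     # assumed: one full v5p pod
MFU = 0.40                         # the 40% utilization from the comment

t_math = TOTAL_FLOPS / (N_CHIPS * PEAK_FLOPS_PER_CHIP)  # ideal roofline time (100% utilization)
t_wall = t_math / MFU                                    # rough wall-clock estimate

print(f"T_math (ideal): {t_math / 86400:.1f} days")      # ~17.7 days
print(f"At 40% MFU:     {t_wall / 86400:.1f} days")      # ~44 days
```

Either convention works -- folding the 40% directly into T_math gives the same wall-clock answer -- as long as the utilization factor isn't applied twice.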
-
Thanks for making an amazing resource! I just noticed two small typos:
- should be "4.8 years"
- should be "1024 sequences of length 4096 per batch" (per the Llama 3 paper)
-
Training LLaMA on TPUs!