
Grad Norm Differences Across Nodes #2240

@EugenHotaj

Description


Continuing the discussion from #2172 (thanks @mirceamironenco, @ebsmothers for the fix!).

We have runs on the exact same dataset/hparams, changing only the number of nodes (8 → 2 → 1). We noticed that reducing the number of nodes makes the gradient norm go up:

Here is an 8-node run:
[Screenshot: grad-norm curve, 8-node run]

Here is a 2-node run:
[Screenshot: grad-norm curve, 2-node run]

Here is a 1-node run:
[Screenshot: grad-norm curve, 1-node run]

We can see the grad norm at initialization differs by ~4x between the 8-node and 1-node runs. With the fix in #2172, I would expect the grad norms to be similar regardless of the world size. The only difference between the runs is the global batch size (64 on 1 node, 512 on 8 nodes), but I would not expect this to cause such a big difference.
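(For reference, one mechanism by which global batch size alone can shift the grad norm: if per-sample gradients at initialization are roughly i.i.d. noise around a small mean, the norm of the averaged gradient scales like 1/sqrt(batch size). This is a toy NumPy sketch of that statistical effect, not a claim about what torchtune computes; `dim` and the Gaussian noise model are illustrative assumptions.)

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4096  # hypothetical parameter count; just needs to be large

def mean_grad_norm(batch_size, trials=20):
    """Average norm of the batch-mean gradient, modeling per-sample
    gradients at init as zero-mean i.i.d. noise."""
    norms = []
    for _ in range(trials):
        per_sample = rng.normal(size=(batch_size, dim))
        norms.append(np.linalg.norm(per_sample.mean(axis=0)))
    return float(np.mean(norms))

small_batch = mean_grad_norm(64)    # 1-node global batch
large_batch = mean_grad_norm(512)   # 8-node global batch

# Under this noise model the ratio is ~sqrt(512/64) = sqrt(8) ≈ 2.8,
# i.e. the same order of magnitude as the ~4x gap observed above.
print(small_batch / large_batch)
```

So an 8x batch-size gap alone would predict roughly a 2.8x grad-norm gap under this model; whether the remaining discrepancy points at a scaling bug is the open question.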

Is it possible there are still some issues in how we compute / scale the gradients?
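(For anyone auditing the grad-norm computation: with sharded parameters, the invariant to check is that the global L2 norm equals the square root of the all-reduced sum of per-rank *squared* norms. A minimal single-process sketch of that invariant, simulating ranks with array slices rather than torchtune's actual distributed code:)

```python
import numpy as np

rng = np.random.default_rng(1)
full_grad = rng.normal(size=1024)       # the unsharded gradient
shards = np.array_split(full_grad, 4)   # pretend 4 ranks each hold a shard

# Each "rank" computes its local squared norm; an all-reduce (here: sum)
# combines them before taking the square root. Note that summing the
# norms themselves, rather than the squared norms, would overestimate
# the global norm.
local_sq = [float(np.sum(s ** 2)) for s in shards]
global_norm = float(np.sqrt(sum(local_sq)))

print(np.isclose(global_norm, np.linalg.norm(full_grad)))  # True
```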
