Mostly adapted from llama4, with the TP plan changed to account for the
differences between deepseek-v3 and llama.
Thanks @tianyu-l for the detailed walkthrough of the deepseek-v3
attention model and TP plan! This diff is currently based on #1324, and
we want to extract the MoE model used by DSV3 and llama4 into a shared place.
Now we have:
1. FSDP
2. Activation Checkpointing
3. TP
4. CP (in progress; it currently hangs for a reason not yet identified)
Next step:
1. Make CP work
There is a minor issue with the numerical verification: with a
deterministic seed, the loss is not identical across the two configurations below. I used the `AdamW` optimizer.
1. FSDP degree=4 (blue line)
2. FSDP degree=4, TP degree = 2 (orange line)
<img width="1368" alt="Screenshot 2025-07-01 at 5 38 50 PM"
src="https://github.com/user-attachments/assets/38d96d75-6868-4482-a603-b9e10c692ed9"
/>
With the `Adam` optimizer, the loss is **exactly the same**:
<img width="1368" alt="Screenshot 2025-07-02 at 1 26 32 PM"
src="https://github.com/user-attachments/assets/6b501d3c-4841-42b1-95fd-3971b16a5eeb"
/>
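To back the plots with a number, the per-step losses of the two runs can be compared directly. The following is only a minimal sketch, not part of this diff: `compare_loss_curves` is a hypothetical helper, and it assumes the losses of each run have already been collected into plain Python lists (for example, parsed from the training logs). With both tolerances left at zero it checks for exact equality.

```python
import math


def compare_loss_curves(baseline, candidate, rel_tol=0.0, abs_tol=0.0):
    """Compare two per-step loss sequences from runs that share a seed.

    Hypothetical helper for illustration. With the default tolerances of
    zero, two steps match only if their losses are exactly equal.
    """
    if len(baseline) != len(candidate):
        raise ValueError("loss curves must cover the same number of steps")

    # Index of the first step whose losses differ beyond the tolerance.
    first_mismatch = next(
        (
            step
            for step, (a, b) in enumerate(zip(baseline, candidate))
            if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
        ),
        None,
    )
    max_abs_diff = max(
        (abs(a - b) for a, b in zip(baseline, candidate)), default=0.0
    )
    return {
        "identical": first_mismatch is None,
        "first_mismatch_step": first_mismatch,
        "max_abs_diff": max_abs_diff,
    }


# Made-up loss values, only to show the call shape.
fsdp_only = [12.0, 11.5, 11.0]
fsdp_tp = [12.0, 11.75, 11.0]
report = compare_loss_curves(fsdp_only, fsdp_tp)
```

Reporting the first mismatching step alongside the maximum absolute difference helps tell a divergence that starts at step 0 (likely an initialization or forward-pass difference) from one that appears only after the first optimizer update.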
---------
Co-authored-by: Tianyu Liu <lty@fb.com>