
[DSV3] Apply TP on DSV3 #1341


Merged
wwwjn merged 12 commits into deepseek-v3 from dsv3-tp on Jul 2, 2025

Conversation

@wwwjn (Contributor) commented Jun 26, 2025

Context

Mostly adapted from llama4, with the TP plan changed based on the differences between deepseek-v3 and llama.

Thanks @tianyu-l for the detailed walkthrough of the deepseek-v3 attention model and TP plan! This diff is currently based on #1324, and we want to extract the MoE model used by both DSV3 and llama4 into a shared place.
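
To make the attention difference concrete, here is a minimal sketch of a llama-style TP plan adjusted for deepseek-v3's multi-head latent attention (MLA). The module names (`attention.wq_b`, `attention.wkv_b`, `attention.wo`, `feed_forward.w1/w2/w3`) and the layout of `model.layers` are assumptions for illustration, not the exact code in this PR:

```python
import torch.nn as nn
from torch.distributed.device_mesh import DeviceMesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

def apply_tp(model: nn.Module, tp_mesh: DeviceMesh) -> None:
    # Unlike llama, MLA first down-projects into small latents (wq_a / wkv_a);
    # those stay replicated. Only the per-head up-projections and the output
    # projection are sharded across the TP mesh.
    layer_plan = {
        "attention.wq_b": ColwiseParallel(),   # q up-projection: shard heads
        "attention.wkv_b": ColwiseParallel(),  # kv up-projection: shard heads
        "attention.wo": RowwiseParallel(),     # output proj: reduce over heads
        # The dense FFN follows the usual llama-style colwise/rowwise plan.
        "feed_forward.w1": ColwiseParallel(),
        "feed_forward.w2": RowwiseParallel(),
        "feed_forward.w3": ColwiseParallel(),
    }
    # Assumes model.layers is an nn.ModuleDict of transformer blocks.
    for transformer_block in model.layers.values():
        parallelize_module(transformer_block, tp_mesh, layer_plan)
```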

Now we have (see the composition sketch after this list):

  1. FSDP
  2. Activation Checkpointing
  3. TP
  4. CP in progress (currently hangs for an unknown reason)
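
A rough sketch of how these pieces typically compose (order matters: TP shards each block first, activation checkpointing then wraps the block, and FSDP shards last). `apply_tp` is the sketch above; the `checkpoint_wrapper` / `fully_shard` usage here is an assumption about the surrounding code, not copied from this PR:

```python
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    checkpoint_wrapper,
)
from torch.distributed.fsdp import fully_shard

def parallelize(model, tp_mesh, dp_mesh):
    apply_tp(model, tp_mesh)  # 1. TP: shard attention/FFN inside each block
    for layer_id, block in model.layers.items():
        model.layers[layer_id] = checkpoint_wrapper(block)  # 2. AC per block
    for block in model.layers.values():
        fully_shard(block, mesh=dp_mesh)  # 3. FSDP: shard each block's params
    fully_shard(model, mesh=dp_mesh)      # root shards embeddings/output
    return model
```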

Next Step:

  1. Make CP work

Verification

There is a minor issue with the numerical verification: with a deterministic seed, the loss is not identical. I used the AdamW optimizer.

  1. FSDP degree=4 (blue line)
  2. FSDP degree=4, TP degree = 2 (orange line)
[Screenshot 2025-07-01: loss curves for FSDP degree=4 (blue) vs. FSDP degree=4 + TP degree=2 (orange) with AdamW]

With the Adam optimizer, the loss is exactly the same:
[Screenshot 2025-07-02: overlapping loss curves for the two runs with Adam]
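
For anyone reproducing this, a small helper (not part of the PR; the `step: N ... loss: X` log format is an assumption) to find the first step at which two runs diverge:

```python
import re

STEP_LOSS = re.compile(r"step:\s*(\d+).*?loss:\s*([0-9.]+)")

def read_losses(path: str) -> dict[int, float]:
    # Map step -> loss for every log line that matches the assumed format.
    with open(path) as f:
        matches = (STEP_LOSS.search(line) for line in f)
        return {int(m.group(1)): float(m.group(2)) for m in matches if m}

def first_divergence(a_log: str, b_log: str) -> None:
    a, b = read_losses(a_log), read_losses(b_log)
    for step in sorted(a.keys() & b.keys()):
        if a[step] != b[step]:
            print(f"first divergence at step {step}: {a[step]} vs {b[step]}")
            return
    print("losses identical on all common steps")

# e.g. first_divergence("fsdp4.log", "fsdp4_tp2.log")
```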

@wwwjn wwwjn requested review from H-Huang and tianyu-l June 26, 2025 00:48
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jun 26, 2025
@wwwjn wwwjn requested a review from fegin June 26, 2025 00:49
@tianyu-l (Contributor) left a comment


Could you verify numerical correctness by comparing e.g. FSDP 2 vs. FSDP 2 + TP 2, using the same seed checkpoint?
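For context, the seed-checkpoint idea is to save the freshly initialized weights once and load the same initial state in both runs, so any loss mismatch comes from the parallelism rather than from initialization. A minimal sketch with `torch.distributed.checkpoint` (torchtitan drives this through its checkpoint config; the paths and helper names here are made up):

```python
import torch.distributed.checkpoint as dcp
import torch.nn as nn

def save_seed_checkpoint(model: nn.Module, path: str = "outputs/seed_ckpt"):
    # Run once, right after model initialization.
    dcp.save({"model": model.state_dict()}, checkpoint_id=path)

def load_seed_checkpoint(model: nn.Module, path: str = "outputs/seed_ckpt"):
    # Run in every configuration under comparison, before the first step.
    state = {"model": model.state_dict()}
    dcp.load(state, checkpoint_id=path)  # DCP loads tensors in place
```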

@H-Huang (Member) left a comment


LGTM!

@wwwjn wwwjn changed the title [DSV3] Apply TP on DSV3 [WIP][DSV3] Apply TP on DSV3 Jun 29, 2025
@wwwjn wwwjn changed the title [WIP][DSV3] Apply TP on DSV3 [DSV3] Apply TP on DSV3 Jul 2, 2025
@wwwjn (Contributor, Author) commented Jul 2, 2025

Note for reviewers: because this diff is rebased onto #1324, a bunch of files show up as changed. Please focus on the changes under torchtitan/model/deepseek_v3/; all the changes needed for TP are under this folder.

@wwwjn wwwjn merged commit e4d4031 into deepseek-v3 Jul 2, 2025
3 of 4 checks passed
@wwwjn wwwjn deleted the dsv3-tp branch July 2, 2025 21:24
H-Huang pushed a commit to H-Huang/torchtitan that referenced this pull request Jul 8, 2025
wwwjn added a commit that referenced this pull request Jul 8, 2025
wwwjn added a commit that referenced this pull request Jul 10, 2025