[DSV3] Apply TP on DSV3 #1341
Conversation
Could you verify numerical correctness by comparing e.g. FSDP 2 vs. FSDP 2 + TP 2, using the same seed checkpoint?
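For reference, a minimal sketch of how such a parity check could be scripted, assuming both runs dump an unsharded state dict plus a per-step loss list to the hypothetical paths shown below (this is not torchtitan's actual checkpoint layout):

```python
import torch

# Hypothetical output paths for the two runs being compared.
ref_path = "outputs/fsdp2/final_state.pt"       # FSDP 2
test_path = "outputs/fsdp2_tp2/final_state.pt"  # FSDP 2 + TP 2

ref = torch.load(ref_path, map_location="cpu")
test = torch.load(test_path, map_location="cpu")

# Compare model parameters tensor by tensor and report any mismatch.
for name, p_ref in ref["model"].items():
    p_test = test["model"][name]
    if not torch.equal(p_ref, p_test):
        max_diff = (p_ref - p_test).abs().max().item()
        print(f"{name}: max abs diff {max_diff:.3e}")

# Compare logged per-step losses, if both runs recorded them.
for step, (a, b) in enumerate(zip(ref.get("losses", []), test.get("losses", []))):
    if a != b:
        print(f"step {step}: loss {a} vs {b}")
```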
LGTM!
Note for reviewer: Because this diff is rebased onto #1324, a bunch of files are being changed. Please focus on changes under |
## Context
Mostly adapted from llama4; the TP plan is adjusted for the differences between the deepseek-v3 and llama attention models. Thanks @tianyu-l for the detailed walkthrough of the deepseek-v3 attention model and TP plan! This diff is currently based on #1324, since we want to extract the MoE model shared by DSV3 and llama4 into a common place. (An illustrative sketch of this kind of TP plan is attached at the end of this description.)

Now we have:
1. FSDP
2. Activation Checkpointing
3. TP
4. CP (in progress; currently hangs for a reason not yet diagnosed)

## Next Step
1. Make CP work

## Verification
There is a minor issue with the numerical verification: with a deterministic seed and the `AdamW` optimizer, the loss is not identical between
1. FSDP degree = 4 (blue line)
2. FSDP degree = 4, TP degree = 2 (orange line)

<img width="1368" alt="Screenshot 2025-07-01 at 5 38 50 PM" src="https://github.com/user-attachments/assets/38d96d75-6868-4482-a603-b9e10c692ed9" />

With the `Adam` optimizer, the loss is **exactly the same**:

<img width="1368" alt="Screenshot 2025-07-02 at 1 26 32 PM" src="https://github.com/user-attachments/assets/6b501d3c-4841-42b1-95fd-3971b16a5eeb" />

---------

Co-authored-by: Tianyu Liu <lty@fb.com>
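For readers less familiar with the DTensor-based TP API, here is a minimal sketch of the kind of per-block plan this PR applies. The module names (`attention.wq`, `feed_forward.w1`, `model.layers`, etc.) and the plain Megatron-style colwise/rowwise layout are illustrative assumptions; the actual plan in this PR differs because deepseek-v3 uses MLA attention and MoE layers.

```python
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)


def apply_tp_to_block(block: nn.Module, tp_mesh) -> None:
    """Apply a Megatron-style TP plan to one transformer block.

    The fully qualified names below are placeholders for a generic
    llama-like block, used only to illustrate the API shape.
    """
    layer_plan = {
        # Column-shard the input projections and row-shard the output
        # projections; the intermediate attention/MLP math then runs on
        # local shards (requires n_heads divisible by the TP degree and
        # the attention module reshaping with its local head count).
        "attention.wq": ColwiseParallel(),
        "attention.wk": ColwiseParallel(),
        "attention.wv": ColwiseParallel(),
        "attention.wo": RowwiseParallel(),
        "feed_forward.w1": ColwiseParallel(),
        "feed_forward.w3": ColwiseParallel(),
        "feed_forward.w2": RowwiseParallel(),
    }
    parallelize_module(block, tp_mesh, layer_plan)


# Usage sketch: a 1-D device mesh holding the TP group (size 2 here),
# applied block by block; `model.layers` is an assumed attribute name.
# tp_mesh = init_device_mesh("cuda", (2,), mesh_dim_names=("tp",))
# for block in model.layers.values():
#     apply_tp_to_block(block, tp_mesh)
```

Pairing colwise projections on the way in with rowwise projections on the way out keeps the intermediate attention and MLP activations sharded across the TP group, so each sub-block only needs a single all-reduce on its output.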