refactor ParallelDims and CheckpointManager #1384
Conversation
cc @ebsmothers
@@ -180,17 +180,19 @@ class CheckpointManager:
     def __init__(
         self,
-        dataloader: DataLoader,
+        dataloader: BaseDataLoader | None,
@ebsmothers
I kept this field as required but made its value optional -- the code still works. I didn't make it completely optional with a `None` default, because that would require more if-else in this file. I think it won't look too bad when I specify `dataloader=None` in forge_engine.py. Let me know if that's OK with you.
Thanks for the heads up, I think this is a reasonable compromise
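For concreteness, a minimal sketch of the compromise described above, assuming a simplified `CheckpointManager`: the `dataloader` argument stays required, but its value may be `None`, so only a single guard is needed and callers such as forge_engine.py pass `dataloader=None` explicitly. Names here are illustrative, not the actual implementation.

```python
class CheckpointManager:
    def __init__(
        self,
        dataloader: "BaseDataLoader | None",
        # ... other arguments elided in this sketch ...
    ) -> None:
        self.states: dict[str, object] = {}
        # Register the dataloader for checkpointing only when one is provided,
        # avoiding extra if-else branches elsewhere in the class.
        if dataloader is not None:
            self.states["dataloader"] = dataloader


# A caller without a dataloader spells the absence out explicitly:
# checkpointer = CheckpointManager(dataloader=None, ...)
```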
Force-pushed from 668ee1e to 0b4cad7.
LGTM!
@@ -54,7 +55,7 @@ def parallelize_deepseekv3(
     apply_non_moe_tp(
Need to add the same `job_config.training.seq_len % parallel_dims.seq_len_divisor == 0` check for TP here. You could add it, or I could add it in the next PR.
OK, I can add them.
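For reference, a hedged sketch of the divisibility check being requested, wrapped as a small helper so it reads standalone; the helper name and error message are illustrative, not existing code.

```python
def validate_seq_len(seq_len: int, seq_len_divisor: int) -> None:
    # The sequence length must divide evenly across the sharded sequence dim.
    if seq_len % seq_len_divisor != 0:
        raise ValueError(
            f"seq_len ({seq_len}) must be divisible by "
            f"seq_len_divisor ({seq_len_divisor})"
        )


# e.g. in parallelize_deepseekv3, before apply_non_moe_tp when TP is enabled:
# validate_seq_len(job_config.training.seq_len, parallel_dims.seq_len_divisor)
```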
     )
     # Build world mesh for parallelism
-    world_mesh = parallel_dims.build_mesh(device_type=device_type)
+    world_mesh = parallel_dims.world_mesh
nit: We don't need this line here to build the world mesh explicitly, right? On line 134, `parallel_dims.world_mesh["tp"]` will call `build_mesh` internally.
It will be used on line 136 to set determinism.
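To make the lazy behavior concrete, a minimal sketch of a cached `world_mesh` property, assuming `build_mesh` only runs on first access; the caching attribute and the `device_type` default are assumptions, not the actual code.

```python
from torch.distributed.device_mesh import DeviceMesh


class ParallelDims:
    _world_mesh: "DeviceMesh | None" = None

    @property
    def world_mesh(self) -> DeviceMesh:
        if self._world_mesh is None:
            # Built on first access (e.g. parallel_dims.world_mesh["tp"]), so
            # an explicit "build world mesh" call at the call site is optional.
            self._world_mesh = self.build_mesh(device_type="cuda")
        return self._world_mesh

    def build_mesh(self, device_type: str) -> DeviceMesh:
        # Actual mesh construction elided in this sketch.
        raise NotImplementedError
```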
@@ -457,7 +454,9 @@ def train_step(
     [p for m in self.model_parts for p in m.parameters()],
     self.job_config.training.max_norm,
     foreach=True,
-    pp_mesh=self.world_mesh["pp"] if parallel_dims.pp_enabled else None,
+    pp_mesh=(
+        parallel_dims.world_mesh["pp"] if parallel_dims.pp_enabled else None
nit: it's getting a bit inconsistent whether we pass `parallel_dims` or a mesh object. I think it is not a big deal, though.
That's a good catch!
Passing in only `ParallelDims` seems sufficient, but I'm concerned it would break BC, since some users use this function as a standalone util -- so I think we should keep it not taking `ParallelDims`.
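A rough sketch of the BC point: if the util keeps accepting an optional mesh instead of a `ParallelDims`, standalone users are unaffected and the trainer resolves the mesh itself. The signature below is illustrative, not the exact API.

```python
from typing import Iterable

import torch
from torch.distributed.device_mesh import DeviceMesh


def clip_grad_norm_(
    parameters: Iterable[torch.nn.Parameter],
    max_norm: float,
    foreach: bool = True,
    pp_mesh: "DeviceMesh | None" = None,
) -> torch.Tensor:
    # With pipeline parallelism, gradient norms must be reduced across the pp
    # mesh before clipping; without it this behaves like the vanilla util.
    ...


# Trainer-side call: the trainer resolves the mesh from ParallelDims itself,
# so the util never has to depend on ParallelDims.
# clip_grad_norm_(
#     params,
#     job_config.training.max_norm,
#     foreach=True,
#     pp_mesh=parallel_dims.world_mesh["pp"] if parallel_dims.pp_enabled else None,
# )
```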
This PR does the following (a rough sketch of the resulting interface follows the list):
- moves `world_mesh` into `ParallelDims`, as they have a close relationship
- moves `enable_loss_parallel` out of the `ParallelDims` constructor
- moves `seq_len_divisor` to `ParallelDims`
- makes `dataloader` and `ft_manager` optional in `CheckpointManager`
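Putting the bullets together, a hedged sketch of what the refactored `ParallelDims` surface might look like; the field names, the `seq_len_divisor` formula, and the mesh layout shown here are assumptions inferred from the diff snippets in this conversation, not the exact implementation.

```python
from __future__ import annotations

from dataclasses import dataclass, field

from torch.distributed.device_mesh import DeviceMesh, init_device_mesh


@dataclass
class ParallelDims:
    dp_replicate: int
    dp_shard: int
    tp: int
    pp: int
    world_size: int
    # enable_loss_parallel is no longer a constructor argument; callers decide
    # loss-parallel behavior themselves.
    _world_mesh: DeviceMesh | None = field(default=None, init=False, repr=False)

    @property
    def seq_len_divisor(self) -> int:
        # Assumption: the sequence length must be divisible by the TP degree,
        # since TP/sequence parallel shards along the sequence dimension.
        return self.tp

    @property
    def world_mesh(self) -> DeviceMesh:
        # Built lazily on first access, as in the earlier sketch.
        if self._world_mesh is None:
            self._world_mesh = init_device_mesh(
                "cuda",
                (self.pp, self.dp_replicate * self.dp_shard, self.tp),
                mesh_dim_names=("pp", "dp", "tp"),
            )
        return self._world_mesh
```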