How do you specify to use ZeRO-2, DDP, or ZeRO-3 (FSDP) in torchtune? #2872
Unanswered · Sinestro38 asked this question in Q&A
Hi! I'm tuning `llama3_1/8B_full` and trying to benchmark speed on an 8xH100 instance for various parallelization strategies (DDP, ZeRO-2, ZeRO-3). I looked through https://docs.pytorch.org/docs/stable/fsdp.html, and I understand that in PyTorch semantics, FSDP `sharding_strategy=NO_SHARD` means DDP, `SHARD_GRAD_OP` means ZeRO-2, and `FULL_SHARD` is essentially ZeRO-3. However, when I specify these in the YAML under […], nothing seems to change: the time per iteration and the memory utilization per GPU stay essentially the same across the three strategies. So I must not be setting them properly. How do you switch between these in torchtune?
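For context, a minimal sketch of the mapping described above in plain PyTorch FSDP1 (the `FullyShardedDataParallel` wrapper from the linked docs), not torchtune's YAML schema; the `wrap_with_strategy` helper and the strategy names are illustrative only, and it assumes the `torch.distributed` process group has already been initialized:

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

# Mapping from the question (plain PyTorch FSDP1, not torchtune config keys):
#   ShardingStrategy.NO_SHARD      ~ DDP    (params, grads, and optimizer states replicated)
#   ShardingStrategy.SHARD_GRAD_OP ~ ZeRO-2 (shard gradients and optimizer states)
#   ShardingStrategy.FULL_SHARD    ~ ZeRO-3 (shard params, gradients, and optimizer states)

def wrap_with_strategy(model: torch.nn.Module, strategy: str) -> FSDP:
    """Wrap a model with FSDP using one of the three strategies by name.

    Assumes torch.distributed.init_process_group() has already been called.
    """
    strategies = {
        "ddp": ShardingStrategy.NO_SHARD,
        "zero2": ShardingStrategy.SHARD_GRAD_OP,
        "zero3": ShardingStrategy.FULL_SHARD,
    }
    return FSDP(model, sharding_strategy=strategies[strategy])
```

Whether and how these enum values are exposed through torchtune's recipe YAML is exactly what the question is asking.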
Replies: 1 comment

- Perhaps you might know? cc: @krammnic :)
  0 replies