How do you specify to use ZeRO-2, DDP, or ZeRO-3 (FSDP) in torchtune? #2872
Unanswered · Sinestro38 asked this question in Q&A
Hi! I'm tuning `llama3_1/8B_full` and trying to benchmark speed on an 8xH100 instance for various parallelization strategies (DDP, ZeRO-2, ZeRO-3). I looked through https://docs.pytorch.org/docs/stable/fsdp.html, and I understand that in PyTorch semantics, FSDP `sharding_strategy=NO_SHARD` means DDP, `SHARD_GRAD_OP` means ZeRO-2, and `FULL_SHARD` is essentially ZeRO-3. However, when I specify these in the YAML under […], nothing seems to change: the time per iteration and the memory utilization per GPU stay essentially the same across the three strategies. So I must not be setting them properly. How do you switch between these in torchtune?
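For context, a minimal sketch of the mapping described above in plain PyTorch FSDP1 (the `FullyShardedDataParallel` wrapper from the linked docs), not torchtune's YAML schema; the `wrap_with_strategy` helper and the strategy names are illustrative only, and it assumes the `torch.distributed` process group has already been initialized:

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

# Mapping from the question (plain PyTorch FSDP1, not torchtune config keys):
#   ShardingStrategy.NO_SHARD      ~ DDP    (params, grads, and optimizer states replicated)
#   ShardingStrategy.SHARD_GRAD_OP ~ ZeRO-2 (shard gradients and optimizer states)
#   ShardingStrategy.FULL_SHARD    ~ ZeRO-3 (shard params, gradients, and optimizer states)

def wrap_with_strategy(model: torch.nn.Module, strategy: str) -> FSDP:
    """Wrap a model with FSDP using one of the three strategies by name.

    Assumes torch.distributed.init_process_group() has already been called.
    """
    strategies = {
        "ddp": ShardingStrategy.NO_SHARD,
        "zero2": ShardingStrategy.SHARD_GRAD_OP,
        "zero3": ShardingStrategy.FULL_SHARD,
    }
    return FSDP(model, sharding_strategy=strategies[strategy])
```

Whether and how these enum values are exposed through torchtune's recipe YAML is exactly what the question is asking.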
Replies: 1 comment

- Perhaps you might know? cc: @krammnic :)
  0 replies