resume checkpoint w/o deepspeed #1662
Replies: 2 comments
Update: … and set …, where 63056/1971 ≈ 32. In my resumed checkpoint I used 32 cards for training with DeepSpeed, so I infer that the resume-from-checkpoint usage above is correct, and what I need to change is how the saving code is refactored with DeepSpeed, right? Going further, this suggests I need to replace the saver operation in timm with save_checkpoint in DeepSpeed. Please correct me if there is any problem with this analysis!
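The ratio quoted above can be checked directly. A minimal sketch (variable names are mine, and the interpretation that the two step counts differ only by the data-parallel world size is my assumption, not something confirmed in the thread):

```python
# Sanity check on the step arithmetic from the update above:
# if the original run logged 63056 steps and the resumed DeepSpeed run
# logs 1971 steps over the same data, the ratio should match the
# data-parallel world size (32 cards here).
total_steps_single = 63056   # steps logged without data parallelism
steps_per_rank = 1971        # steps logged after resuming with DeepSpeed

world_size = round(total_steps_single / steps_per_rank)
print(world_size)  # 32
```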
@tangjiasheng, thanks for your question. I don't fully understand all parts of your question, so I will address just some points.
So if I refactor my code with DeepSpeed, are there any sample code snippets showing how to load and save a checkpoint with and without DeepSpeed?
To give a concrete setting:
Suppose the original PyTorch code is built with the timm trainer. After rewriting the code with DeepSpeed, e.g.:
(and similarly for the lines within train_one_epoch), here are some detailed questions I want to ask:
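For reference, DeepSpeed's engine exposes `save_checkpoint(save_dir, tag)` and `load_checkpoint(load_dir, tag)` as its checkpointing entry points. Below is a hypothetical stdlib stand-in that only illustrates the shape of that interface (state written under `save_dir/tag/`, resume needs just the directory and tag); the function bodies, the pickle format, and the `model_states.pkl` file name are my inventions for illustration and are not DeepSpeed's actual on-disk layout, which also shards optimizer state across ranks:

```python
import os
import pickle
import tempfile

def save_checkpoint(save_dir, tag, state):
    # Mimics the engine.save_checkpoint(save_dir, tag) calling pattern:
    # everything for one checkpoint lives under save_dir/tag/.
    ckpt_dir = os.path.join(save_dir, str(tag))
    os.makedirs(ckpt_dir, exist_ok=True)
    with open(os.path.join(ckpt_dir, "model_states.pkl"), "wb") as f:
        pickle.dump(state, f)

def load_checkpoint(load_dir, tag):
    # Mimics engine.load_checkpoint(load_dir, tag): resuming needs only
    # the checkpoint root and the tag of the step/epoch to restore.
    with open(os.path.join(load_dir, str(tag), "model_states.pkl"), "rb") as f:
        return pickle.load(f)

# Usage: save at the end of an epoch, reload on resume.
with tempfile.TemporaryDirectory() as d:
    save_checkpoint(d, "epoch_3", {"epoch": 3, "global_steps": 1971})
    state = load_checkpoint(d, "epoch_3")
    print(state["global_steps"])  # 1971
```

The design point being sketched: with a tag-based layout, the training loop never constructs file names itself, so swapping timm's saver for DeepSpeed's engine calls is a local change at the save/load sites rather than a rewrite of the loop.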
Thanks a lot for answering my questions.