resume checkpoint w/o deepspeed #1662
Replies: 2 comments
Update: … and set …, where 63056/1971 ≈ 32. In my resumed checkpoint I used 32 cards for training with DeepSpeed, so I infer that the resume-from-checkpoint usage above is correct, and what I need to change is how the saving code is refactored with DeepSpeed, right? Going further, this suggests I need to replace the saver operation in timm with save_checkpoint in DeepSpeed. Please correct me if there is any problem with this analysis!
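The ratio quoted above can be checked directly. A minimal sketch (variable names are mine, and the interpretation that the two step counts differ only by the data-parallel world size is my assumption, not something confirmed in the thread):

```python
# Sanity check on the step arithmetic from the update above:
# if the original run logged 63056 steps and the resumed DeepSpeed run
# logs 1971 steps over the same data, the ratio should match the
# data-parallel world size (32 cards here).
total_steps_single = 63056   # steps logged without data parallelism
steps_per_rank = 1971        # steps logged after resuming with DeepSpeed

world_size = round(total_steps_single / steps_per_rank)
print(world_size)  # 32
```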
@tangjiasheng, thanks for your question. I don't fully understand all parts of your question, so I will address just some points.
So if I refactor my code with DeepSpeed, are there any sample code snippets showing how to load and save a checkpoint with and without DeepSpeed?
To give a concrete setting:
Suppose the original PyTorch code is built with the timm trainer. After rewriting the code with DeepSpeed, e.g.:
(and similarly for the lines within train_one_epoch), here are some detailed questions I want to ask:
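For reference, DeepSpeed's engine exposes `save_checkpoint(save_dir, tag)` and `load_checkpoint(load_dir, tag)` as its checkpointing entry points. Below is a hypothetical stdlib stand-in that only illustrates the shape of that interface (state written under `save_dir/tag/`, resume needs just the directory and tag); the function bodies, the pickle format, and the `model_states.pkl` file name are my inventions for illustration and are not DeepSpeed's actual on-disk layout, which also shards optimizer state across ranks:

```python
import os
import pickle
import tempfile

def save_checkpoint(save_dir, tag, state):
    # Mimics the engine.save_checkpoint(save_dir, tag) calling pattern:
    # everything for one checkpoint lives under save_dir/tag/.
    ckpt_dir = os.path.join(save_dir, str(tag))
    os.makedirs(ckpt_dir, exist_ok=True)
    with open(os.path.join(ckpt_dir, "model_states.pkl"), "wb") as f:
        pickle.dump(state, f)

def load_checkpoint(load_dir, tag):
    # Mimics engine.load_checkpoint(load_dir, tag): resuming needs only
    # the checkpoint root and the tag of the step/epoch to restore.
    with open(os.path.join(load_dir, str(tag), "model_states.pkl"), "rb") as f:
        return pickle.load(f)

# Usage: save at the end of an epoch, reload on resume.
with tempfile.TemporaryDirectory() as d:
    save_checkpoint(d, "epoch_3", {"epoch": 3, "global_steps": 1971})
    state = load_checkpoint(d, "epoch_3")
    print(state["global_steps"])  # 1971
```

The design point being sketched: with a tag-based layout, the training loop never constructs file names itself, so swapping timm's saver for DeepSpeed's engine calls is a local change at the save/load sites rather than a rewrite of the loop.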
Thanks a lot for answering my questions.