How to load model parameters or resume training (a ViT model trained with the trainer in hybrid parallel mode) #3721

stonewjf · 2023-05-10T03:14:31Z

stonewjf
May 10, 2023

Following the example in the tutorial, I get an error when loading the parameters with the code below:
from colossalai.utils import load_checkpoint
load_checkpoint('./checkpoints/checkpoint0002.pth', model, optimizer, lr_scheduler)
The error is as follows:
Traceback (most recent call last):
  File "train_with_trainer.py", line 143, in <module>
    load_checkpoint('./checkpoints/checkpoint0002.pth', model, optimizer, lr_scheduler)
  File "/home/haida_huanglei/anaconda3/envs/colossalai/lib/python3.8/site-packages/colossalai/utils/checkpointing.py", line 234, in load_checkpoint
    train_imagenet()
  File "train_with_trainer.py", line 96, in train_imagenet
    model_state = partition_pipeline_parallel_state_dict(model, model_state)
  File "/home/haida_huanglei/anaconda3/envs/colossalai/lib/python3.8/site-packages/colossalai/utils/checkpointing.py", line 133, in partition_pipeline_parallel_state_dict
    _send_state_dict(state_dict, gpc.get_next_global_rank(ParallelMode.PIPELINE), ParallelMode.PIPELINE)
  File "/home/haida_huanglei/anaconda3/envs/colossalai/lib/python3.8/site-packages/colossalai/utils/checkpointing.py", line 99, in _send_state_dict
    load_checkpoint('./checkpoints/checkpoint0002.pth', model, optimizer, lr_scheduler)
  File "/home/haida_huanglei/anaconda3/envs/colossalai/lib/python3.8/site-packages/colossalai/utils/checkpointing.py", line 234, in load_checkpoint
    state_tensor, state_size = dist.distributed_c10d._object_to_tensor(state_dict)
TypeError: _object_to_tensor() missing 1 required positional argument: 'device'

binmakeswell · 2023-05-16T08:24:12Z

binmakeswell
May 16, 2023
Maintainer

Hi @stonewjf, what code are you using? How can we reproduce your issue?
Maybe you can refer to
https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat#faq

1 reply

stonewjf May 19, 2023
Author

I am using the code from this link: https://github.com/hpcaitech/ColossalAI-Examples/blob/main/image/vision_transformer/hybrid_parallel/train_with_trainer.py to train a ViT model.

When I tried to continue training the model using the code below, I encountered this error.
from colossalai.utils import load_checkpoint
load_checkpoint('./checkpoints/checkpoint0002.pth', model, optimizer, lr_scheduler)
I found this code in https://colossalai.org/docs/basics/model_checkpoint/#load
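
The TypeError in this thread says that `_object_to_tensor` was called without a `device` argument that the installed PyTorch requires. That suggests, though the thread does not confirm it, a version mismatch between ColossalAI's `checkpointing.py` and the private PyTorch helper it calls. A minimal sketch of how such a mismatch could be detected and bridged is below. The names `accepts_parameter`, `object_to_tensor_compat`, `old_style`, and `new_style` are hypothetical and used only for illustration; they are not part of ColossalAI or PyTorch.

```python
import inspect


def accepts_parameter(func, name):
    """Return True if `func` declares a parameter called `name`."""
    return name in inspect.signature(func).parameters


def object_to_tensor_compat(object_to_tensor, obj, device="cpu"):
    """Call the helper with or without `device`, depending on its signature.

    Assumption: the only difference between the two helper versions is the
    extra `device` parameter named in the TypeError.
    """
    if accepts_parameter(object_to_tensor, "device"):
        return object_to_tensor(obj, device)
    return object_to_tensor(obj)


# Stand-ins for the two signatures (hypothetical, for illustration only).
def old_style(obj):
    return ("old", obj)


def new_style(obj, device):
    return ("new", obj, device)


if __name__ == "__main__":
    print(object_to_tensor_compat(old_style, {"epoch": 2}))
    print(object_to_tensor_compat(new_style, {"epoch": 2}))
```

With PyTorch installed, passing `torch.distributed.distributed_c10d._object_to_tensor` to `accepts_parameter` would show which signature the environment has. Because this is a private function whose parameters may differ again in other releases, aligning the PyTorch and ColossalAI versions is probably a more robust fix than wrapping the call.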
