How to load model parameters or resume training (a ViT model trained with the trainer in hybrid-parallel mode) #3721
Unanswered
stonewjf asked this question in Community | Q&A
Replies: 1 comment 1 reply
Hi @stonewjf, what code are you using? How can we reproduce your issue?
Following the example in the tutorial, I used the code below to load the parameters, and it raised an error:
from colossalai.utils import load_checkpoint
load_checkpoint('./checkpoints/checkpoint0002.pth', model, optimizer, lr_scheduler)
The error is as follows:
Traceback (most recent call last):
  File "train_with_trainer.py", line 143, in <module>
    train_imagenet()
  File "train_with_trainer.py", line 96, in train_imagenet
    load_checkpoint('./checkpoints/checkpoint0002.pth', model, optimizer, lr_scheduler)
  File "/home/haida_huanglei/anaconda3/envs/colossalai/lib/python3.8/site-packages/colossalai/utils/checkpointing.py", line 234, in load_checkpoint
    model_state = partition_pipeline_parallel_state_dict(model, model_state)
  File "/home/haida_huanglei/anaconda3/envs/colossalai/lib/python3.8/site-packages/colossalai/utils/checkpointing.py", line 133, in partition_pipeline_parallel_state_dict
    _send_state_dict(state_dict, gpc.get_next_global_rank(ParallelMode.PIPELINE), ParallelMode.PIPELINE)
  File "/home/haida_huanglei/anaconda3/envs/colossalai/lib/python3.8/site-packages/colossalai/utils/checkpointing.py", line 99, in _send_state_dict
    state_tensor, state_size = dist.distributed_c10d._object_to_tensor(state_dict)
TypeError: _object_to_tensor() missing 1 required positional argument: 'device'
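The last frame shows that, in the installed PyTorch, the private helper `torch.distributed.distributed_c10d._object_to_tensor` requires a positional `device` argument, while line 99 of ColossalAI's `checkpointing.py` passes only `state_dict`. That pattern usually means the ColossalAI code was written against a PyTorch release with the older one-argument signature (an assumption; the thread does not state either version). The sketch below illustrates the compatibility pattern using hypothetical stand-ins (`old_object_to_tensor`, `new_object_to_tensor`, `with_default_param`), not the real PyTorch helper: inspect the callee's signature and fill in the missing parameter only when it is declared and omitted.

```python
import functools
import inspect
import pickle
from typing import Any, Callable, Tuple


def old_object_to_tensor(obj: Any) -> Tuple[bytes, int]:
    """Hypothetical stand-in for an older one-argument serializer."""
    payload = pickle.dumps(obj)
    return payload, len(payload)


def new_object_to_tensor(obj: Any, device: str) -> Tuple[bytes, int]:
    """Hypothetical stand-in for a newer serializer with a required `device`."""
    if not device:
        raise ValueError("device must be non-empty")
    # A real implementation would place the buffer on `device`; this only serializes.
    payload = pickle.dumps(obj)
    return payload, len(payload)


def with_default_param(fn: Callable[..., Any], name: str, default: Any) -> Callable[..., Any]:
    """Return a callable that supplies `default` for parameter `name` when `fn`
    declares it and the caller omitted it; otherwise return `fn` unchanged."""
    try:
        sig = inspect.signature(fn)
    except (TypeError, ValueError):
        return fn  # no introspectable signature: leave the call as-is
    if name not in sig.parameters:
        return fn

    @functools.wraps(fn)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        bound = sig.bind_partial(*args, **kwargs)
        if name not in bound.arguments:
            # Only fill the parameter when the caller did not pass it.
            bound.arguments[name] = default
        return fn(*bound.args, **bound.kwargs)

    return wrapper


if __name__ == "__main__":
    state_dict = {"weight": [1.0, 2.0]}
    # Calling the newer stand-in the old way raises the same kind of TypeError.
    try:
        new_object_to_tensor(state_dict)
    except TypeError as exc:
        print("TypeError:", exc)
    # The same one-argument call site works against both stand-in signatures.
    for helper in (old_object_to_tensor, new_object_to_tensor):
        compat = with_default_param(helper, "device", "cpu")
        payload, size = compat(state_dict)
        print(helper.__name__, size == len(payload))
```

Patching a private `torch.distributed` function this way is fragile, since underscore-prefixed helpers carry no stability guarantee; installing PyTorch and ColossalAI versions that match avoids the patch entirely.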