Error when training RT-DETR #27

@iodncookie

python tools/train.py -f exps/rtdetr/rtdetr_r18vd_6x_coco.py -d 2 -b 20 -eb 24 -w 4 -ew 4 -lrs 0.1
This fails with the following error:

2023-08-22 17:46:14 | INFO | mmdet.core.trainer:493 - ---> start train epoch1
2023-08-22 17:46:16 | ERROR | mmdet.core.trainer:98 - one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor []] is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
2023-08-22 17:46:16 | INFO | mmdet.core.trainer:343 - Training of experiment is done and the best AP is 0.00
2023-08-22 17:46:16 | ERROR | mmdet.core.launch:147 - An error has been caught in function '_distributed_worker', process 'SpawnProcess-1' (478), thread 'MainThread' (139673154561728):
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/data/anaconda3/envs/miemie_det/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
│ │ └ 5
│ └ 8
└ <function _main at 0x7f083058cc10>
File "/data/anaconda3/envs/miemie_det/lib/python3.8/multiprocessing/spawn.py", line 129, in _main
return self._bootstrap(parent_sentinel)
│ │ └ 5
│ └ <function BaseProcess._bootstrap at 0x7f083073dee0>

File "/data/anaconda3/envs/miemie_det/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
│ └ <function BaseProcess.run at 0x7f083073d550>

File "/data/anaconda3/envs/miemie_det/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
│ │ │ │ │ └ {}
│ │ │ │ └
│ │ │ └ (<function _distributed_worker at 0x7f07b1e38160>, 0, (<function main at 0x7f077b30d940>, 2, 2, 0, 'nccl', 'tcp://127.0.0.1:5...
│ │ └
│ └ <function _wrap at 0x7f07b1956310>

File "/data/anaconda3/envs/miemie_det/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
│ │ └ (<function main at 0x7f077b30d940>, 2, 2, 0, 'nccl', 'tcp://127.0.0.1:56017', (╒═══════════════════════╤═════════════════════...
│ └ 0
└ <function _distributed_worker at 0x7f07b1e38160>
File "/home/a-bamboo/repositories/miemiedetection/mmdet/core/launch.py", line 147, in _distributed_worker
main_func(*args)
│ └ (╒═══════════════════════╤═══════════════════════════════════════════════════════════════════════════════════════════════════...
└ <function main at 0x7f077b30d940>
File "/home/a-bamboo/repositories/miemiedetection/tools/train.py", line 126, in main
trainer.train()
│ └ <function Trainer.train at 0x7f077a68bc10>
└ <mmdet.core.trainer.Trainer object at 0x7f077a65ce20>
File "/home/a-bamboo/repositories/miemiedetection/mmdet/core/trainer.py", line 96, in train
self.train_in_epoch()
│ └ <function Trainer.train_in_epoch at 0x7f077a68bd30>
└ <mmdet.core.trainer.Trainer object at 0x7f077a65ce20>
File "/home/a-bamboo/repositories/miemiedetection/mmdet/core/trainer.py", line 336, in train_in_epoch
self.train_in_iter()
│ └ <function Trainer.train_in_iter at 0x7f077a68be50>
└ <mmdet.core.trainer.Trainer object at 0x7f077a65ce20>
File "/home/a-bamboo/repositories/miemiedetection/mmdet/core/trainer.py", line 350, in train_in_iter
self.train_one_iter()
│ └ <function Trainer.train_one_iter at 0x7f077a68bee0>
└ <mmdet.core.trainer.Trainer object at 0x7f077a65ce20>
File "/home/a-bamboo/repositories/miemiedetection/mmdet/core/trainer.py", line 462, in train_one_iter
self.scaler.scale(loss).backward()
│ │ │ └ tensor(13467.0713, device='cuda:0', grad_fn=)
│ │ └ <function GradScaler.scale at 0x7f07b2133790>
│ └ <torch.cuda.amp.grad_scaler.GradScaler object at 0x7f077a65ce50>
└ <mmdet.core.trainer.Trainer object at 0x7f077a65ce20>
File "/data/anaconda3/envs/miemie_det/lib/python3.8/site-packages/torch/_tensor.py", line 363, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
│ │ │ │ │ │ │ └ None
│ │ │ │ │ │ └ False
│ │ │ │ │ └ None
│ │ │ │ └ None
│ │ │ └ tensor(13467.0713, device='cuda:0', grad_fn=)
│ │ └ <function backward at 0x7f07b1d6cee0>
│ └ <module 'torch.autograd' from '/data/anaconda3/envs/miemie_det/lib/python3.8/site-packages/torch/autograd/__init__.py'>
└ <module 'torch' from '/data/anaconda3/envs/miemie_det/lib/python3.8/site-packages/torch/__init__.py'>
File "/data/anaconda3/envs/miemie_det/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
│ │ └ <method 'run_backward' of 'torch._C._EngineBase' objects>
│ └ <torch._C._EngineBase object at 0x7f07be7d8d80>
└ <class 'torch.autograd.variable.Variable'>

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor []] is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
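
For reference, here is a minimal sketch of what this class of error looks like on a toy graph and how the hint from the message (`torch.autograd.set_detect_anomaly`) can be applied. It is not taken from miemiedetection: the function name `reproduce_inplace_error` and the toy tensors are hypothetical, and the real offending operation in the RT-DETR code is still unknown.

```python
import torch


def reproduce_inplace_error(detect_anomaly: bool = False) -> str:
    """Trigger the same kind of RuntimeError on a toy graph and return its message.

    Hypothetical helper for illustration only; not part of miemiedetection.
    """
    # set_detect_anomaly can be used as a context manager; when enabled,
    # autograd also prints the forward traceback of the op whose backward failed.
    with torch.autograd.set_detect_anomaly(detect_anomaly):
        w = torch.tensor(2.0, requires_grad=True)
        a = w * 1.0   # scalar tensor (shape []), version 0
        b = a * a     # the multiplication saves `a` for its backward pass
        a.add_(1.0)   # in-place update bumps `a` to version 1
        try:
            b.backward()
        except RuntimeError as err:
            return str(err)
    return ""


if __name__ == "__main__":
    print(reproduce_inplace_error(detect_anomaly=True))
```

In the training script, the same switch could presumably be turned on by calling `torch.autograd.set_detect_anomaly(True)` before `trainer.train()` in `tools/train.py`. Anomaly detection slows training down, so it is only meant for locating the in-place operation, which can then be replaced with an out-of-place one (for example `a = a + 1.0` instead of `a.add_(1.0)`).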
