Description
I want to train the network with --arch 7 on my custom dataset of 62k images, which is similar to DUTS. I am using a CUDA GPU with 48 GB of memory and a batch size of 8. After a few iterations, I get the following error:
Traceback (most recent call last):
  File "main.py", line 55, in <module>
    main(args)
  File "main.py", line 35, in main
    Trainer(args, save_path)
  File "/root/TRACER/trainer.py", line 58, in __init__
    train_loss, train_mae = self.training(args)
  File "/root/TRACER/trainer.py", line 117, in training
    loss.backward()
  File "/root/TRACER/venv/lib/python3.6/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/root/TRACER/venv/lib/python3.6/site-packages/torch/autograd/__init__.py", line 156, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
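As the message suggests, the stack trace above may not point at the real failing call, because CUDA kernels run asynchronously. A minimal sketch of enabling synchronous launches from inside Python follows; the helper name `enable_cuda_launch_blocking` is hypothetical, and the assumption is that this code runs at the very top of main.py, before torch is imported or any CUDA work starts.

```python
import os


def enable_cuda_launch_blocking(env=os.environ):
    """Set CUDA_LAUNCH_BLOCKING=1 in the given environment mapping.

    Hypothetical helper for illustration only. The variable has to be set
    before the process makes its first CUDA call, so the safest place to
    call this is before `import torch`.
    """
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Call this first, then import torch and start training as usual.
enable_cuda_launch_blocking()
```

The same effect can be had from the shell by prefixing the usual training command with `CUDA_LAUNCH_BLOCKING=1`. With synchronous launches, the traceback should stop at the operation that actually triggered the assert, which may be earlier than `loss.backward()`. Training will likely run slower in this mode, so it is best used only while debugging.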