Error after a few iterations of training #34

I want to train the network with `--arch 7` on my custom dataset of 62k images, which is similar to DUTS. I am using a GPU with 48 GB of memory and a batch size of 8. After a few iterations, I get the following error:
Traceback (most recent call last):
  File "main.py", line 55, in <module>
    main(args)
  File "main.py", line 35, in main
    Trainer(args, save_path)
  File "/root/TRACER/trainer.py", line 58, in __init__
    train_loss, train_mae = self.training(args)
  File "/root/TRACER/trainer.py", line 117, in training
    loss.backward()
  File "/root/TRACER/venv/lib/python3.6/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/root/TRACER/venv/lib/python3.6/site-packages/torch/autograd/__init__.py", line 156, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
![Screenshot from 2023-02-17 06-17-54](https://user-images.githubusercontent.com/1842237/219521671-f06db3fb-533d-47f6-9dde-03345d4a17ae.png)
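
The hint in the last line of the traceback can be followed with a sketch like the one below. It sets `CUDA_LAUNCH_BLOCKING=1` before CUDA is initialized, so that kernels run synchronously and the traceback points at the call that actually failed. It also adds a hypothetical helper, `summarize_values`, that counts NaNs and values outside an expected range. The assumption that the loss expects mask values in [0, 1] is not confirmed by the traceback; it is only a common thing to rule out with a custom dataset.

```python
import math
import os

# Must be set before the first CUDA call (safest: before importing torch),
# so kernels run synchronously and the traceback points at the failing op.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def summarize_values(values, low=0.0, high=1.0):
    """Hypothetical helper: count NaNs and values outside [low, high].

    `values` is any iterable of numbers, for example
    `mask.flatten().tolist()` for a tensor or array.
    The default range [0, 1] is an assumption about what the loss expects.
    """
    total = 0
    nan_count = 0
    out_of_range = 0
    for v in values:
        total += 1
        if math.isnan(v):
            nan_count += 1
        elif v < low or v > high:
            out_of_range += 1
    return {"total": total, "nan": nan_count, "out_of_range": out_of_range}


# Example: a mask value left at 255 instead of being scaled to 1.0,
# plus one NaN.
report = summarize_values([0.0, 1.0, 255.0, float("nan")])
print(report)
```

The same environment variable can also be set from the shell, for example `CUDA_LAUNCH_BLOCKING=1 python main.py --arch 7` plus whatever other arguments the run already uses. Running the helper on each batch's masks and model outputs just before the loss is computed would show whether any values fall outside the expected range.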

