Description
Any idea what might cause this after several successful training steps? The crash occurs in the backward pass of the top-k sparse attention kernel:
[rank4]: File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/transformers/trainer.py", line 3791, in training_step
[rank4]: self.accelerator.backward(loss, **kwargs)
[rank4]: File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/accelerator.py", line 2473, in backward
[rank4]: loss.backward(**kwargs)
[rank4]: File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/_tensor.py", line 648, in backward
[rank4]: torch.autograd.backward(
[rank4]: File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/autograd/init.py", line 353, in backward
[rank4]: _engine_run_backward(
[rank4]: File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/autograd/graph.py", line 824, in _engine_run_backward
[rank4]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]: File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/autograd/function.py", line 307, in apply
[rank4]: return user_fn(self, *args)
[rank4]: ^^^^^^^^^^^^^^^^^^^^
[rank4]: File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/native_sparse_attention/ops/triton/topk_sparse_attention.py", line 1164, in backward
[rank4]: dq, dk, dv = _topk_sparse_attention_bwd(
[rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]: File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/native_sparse_attention/ops/triton/topk_sparse_attention.py", line 997, in _topk_sparse_attention_bwd
[rank4]: backward_dkdv[grid](
[rank4]: File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/triton/runtime/jit.py", line 347, in
[rank4]: return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
[rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]: File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/triton/runtime/jit.py", line 591, in run
[rank4]: kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata,
[rank4]: File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/triton/backends/nvidia/driver.py", line 529, in call
[rank4]: self.launch(gridX, gridY, gridZ, stream, function, self.launch_cooperative_grid, global_scratch, *args)
[rank4]: RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered
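
In case it helps narrow this down (a diagnostic sketch, not a fix): CUDA illegal-memory-access errors are reported asynchronously, so the frame in the traceback (the `backward_dkdv` launch) is not necessarily the kernel that actually faulted. Forcing synchronous launches makes the error surface at the real launch site; the environment variable has to be set before PyTorch initializes CUDA.

```python
# Diagnostic sketch only: force synchronous kernel launches so the illegal
# access is reported at the kernel that actually faulted rather than at a
# later launch. Must run before torch creates a CUDA context, e.g. at the
# very top of the training entry point (or export it in the shell instead).
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # noqa: E402  (imported only after the env var is in place)
```

Running a single rank under NVIDIA's compute-sanitizer, e.g. `compute-sanitizer python <your_train_script>.py` (script name is a placeholder), can additionally report the exact out-of-bounds address inside the Triton kernel.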