Understanding Internal: RET_CHECK failure stack traces
#6580
-
In netket we use numba to create custom jax primitives for things that we cannot write in standard jax. In some cases, this leads to issues. There is a test checking that this feature works, and it triggers the failure below. Can someone help me decipher where the error might be coming from? The trace is the following:
2021-04-28 14:24:17.533804: E external/org_tensorflow/tensorflow/compiler/xla/status_macros.cc:56] Internal: RET_CHECK failure (external/org_tensorflow/tensorflow/compiler/xla/service/gpu/outfeed_thunk.cc:80) ShapeUtil::Equal(source_slices_[index].shape, output_shape) Mismatch between outfeed output buffer shape u32[2]{0} and outfeed source buffer shape s64[]
*** Begin stack trace ***
_PyObject_MakeTpCall
_PyEval_EvalFrameDefault
_PyEval_EvalCodeWithName
_PyFunction_Vectorcall
_PyObject_FastCallDict
PyObject_Call
_PyEval_EvalFrameDefault
_PyEval_EvalCodeWithName
_PyFunction_Vectorcall
PyVectorcall_Call
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalCodeWithName
_PyFunction_Vectorcall
PyVectorcall_Call
_PyEval_EvalFrameDefault
_PyEval_EvalCodeWithName
_PyFunction_Vectorcall
PyVectorcall_Call
_PyEval_EvalFrameDefault
_PyEval_EvalCodeWithName
_PyFunction_Vectorcall
PyVectorcall_Call
_PyEval_EvalFrameDefault
_PyEval_EvalCodeWithName
_PyFunction_Vectorcall
PyVectorcall_Call
_PyObject_MakeTpCall
PyVectorcall_Call
_PyObject_MakeTpCall
_PyEval_EvalFrameDefault
_PyEval_EvalCodeWithName
_PyFunction_Vectorcall
_PyEval_EvalFrameDefault
_PyEval_EvalCodeWithName
_PyFunction_Vectorcall
PyVectorcall_Call
Py_FinalizeEx
Py_Exit
PyRun_SimpleFileExFlags
Py_BytesMain
__libc_start_main
_start
*** End stack trace ***
2021-04-28 14:24:17.533857: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:1955] Execution of replica 0 failed: Internal: RET_CHECK failure (external/org_tensorflow/tensorflow/compiler/xla/service/gpu/outfeed_thunk.cc:80) ShapeUtil::Equal(source_slices_[index].shape, output_shape) Mismatch between outfeed output buffer shape u32[2]{0} and outfeed source buffer shape s64[]
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/home/filippovicentini/Documents/pythonenvs/netket_env/lib64/python3.8/site-packages/jax/experimental/host_callback.py", line 1551, in exit_handler
barrier_wait("at_exit")
File "/home/filippovicentini/Documents/pythonenvs/netket_env/lib64/python3.8/site-packages/jax/experimental/host_callback.py", line 1597, in barrier_wait
api.jit(lambda x: id_tap(barrier_tap, x), device=d)(x_on_dev)
jax._src.traceback_util.FilteredStackTrace: RuntimeError: Internal: RET_CHECK failure (external/org_tensorflow/tensorflow/compiler/xla/service/gpu/outfeed_thunk.cc:80) ShapeUtil::Equal(source_slices_[index].shape, output_shape) Mismatch between outfeed output buffer shape u32[2]{0} and outfeed source buffer shape s64[]
The stack trace above excludes JAX-internal frames.
The following is the original exception that occurred, unmodified.
--------------------
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/filippovicentini/Documents/pythonenvs/netket_env/lib64/python3.8/site-packages/jax/_src/traceback_util.py", line 139, in reraise_with_filtered_traceback
return fun(*args, **kwargs)
File "/home/filippovicentini/Documents/pythonenvs/netket_env/lib64/python3.8/site-packages/jax/api.py", line 332, in cache_miss
out_flat = xla.xla_call(
File "/home/filippovicentini/Documents/pythonenvs/netket_env/lib64/python3.8/site-packages/jax/core.py", line 1402, in bind
return call_bind(self, fun, *args, **params)
File "/home/filippovicentini/Documents/pythonenvs/netket_env/lib64/python3.8/site-packages/jax/core.py", line 1393, in call_bind
outs = primitive.process(top_trace, fun, tracers, params)
File "/home/filippovicentini/Documents/pythonenvs/netket_env/lib64/python3.8/site-packages/jax/core.py", line 1405, in process
return trace.process_call(self, fun, tracers, params)
File "/home/filippovicentini/Documents/pythonenvs/netket_env/lib64/python3.8/site-packages/jax/core.py", line 600, in process_call
return primitive.impl(f, *tracers, **params)
File "/home/filippovicentini/Documents/pythonenvs/netket_env/lib64/python3.8/site-packages/jax/interpreters/xla.py", line 579, in _xla_call_impl
return compiled_fun(*args)
File "/home/filippovicentini/Documents/pythonenvs/netket_env/lib64/python3.8/site-packages/jax/interpreters/xla.py", line 830, in _execute_compiled
out_bufs = compiled.execute(input_bufs)
RuntimeError: Internal: RET_CHECK failure (external/org_tensorflow/tensorflow/compiler/xla/service/gpu/outfeed_thunk.cc:80) ShapeUtil::Equal(source_slices_[index].shape, output_shape) Mismatch between outfeed output buffer shape u32[2]{0} and outfeed source buffer shape s64[]
2021-04-28 14:24:17.748355: E external/org_tensorflow/tensorflow/compiler/xla/status_macros.cc:56] Internal: RET_CHECK failure (external/org_tensorflow/tensorflow/compiler/xla/service/gpu/outfeed_thunk.cc:80) ShapeUtil::Equal(source_slices_[index].shape, output_shape) Mismatch between outfeed output buffer shape s64[] and outfeed source buffer shape u32[2]{0}
*** Begin stack trace ***
_PyModule_ClearDict
PyImport_Cleanup
Py_FinalizeEx
Py_Exit
PyRun_SimpleFileExFlags
Py_BytesMain
__libc_start_main
_start
*** End stack trace ***
2021-04-28 14:24:17.748395: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:1955] Execution of replica 0 failed: Internal: RET_CHECK failure (external/org_tensorflow/tensorflow/compiler/xla/service/gpu/outfeed_thunk.cc:80) ShapeUtil::Equal(source_slices_[index].shape, output_shape) Mismatch between outfeed output buffer shape s64[] and outfeed source buffer shape u32[2]{0}
2021-04-28 14:24:17.748612: F external/org_tensorflow/tensorflow/compiler/xla/python/outfeed_receiver.cc:269] Check failed: SendShutdownOutfeedHeader(device_idx).ok()
Fatal Python error: Aborted
Current thread 0x00007fa74fc24740 (most recent call first):
<no Python frame>
zsh: IOT instruction (core dumped) pytest -n0 Test/Operator/ -s -k "closure"
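As a reading aid for the trace above, here is a minimal sketch that pulls the mismatching shapes out of the log text with a regular expression. The helper name `outfeed_mismatches` is hypothetical (it is not part of jax or XLA), and the pattern only assumes the message wording that appears in this trace.

```python
import re

# Matches the shape-mismatch message printed by the outfeed RET_CHECK in the trace.
MISMATCH = re.compile(
    r"outfeed output buffer shape (?P<output>\S+) "
    r"and outfeed source buffer shape (?P<source>\S+)"
)


def outfeed_mismatches(log_text):
    """Return distinct (output_shape, source_shape) pairs in order of first appearance."""
    pairs = []
    for match in MISMATCH.finditer(log_text):
        pair = (match.group("output"), match.group("source"))
        if pair not in pairs:
            pairs.append(pair)
    return pairs


# Two message lines copied from the trace (first and second failure).
log = """
Mismatch between outfeed output buffer shape u32[2]{0} and outfeed source buffer shape s64[]
Mismatch between outfeed output buffer shape s64[] and outfeed source buffer shape u32[2]{0}
"""

pairs = outfeed_mismatches(log)
for output_shape, source_shape in pairs:
    print(f"output={output_shape} source={source_shape}")
```

Run on the full trace, this reports the same two shapes with their roles swapped between the first and the second failure, which is easy to miss when reading the raw log.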
Replies: 2 comments 4 replies
-
-
For reference, it's because of #5577.
@gnecula
Are you using multiple GPUs? There are some known bugs in the current jaxlib release with multiple GPUs and experimental.host_callback.
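One way to check whether the multi-GPU issue is involved is to rerun with a single visible GPU. A minimal sketch, assuming CUDA GPUs and assuming the environment variable is set before jax is first imported in the process (the choice of device index "0" is arbitrary):

```python
import os

# Assumption: CUDA backend. Restrict the process to one GPU; this must happen
# before jax is first imported, so that its backends only see that device.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

try:
    import jax

    # With the restriction in place, at most one GPU should be listed here.
    print(jax.devices())
except ImportError:
    print("jax is not installed in this environment")
```

If the failing test passes with a single visible GPU, that would be consistent with the multi-GPU explanation, though it does not prove it.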