Open
Description
We have been getting this periodically on gpu-celery since May 19 or earlier.
2025-05-28 00:05:37.280487: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2025-05-28 00:05:37.286499: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2025-05-28 00:05:37.286580: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: 97a1c44cfb72
2025-05-28 00:05:37.286591: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: 97a1c44cfb72
2025-05-28 00:05:37.286828: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 550.54.15
2025-05-28 00:05:37.286881: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 550.54.15
2025-05-28 00:05:37.286887: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 550.54.15
2025-05-28 00:05:37.333285: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-05-28 00:05:40.358274: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2025-05-28 00:05:40.358975: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2494140000 Hz
2025-05-28 14:55:17.517546: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2025-05-28 14:55:17.521837: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2025-05-28 14:55:17.521910: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: 97a1c44cfb72
2025-05-28 14:55:17.521920: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: 97a1c44cfb72
2025-05-28 14:55:17.522183: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 550.54.15
2025-05-28 14:55:17.522238: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 550.54.15
2025-05-28 14:55:17.522243: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 550.54.15
2025-05-28 14:55:17.565394: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-05-28 14:55:20.701897: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2025-05-28 14:55:20.702611: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2494140000 Hz
2025-05-28 14:55:22.471843: W tensorflow/core/framework/op_kernel.cc:1755] Unknown: error: OpenCV(4.1.2) /io/opencv/modules/imgproc/src/filter.dispatch.cpp:140: error: (-215:Assertion failed) 0 <= anchor.x && anchor.x < ksize.width && 0 <= anchor.y && anchor.y < ksize.height in function 'init'
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/script_ops.py", line 249, in __call__
ret = func(*args)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/autograph/impl/api.py", line 645, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/data/ops/dataset_ops.py", line 961, in generator_py_func
values = next(generator_state.get_iterator(iterator_id))
File "/usr/local/lib/python3.7/dist-packages/tfaip/data/pipeline/runningdatapipeline.py", line 164, in generator
for s in samples:
File "/usr/local/lib/python3.7/dist-packages/tfaip/data/pipeline/runningdatapipeline.py", line 214, in _generate_input_samples
for s in generate:
File "/usr/local/lib/python3.7/dist-packages/tfaip/data/pipeline/processor/sample/processorpipeline.py", line 91, in _apply
r = processor.apply_on_sample(sample)
File "/usr/local/lib/python3.7/dist-packages/tfaip/data/pipeline/processor/dataprocessor.py", line 231, in apply_on_sample
return self.apply(sample.copy())
File "/usr/local/lib/python3.7/dist-packages/tfaip/data/pipeline/processor/dataprocessor.py", line 257, in apply
sample = p(sample)
File "/usr/local/lib/python3.7/dist-packages/typeguard/__init__.py", line 1033, in wrapper
retval = func(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/tfaip/data/pipeline/processor/dataprocessor.py", line 147, in __call__
return self.apply(sample)
File "/usr/local/lib/python3.7/dist-packages/calamari_ocr/ocr/dataset/imageprocessors/data_preprocessor.py", line 8, in apply
return sample.new_inputs(self._apply_single(sample.inputs, sample.meta))
File "/usr/local/lib/python3.7/dist-packages/calamari_ocr/ocr/dataset/imageprocessors/center_normalizer.py", line 35, in _apply_single
out, params = self.normalize(data.astype(np.uint8))
File "/usr/local/lib/python3.7/dist-packages/calamari_ocr/ocr/dataset/imageprocessors/center_normalizer.py", line 138, in normalize
dewarped = self.dewarp(img, cval=cval)
File "/usr/local/lib/python3.7/dist-packages/calamari_ocr/ocr/dataset/imageprocessors/center_normalizer.py", line 96, in dewarp
center, r = self.measure(inverted)
File "/usr/local/lib/python3.7/dist-packages/calamari_ocr/ocr/dataset/imageprocessors/center_normalizer.py", line 51, in measure
smoothed += 0.001 * cv.blur(smoothed, (w, int(h * 0.5)), borderType=cv.BORDER_CONSTANT)
cv2.error: OpenCV(4.1.2) /io/opencv/modules/imgproc/src/filter.dispatch.cpp:140: error: (-215:Assertion failed) 0 <= anchor.x && anchor.x < ksize.width && 0 <= anchor.y && anchor.y < ksize.height in function 'init'
Inside gpu-celery:
root@d8078d484565:/# nvidia-smi
Failed to initialize NVML: Unknown Error
As a result, all current GPU jobs are running on the CPU (and are very slow).
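To keep jobs from silently falling back to the CPU, a worker could check that the GPU is visible before accepting work. The following is only a minimal sketch under assumptions: `gpu_visible` is a hypothetical helper, not part of the existing code, and it assumes that `nvidia-smi -L` exits with status 0 and prints lines starting with `GPU ` when a device is reachable. How and where such a check would be wired into gpu-celery is not covered here.

```python
import shutil
import subprocess


def gpu_visible(cmd=("nvidia-smi", "-L"), timeout=10):
    """Return True if `cmd` runs successfully and lists at least one GPU.

    Hypothetical helper: `cmd` defaults to `nvidia-smi -L`, which is assumed
    to print one "GPU <index>: ..." line per visible device.
    """
    # The binary may be missing entirely inside the container.
    if shutil.which(cmd[0]) is None:
        return False
    try:
        result = subprocess.run(
            list(cmd), capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # An NVML failure such as "Failed to initialize NVML: Unknown Error"
    # prints no "GPU " line, so it is rejected here even if the exit status
    # happened to be 0.
    return result.returncode == 0 and "GPU " in result.stdout
```

A worker entrypoint or container health check could call `gpu_visible()` and exit with a non-zero status when it returns `False`, so the failure surfaces immediately instead of as slow CPU-only jobs.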