opencv error on production

We have been seeing this periodically on `gpu-celery` since at least May 19.
```
2025-05-28 00:05:37.280487: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2025-05-28 00:05:37.286499: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2025-05-28 00:05:37.286580: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: 97a1c44cfb72
2025-05-28 00:05:37.286591: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: 97a1c44cfb72
2025-05-28 00:05:37.286828: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 550.54.15
2025-05-28 00:05:37.286881: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 550.54.15
2025-05-28 00:05:37.286887: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 550.54.15
2025-05-28 00:05:37.333285: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-05-28 00:05:40.358274: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2025-05-28 00:05:40.358975: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2494140000 Hz
2025-05-28 14:55:17.517546: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2025-05-28 14:55:17.521837: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2025-05-28 14:55:17.521910: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: 97a1c44cfb72
2025-05-28 14:55:17.521920: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: 97a1c44cfb72
2025-05-28 14:55:17.522183: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 550.54.15
2025-05-28 14:55:17.522238: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 550.54.15
2025-05-28 14:55:17.522243: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 550.54.15
2025-05-28 14:55:17.565394: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-05-28 14:55:20.701897: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2025-05-28 14:55:20.702611: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2494140000 Hz
2025-05-28 14:55:22.471843: W tensorflow/core/framework/op_kernel.cc:1755] Unknown: error: OpenCV(4.1.2) /io/opencv/modules/imgproc/src/filter.dispatch.cpp:140: error: (-215:Assertion failed) 0 <= anchor.x && anchor.x < ksize.width && 0 <= anchor.y && anchor.y < ksize.height in function 'init'

Traceback (most recent call last):

  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/script_ops.py", line 249, in __call__
    ret = func(*args)

  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/autograph/impl/api.py", line 645, in wrapper
    return func(*args, **kwargs)

  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/data/ops/dataset_ops.py", line 961, in generator_py_func
    values = next(generator_state.get_iterator(iterator_id))

  File "/usr/local/lib/python3.7/dist-packages/tfaip/data/pipeline/runningdatapipeline.py", line 164, in generator
    for s in samples:

  File "/usr/local/lib/python3.7/dist-packages/tfaip/data/pipeline/runningdatapipeline.py", line 214, in _generate_input_samples
    for s in generate:

  File "/usr/local/lib/python3.7/dist-packages/tfaip/data/pipeline/processor/sample/processorpipeline.py", line 91, in _apply
    r = processor.apply_on_sample(sample)

  File "/usr/local/lib/python3.7/dist-packages/tfaip/data/pipeline/processor/dataprocessor.py", line 231, in apply_on_sample
    return self.apply(sample.copy())

  File "/usr/local/lib/python3.7/dist-packages/tfaip/data/pipeline/processor/dataprocessor.py", line 257, in apply
    sample = p(sample)

  File "/usr/local/lib/python3.7/dist-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)

  File "/usr/local/lib/python3.7/dist-packages/tfaip/data/pipeline/processor/dataprocessor.py", line 147, in __call__
    return self.apply(sample)

  File "/usr/local/lib/python3.7/dist-packages/calamari_ocr/ocr/dataset/imageprocessors/data_preprocessor.py", line 8, in apply
    return sample.new_inputs(self._apply_single(sample.inputs, sample.meta))

  File "/usr/local/lib/python3.7/dist-packages/calamari_ocr/ocr/dataset/imageprocessors/center_normalizer.py", line 35, in _apply_single
    out, params = self.normalize(data.astype(np.uint8))

  File "/usr/local/lib/python3.7/dist-packages/calamari_ocr/ocr/dataset/imageprocessors/center_normalizer.py", line 138, in normalize
    dewarped = self.dewarp(img, cval=cval)

  File "/usr/local/lib/python3.7/dist-packages/calamari_ocr/ocr/dataset/imageprocessors/center_normalizer.py", line 96, in dewarp
    center, r = self.measure(inverted)

  File "/usr/local/lib/python3.7/dist-packages/calamari_ocr/ocr/dataset/imageprocessors/center_normalizer.py", line 51, in measure
    smoothed += 0.001 * cv.blur(smoothed, (w, int(h * 0.5)), borderType=cv.BORDER_CONSTANT)

cv2.error: OpenCV(4.1.2) /io/opencv/modules/imgproc/src/filter.dispatch.cpp:140: error: (-215:Assertion failed) 0 <= anchor.x && anchor.x < ksize.width && 0 <= anchor.y && anchor.y < ksize.height in function 'init'

```
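The assertion at the end of the traceback (`anchor.y < ksize.height`) is what OpenCV typically raises when a kernel dimension is 0. In the failing line the kernel is `(w, int(h * 0.5))`, so a 1-pixel-high image (or `w == 0`) would produce this. That is an assumption; the actual `w` and `h` of the failing sample are not in the log. A minimal sketch of a guard, where `blur_ksize` is a hypothetical helper and not part of Calamari:

```python
def blur_ksize(w, h, frac=0.5):
    """Return a (width, height) kernel size with every dimension >= 1.

    Hypothetical helper: mirrors the (w, int(h * 0.5)) expression from the
    traceback, but clamps each dimension so it can never be 0.
    """
    return max(1, int(w)), max(1, int(h * frac))


# A 1-pixel-high image would otherwise yield a kernel height of 0.
print(blur_ksize(10, 1))   # (10, 1)
print(blur_ksize(12, 7))   # (12, 3)
```

Clamping changes the smoothing for degenerate images; skipping or logging such samples before they reach the normalizer may be the better fix.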
Inside `gpu-celery`:
```
root@d8078d484565:/# nvidia-smi
Failed to initialize NVML: Unknown Error
```
As a result, all current GPU jobs are running on the CPU, which is very slow.
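To keep workers from silently falling back to the CPU, a startup check could refuse to take GPU jobs when the driver is unreachable. A minimal sketch, assuming `nvidia-smi` is on the worker's `PATH` and exits non-zero when NVML fails to initialize; `gpu_visible` is a hypothetical name, not existing code:

```python
import shutil
import subprocess


def gpu_visible(cmd="nvidia-smi", timeout=10.0):
    """Return True if `cmd -L` (list GPUs) runs and exits with status 0."""
    if shutil.which(cmd) is None:
        # Binary not installed or not on PATH: treat as no GPU.
        return False
    try:
        result = subprocess.run(
            [cmd, "-L"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    if not gpu_visible():
        print("GPU not visible in this container; refusing GPU jobs")
```

Where exactly to call this (worker start, per task, or a container health check) depends on how `gpu-celery` is deployed, which is not shown here.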

opencv error on production #1302
