opencv error on production #1302

@homework36

We have been seeing this error periodically on gpu-celery since at least May 19.

2025-05-28 00:05:37.280487: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2025-05-28 00:05:37.286499: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2025-05-28 00:05:37.286580: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: 97a1c44cfb72
2025-05-28 00:05:37.286591: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: 97a1c44cfb72
2025-05-28 00:05:37.286828: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 550.54.15
2025-05-28 00:05:37.286881: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 550.54.15
2025-05-28 00:05:37.286887: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 550.54.15
2025-05-28 00:05:37.333285: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-05-28 00:05:40.358274: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2025-05-28 00:05:40.358975: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2494140000 Hz
2025-05-28 14:55:17.517546: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2025-05-28 14:55:17.521837: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2025-05-28 14:55:17.521910: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: 97a1c44cfb72
2025-05-28 14:55:17.521920: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: 97a1c44cfb72
2025-05-28 14:55:17.522183: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 550.54.15
2025-05-28 14:55:17.522238: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 550.54.15
2025-05-28 14:55:17.522243: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 550.54.15
2025-05-28 14:55:17.565394: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-05-28 14:55:20.701897: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2025-05-28 14:55:20.702611: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2494140000 Hz
2025-05-28 14:55:22.471843: W tensorflow/core/framework/op_kernel.cc:1755] Unknown: error: OpenCV(4.1.2) /io/opencv/modules/imgproc/src/filter.dispatch.cpp:140: error: (-215:Assertion failed) 0 <= anchor.x && anchor.x < ksize.width && 0 <= anchor.y && anchor.y < ksize.height in function 'init'

Traceback (most recent call last):

  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/script_ops.py", line 249, in __call__
    ret = func(*args)

  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/autograph/impl/api.py", line 645, in wrapper
    return func(*args, **kwargs)

  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/data/ops/dataset_ops.py", line 961, in generator_py_func
    values = next(generator_state.get_iterator(iterator_id))

  File "/usr/local/lib/python3.7/dist-packages/tfaip/data/pipeline/runningdatapipeline.py", line 164, in generator
    for s in samples:

  File "/usr/local/lib/python3.7/dist-packages/tfaip/data/pipeline/runningdatapipeline.py", line 214, in _generate_input_samples
    for s in generate:

  File "/usr/local/lib/python3.7/dist-packages/tfaip/data/pipeline/processor/sample/processorpipeline.py", line 91, in _apply
    r = processor.apply_on_sample(sample)

  File "/usr/local/lib/python3.7/dist-packages/tfaip/data/pipeline/processor/dataprocessor.py", line 231, in apply_on_sample
    return self.apply(sample.copy())

  File "/usr/local/lib/python3.7/dist-packages/tfaip/data/pipeline/processor/dataprocessor.py", line 257, in apply
    sample = p(sample)

  File "/usr/local/lib/python3.7/dist-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)

  File "/usr/local/lib/python3.7/dist-packages/tfaip/data/pipeline/processor/dataprocessor.py", line 147, in __call__
    return self.apply(sample)

  File "/usr/local/lib/python3.7/dist-packages/calamari_ocr/ocr/dataset/imageprocessors/data_preprocessor.py", line 8, in apply
    return sample.new_inputs(self._apply_single(sample.inputs, sample.meta))

  File "/usr/local/lib/python3.7/dist-packages/calamari_ocr/ocr/dataset/imageprocessors/center_normalizer.py", line 35, in _apply_single
    out, params = self.normalize(data.astype(np.uint8))

  File "/usr/local/lib/python3.7/dist-packages/calamari_ocr/ocr/dataset/imageprocessors/center_normalizer.py", line 138, in normalize
    dewarped = self.dewarp(img, cval=cval)

  File "/usr/local/lib/python3.7/dist-packages/calamari_ocr/ocr/dataset/imageprocessors/center_normalizer.py", line 96, in dewarp
    center, r = self.measure(inverted)

  File "/usr/local/lib/python3.7/dist-packages/calamari_ocr/ocr/dataset/imageprocessors/center_normalizer.py", line 51, in measure
    smoothed += 0.001 * cv.blur(smoothed, (w, int(h * 0.5)), borderType=cv.BORDER_CONSTANT)

cv2.error: OpenCV(4.1.2) /io/opencv/modules/imgproc/src/filter.dispatch.cpp:140: error: (-215:Assertion failed) 0 <= anchor.x && anchor.x < ksize.width && 0 <= anchor.y && anchor.y < ksize.height in function 'init'
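
The assertion in the traceback compares the filter anchor against the kernel size. With the default anchor, it can fail when one kernel dimension is 0. In the `cv.blur(smoothed, (w, int(h * 0.5)), ...)` call that would happen for a line image only 1 pixel high, since `int(1 * 0.5)` is 0. This is an unconfirmed guess at the cause, not a diagnosis. A minimal sketch of a guard, assuming `h` and `w` are the image height and width; the helper name `blur_kernel_size` is hypothetical and not part of calamari_ocr:

```python
def blur_kernel_size(width, height, height_factor=0.5):
    """Return a (kernel_width, kernel_height) pair with both entries >= 1.

    Hypothetical helper: clamps each dimension so a box filter never
    receives a zero-sized kernel for very small images.
    """
    kernel_width = max(1, int(width))
    kernel_height = max(1, int(height * height_factor))
    return kernel_width, kernel_height


# Assumed usage inside a measure()-like function (not the library's code):
#   h, w = smoothed.shape
#   smoothed += 0.001 * cv.blur(smoothed, blur_kernel_size(w, h),
#                               borderType=cv.BORDER_CONSTANT)
```

Clamping only avoids the kernel-size assertion. Empty or degenerate line images would probably still be better skipped before they reach the preprocessor.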

Inside gpu-celery:

root@d8078d484565:/# nvidia-smi
Failed to initialize NVML: Unknown Error

As a result, all current GPU jobs are running on the CPU, which is very slow.
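
To make the silent CPU fallback visible, the worker could check for a GPU at startup and refuse to take jobs when none is found. A rough sketch using only the standard library and `nvidia-smi -L`; the function name `gpu_visible` and the idea of calling it at worker startup are assumptions, not something the current gpu-celery setup does:

```python
import shutil
import subprocess


def gpu_visible(timeout_s=10.0):
    """Return True if `nvidia-smi -L` runs successfully and lists a GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # The tool is not installed or not on PATH in this container.
        return False
    try:
        result = subprocess.run(
            [exe, "-L"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # A healthy device list contains lines such as "GPU 0: ...".
    return result.returncode == 0 and b"GPU " in result.stdout


if __name__ == "__main__":
    if not gpu_visible():
        raise SystemExit("No GPU visible in this container; refusing to start.")
```

Exiting early would turn the current slow CPU runs into an explicit failure that a container restart policy or an alert could pick up.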
