Onnxruntime backend error when workload is high since Triton uses CUDA 12 #203
Comments
Thanks for reporting - is there a set of public models that can be used to reproduce? I am going to transfer this to the ONNX Runtime backend repo for tracking and support.
I believe it's not related to specific models but is a general problem. Sadly, I don't have public models to share.
We are also running into this issue, and I can confirm that version 22.10 was working fine; we started seeing the errors specifically after upgrading. @zeruniverse We are using CenterNet detection models. What kind of models are you using?
Also running into this problem after upgrading from 22.12 to 23.04 and later.
I have the same issue!
We are having the same issue while running yolov11 models on
Did anyone solve the issue?
It seems it's still not fixed. I have converted all our models to LibTorch models and use pytorch_backend. Inference is slightly slower, but it's stable.
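(For reference, a minimal sketch of that kind of conversion, assuming the original PyTorch model is still available; the tiny model, input shape, and output path below are placeholders, not the commenter's actual setup.)

```python
# Minimal sketch of the workaround described above: trace the original PyTorch
# model to TorchScript so it can be served by Triton's pytorch_backend instead
# of the ONNX Runtime backend. The tiny model, input shape, and output path are
# placeholders for illustration only.
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)

    def forward(self, x):
        return torch.relu(self.conv(x))

model = TinyModel().eval()
example_input = torch.randn(1, 3, 224, 224)

# Trace and save in the layout pytorch_backend expects:
# <model_repository>/<model_name>/1/model.pt
with torch.no_grad():
    traced = torch.jit.trace(model, example_input)
traced.save("model.pt")
```

The model's config.pbtxt would then point at the PyTorch backend (e.g. `backend: "pytorch"`) instead of `onnxruntime`.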
Description
When the workload is high, some models in the Triton ONNX Runtime backend will fail, and after a model fails it will never succeed again. Failures look like `SafeIntOnOverflow` (also see microsoft/onnxruntime#12288 for this; I'm not the only one facing this problem) and `Failed to allocate memory`.

After the `Failed to allocate memory` issue occurs, I opened nvidia-smi to check memory usage: the usage peak does not reach 100%, but all subsequent inferences fail. The following Prometheus dashboard shows that when the model says `Failed to allocate memory`, GRAM usage is actually low.

Triton Information
What version of Triton are you using?
Tried r23.03, r23.05, and r23.06, all with the same problem. r22.07 is OK.
Are you using the Triton container or did you build it yourself?
Triton container
To Reproduce
Steps to reproduce the behavior.
Put ~30 ONNX Runtime models in Triton, set `memory_arena_shrinkage`, and keep running them until one model says `SafeIntOnOverflow` or `Failed to allocate memory`. After that, that model will never succeed again unless you restart Triton.
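(A rough client-side sketch of this load pattern, assuming the HTTP endpoint on the default port; the model names, input name, shape, and dtype below are placeholders for whatever the ~30 real models expect.)

```python
# Rough load-generation sketch for the reproduction above: run many ONNX models
# concurrently until one starts failing. Model names, input name/shape/dtype,
# and the server URL are placeholders, not the actual models from this report.
import threading
import numpy as np
import tritonclient.http as httpclient

URL = "localhost:8000"
MODELS = [f"model_{i}" for i in range(30)]          # ~30 ONNX Runtime models
INPUT_NAME, SHAPE, DTYPE = "input", [1, 3, 224, 224], "FP32"

def hammer(model_name: str) -> None:
    client = httpclient.InferenceServerClient(url=URL)
    data = np.random.rand(*SHAPE).astype(np.float32)
    while True:
        inp = httpclient.InferInput(INPUT_NAME, SHAPE, DTYPE)
        inp.set_data_from_numpy(data)
        try:
            client.infer(model_name=model_name, inputs=[inp])
        except Exception as exc:
            # Once SafeIntOnOverflow / "Failed to allocate memory" shows up,
            # the affected model keeps failing until Triton is restarted.
            print(f"{model_name}: {exc}")

threads = [threading.Thread(target=hammer, args=(m,), daemon=True) for m in MODELS]
for t in threads:
    t.start()
for t in threads:
    t.join()
```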
Expected behavior
A clear and concise description of what you expected to happen.
`SafeIntOnOverflow` should not happen; I never saw this error in r22.07. `Failed to allocate memory` should only happen if GRAM is really full, and once it happens, it should not be raised again after other models finish inferencing and arena shrinkage returns GRAM to the system.
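(For context, a hedged sketch of what the arena-shrinkage setting corresponds to in plain ONNX Runtime: the run-config key below is the one ORT documents for shrinking the memory arena after a run; the model path and input shape are placeholders.)

```python
# Hedged sketch of arena shrinkage in plain ONNX Runtime, i.e. roughly what the
# backend's memory_arena_shrinkage parameter requests: after a run, freed arena
# memory is handed back to the system instead of being held by the arena.
# "model.onnx" and the input shape are placeholders.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])

run_options = ort.RunOptions()
# Ask ORT to shrink the GPU memory arena on device 0 once this run completes.
run_options.add_run_config_entry("memory.enable_memory_arena_shrinkage", "gpu:0")

input_name = session.get_inputs()[0].name
dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)  # placeholder shape
outputs = session.run(None, {input_name: dummy}, run_options=run_options)
```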