Onnxruntime backend error when workload is high since Triton uses CUDA 12 #203
Comments
Thanks for reporting - is there a set of public models that can be used to reproduce? I am going to transfer this to the ONNX Runtime backend repo for tracking and support.
I believe it's not related to specific models but is a general problem. Sadly, I don't have public models to share.
We are also running into this issue, and I can confirm that version 22.10 was working fine; we started seeing the errors specifically after upgrading. @zeruniverse We are using CenterNet detection models. What kind of models are you using?
Also running into this problem after upgrading from 22.12 to 23.04 and later.
I have the same issue!
We are having the same issue while running yolov11 models on
Did anyone solve the issue?
It seems it's still not fixed. I have converted all our models to LibTorch models and use pytorch_backend. Inference is slightly slower, but it's stable.
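(For reference, a minimal sketch of that kind of conversion, assuming the original PyTorch model is still available; the tiny model, input shape, and output path below are placeholders, not the commenter's actual setup.)

```python
# Minimal sketch of the workaround described above: trace the original PyTorch
# model to TorchScript so it can be served by Triton's pytorch_backend instead
# of the ONNX Runtime backend. The tiny model, input shape, and output path are
# placeholders for illustration only.
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)

    def forward(self, x):
        return torch.relu(self.conv(x))

model = TinyModel().eval()
example_input = torch.randn(1, 3, 224, 224)

# Trace and save in the layout pytorch_backend expects:
# <model_repository>/<model_name>/1/model.pt
with torch.no_grad():
    traced = torch.jit.trace(model, example_input)
traced.save("model.pt")
```

The model's config.pbtxt would then point at the PyTorch backend (e.g. `backend: "pytorch"`) instead of `onnxruntime`.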
Description
When the workload is high, some models in the Triton ONNX Runtime backend will fail, and after a model fails it will never succeed again. Failures look like `SafeIntOnOverflow` (also see microsoft/onnxruntime#12288 for this; I'm not the only one facing this problem) and `Failed to allocate memory`.

After the `Failed to allocate memory` issue occurs, I opened nvidia-smi to check memory usage: the usage peak does not reach 100%, but all subsequent inferences fail. The following Prometheus dashboard shows that when the model says `Failed to allocate memory`, GRAM usage is actually low.

Triton Information
What version of Triton are you using?
Tried r23.03, r23.05, and r23.06, all with the same problem. r22.07 is OK.
Are you using the Triton container or did you build it yourself?
Triton container
To Reproduce
Steps to reproduce the behavior.
Put ~30 ONNX Runtime models in Triton, set `memory_arena_shrinkage`, and keep running them until one model says `SafeIntOnOverflow` or `Failed to allocate memory`. After that, that model will never succeed again unless you restart Triton.
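(A rough client-side sketch of this load pattern, assuming the HTTP endpoint on the default port; the model names, input name, shape, and dtype below are placeholders for whatever the ~30 real models expect.)

```python
# Rough load-generation sketch for the reproduction above: run many ONNX models
# concurrently until one starts failing. Model names, input name/shape/dtype,
# and the server URL are placeholders, not the actual models from this report.
import threading
import numpy as np
import tritonclient.http as httpclient

URL = "localhost:8000"
MODELS = [f"model_{i}" for i in range(30)]          # ~30 ONNX Runtime models
INPUT_NAME, SHAPE, DTYPE = "input", [1, 3, 224, 224], "FP32"

def hammer(model_name: str) -> None:
    client = httpclient.InferenceServerClient(url=URL)
    data = np.random.rand(*SHAPE).astype(np.float32)
    while True:
        inp = httpclient.InferInput(INPUT_NAME, SHAPE, DTYPE)
        inp.set_data_from_numpy(data)
        try:
            client.infer(model_name=model_name, inputs=[inp])
        except Exception as exc:
            # Once SafeIntOnOverflow / "Failed to allocate memory" shows up,
            # the affected model keeps failing until Triton is restarted.
            print(f"{model_name}: {exc}")

threads = [threading.Thread(target=hammer, args=(m,), daemon=True) for m in MODELS]
for t in threads:
    t.start()
for t in threads:
    t.join()
```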
Expected behavior
A clear and concise description of what you expected to happen.
`SafeIntOnOverflow` should not happen; I never saw this error in r22.07. `Failed to allocate memory` should only happen if GRAM is really full, and once it happens, it should not be raised again after other models finish inferencing and arena shrinkage returns GRAM to the system.
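(For context, a hedged sketch of what the arena-shrinkage setting corresponds to in plain ONNX Runtime: the run-config key below is the one ORT documents for shrinking the memory arena after a run; the model path and input shape are placeholders.)

```python
# Hedged sketch of arena shrinkage in plain ONNX Runtime, i.e. roughly what the
# backend's memory_arena_shrinkage parameter requests: after a run, freed arena
# memory is handed back to the system instead of being held by the arena.
# "model.onnx" and the input shape are placeholders.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])

run_options = ort.RunOptions()
# Ask ORT to shrink the GPU memory arena on device 0 once this run completes.
run_options.add_run_config_entry("memory.enable_memory_arena_shrinkage", "gpu:0")

input_name = session.get_inputs()[0].name
dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)  # placeholder shape
outputs = session.run(None, {input_name: dummy}, run_options=run_options)
```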