CPU inference is much slower than with ONNX Runtime directly #34
Comments
Will investigate this and report back on my findings.
Hi @artmatsak,
Previously, the Triton ORT backend always set the number of threads used to parallelize execution to 1; this has been fixed recently in #67, and the fix is included in the recent 21.09 release. As you can see there, removing the thread=1 setting brings Triton's CPU performance at least close to standalone ORT on CPU. However, you hit this issue with Triton 21.02, when the Triton ORT backend was still using OpenMP and the thread number had no effect, so things may be quite different now (it no longer uses OpenMP). Could you please try whether the latest 21.09 resolves the CPU performance issue with your BERT model? Let's confirm whether this issue has been resolved in the latest build. Thank you!
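For reference, here is a minimal sketch of the relevant part of a model config on 21.09 or later. It assumes the thread-count parameters exposed by the onnxruntime_backend; the model name and values are placeholders you would adjust for your own deployment.

```
# config.pbtxt (sketch): let ONNX Runtime choose its own intra-op parallelism
# instead of pinning execution to a single thread.
name: "electra_onnx"            # placeholder model name
platform: "onnxruntime_onnx"
instance_group [ { kind: KIND_CPU } ]

# A value of "0" lets ONNX Runtime pick the thread count from the available cores.
parameters { key: "intra_op_thread_count" value: { string_value: "0" } }
parameters { key: "inter_op_thread_count" value: { string_value: "0" } }
```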
I use Triton ORT 21.09-py3 and I have the same problem. When I run perf, CPU usage is about 100%, but QPS does not improve as much as expected.
@GuoGuiRong Are you comparing ORT with Triton-ORT? Can you add more details?
Hi, I'm getting the same issue: running ORT directly is about 3x faster. I have tried the various optimisation parameters suggested in the backend repo, but these seem to make the performance worse.
Is there any update on this issue?
Thanks for sharing your experiences. In your table, there is a 2 ms difference between the model's GPU inference time on Triton and on ORT directly. Does anybody know why this difference exists?
Hi, I found that this still exists on Triton 22.10.
Maybe related to #265 (comment)?
Can anyone help with this issue?
Description
Our Electra-based model takes about 540 ms per inference on CPU with ONNX Runtime (via the mcr.microsoft.com/azureml/onnxruntime:v1.4.0 container). The same model run through Triton r21.02 takes 1000+ ms on average. We've also tried Triton r20.09 with the same result.
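For reference, a minimal sketch of how such a per-inference CPU measurement might look with ONNX Runtime directly; the model path, input names, and shapes below are placeholders, not our actual model's.

```python
# Sketch: time CPU inference with ONNX Runtime directly (CPU-only build).
# "model.onnx", the input names, and the sequence length are placeholders.
import time
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx")  # CPU execution provider in a CPU-only build

feed = {
    "input_ids": np.ones((1, 128), dtype=np.int64),
    "attention_mask": np.ones((1, 128), dtype=np.int64),
}

for _ in range(5):  # warm-up runs
    sess.run(None, feed)

n = 50
start = time.perf_counter()
for _ in range(n):
    sess.run(None, feed)
print(f"avg latency: {(time.perf_counter() - start) / n * 1000:.1f} ms")
```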
Triton Information
21.02
Are you using the Triton container or did you build it yourself?
Container, nvcr.io/nvidia/tritonserver:21.02-py3 and nvcr.io/nvidia/tritonserver:20.09-py3.
To Reproduce
I cannot share the full model, but it's a PyTorch Transformer-based model exported from HuggingFace to ONNX.
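As a rough illustration of that export path (not the exact script used for this model), here is a sketch with torch.onnx.export; the checkpoint name, sequence length, and opset version are placeholders.

```python
# Sketch: export a HuggingFace Transformer to ONNX with torch.onnx.export.
# The checkpoint, input shape, and opset below are placeholders, not the
# values used for the model in this issue.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

ckpt = "google/electra-base-discriminator"  # placeholder checkpoint
model = AutoModelForSequenceClassification.from_pretrained(ckpt).eval()
model.config.return_dict = False  # export plain tuples instead of ModelOutput objects
tok = AutoTokenizer.from_pretrained(ckpt)

dummy = tok("example input", return_tensors="pt",
            padding="max_length", max_length=128)

torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch"}, "attention_mask": {0: "batch"}},
    opset_version=12,
)
```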
Expected behavior
The inference time on CPU in Triton should be about the same as in ONNX Runtime directly.