
CPU inference is much slower than with ONNX Runtime directly #34


Open
artmatsak opened this issue Mar 19, 2021 · 10 comments

artmatsak commented Mar 19, 2021

Description
Our Electra-based model takes about 540 ms per inference on CPU with ONNX Runtime (via the mcr.microsoft.com/azureml/onnxruntime:v1.4.0 container). The same model served through Triton r21.02 takes 1000+ ms on average. We've also tried Triton r20.09, with the same result.

Triton Information
21.02

Are you using the Triton container or did you build it yourself?
Container, nvcr.io/nvidia/tritonserver:21.02-py3 and nvcr.io/nvidia/tritonserver:20.09-py3.

To Reproduce

I cannot share the full model, but it is a PyTorch Transformer-based model exported from HuggingFace to ONNX.

Expected behavior
The inference time on CPU in Triton should be about the same as with ONNX Runtime directly.
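
(For reference, a minimal sketch of the kind of standalone ONNX Runtime CPU baseline being compared against; the model path and input shapes are placeholders, not the actual unshared model, and a reasonably recent onnxruntime build is assumed.)

```python
import time
import numpy as np
import onnxruntime as ort

# Placeholder model path and input shapes -- the actual model is not shared.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

inputs = {
    "input_ids": np.ones((1, 128), dtype=np.int64),
    "attention_mask": np.ones((1, 128), dtype=np.int64),
}

# Warm up, then average over repeated runs to get a per-inference latency.
for _ in range(5):
    session.run(None, inputs)

start = time.perf_counter()
n = 50
for _ in range(n):
    session.run(None, inputs)
print(f"{(time.perf_counter() - start) / n * 1000:.1f} ms per inference")
```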


askhade commented Sep 13, 2021

Will investigate this and report back on my findings

askhade self-assigned this Sep 13, 2021

jcwchen commented Oct 4, 2021

Hi @artmatsak,
Here are my measurements for the CPU performance issue with a particular BERT model (bert-squad):

| Approach | Latency (ms) |
| --- | --- |
| Standalone ORT perf_test, GPU | 15.12 |
| Triton r21.08, GPU | 13.7 |
| Standalone ORT perf_test, CPU | 223.168 |
| Triton r21.08, CPU | 666 |
| Triton r21.08, CPU (thread=1 removed) | 227 |

Previously the Triton ORT backend always set the number of threads used to parallelize execution to 1. This has been fixed in #67, and the fix is included in the recent 21.09 release. As the table shows, removing thread=1 brings Triton's CPU performance at least close to standalone ORT on CPU.

However, you hit this issue with Triton 21.02, when the Triton ORT backend still used OpenMP and the thread number setting did not take effect. Things may be quite different now that OpenMP is no longer used. Could you please try whether the latest 21.09 release resolves the CPU performance issue with your model? Let's confirm whether this has been fixed in the latest build. Thank you!
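
(To make the thread pinning concrete: in standalone ONNX Runtime the equivalent knob is SessionOptions.intra_op_num_threads. A minimal sketch with a placeholder model path, roughly reproducing what the pre-#67 backend did:)

```python
import onnxruntime as ort

# Default: ORT sizes its intra-op thread pool from the available cores.
sess_default = ort.InferenceSession("model.onnx",
                                    providers=["CPUExecutionProvider"])

# Roughly what the Triton ORT backend did before the #67 fix: pin
# intra-op parallelism to a single thread, serializing all CPU kernels.
opts = ort.SessionOptions()
opts.intra_op_num_threads = 1
sess_pinned = ort.InferenceSession("model.onnx", sess_options=opts,
                                   providers=["CPUExecutionProvider"])
```

In 21.09+ builds the backend exposes these pool sizes through the intra_op_thread_count and inter_op_thread_count model config parameters, so they no longer have to stay pinned at 1.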


johnsGuo commented Nov 9, 2021

I am using Triton ORT 21.09-py3 and have the same problem. When I run perf tests, the CPU is at about 100%, but QPS does not improve as expected.


askhade commented Dec 1, 2021

@GuoGuiRong Are you comparing ORT with Triton-ORT? Can you add more details regarding:

  1. the ORT and Triton-ORT configs used during testing
  2. the perf diff you are seeing

askhade added the more-info-needed (Waiting for more information) label Dec 1, 2021
@bezdomniy

Hi, I'm seeing the same issue: running ORT directly is about 3x faster.
I am using the HuggingFace transformers.onnx library to convert the model to ONNX, and I run it with the onnxruntime Python client library.

My Triton model config is:

```
name: "paraphrase-MiniLM-L6-v2"
platform: "onnxruntime_onnx"
max_batch_size: 0

input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [ -1, -1 ]
  },
  {
    name: "token_type_ids"
    data_type: TYPE_INT64
    dims: [ -1, -1 ]
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT64
    dims: [ -1, -1 ]
  }
]

output {
  name: "last_hidden_state"
  data_type: TYPE_FP32
  dims: [ -1, -1, -1 ]
}
```

I have tried the various optimisation parameters suggested in the backend repo, but these seem to make the performance worse.
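
(For an apples-to-apples number against standalone ORT, a minimal sketch of timing this model through Triton's HTTP client; the server URL and the dummy token IDs are placeholders, not real tokenizer output.)

```python
import time
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Dummy inputs matching the config above; real requests would carry
# tokenizer output instead of ones/zeros.
ids = np.ones((1, 128), dtype=np.int64)
feeds = {
    "input_ids": ids,
    "token_type_ids": np.zeros_like(ids),
    "attention_mask": np.ones_like(ids),
}

inputs = []
for name, arr in feeds.items():
    inp = httpclient.InferInput(name, list(arr.shape), "INT64")
    inp.set_data_from_numpy(arr)
    inputs.append(inp)

# Warm up, then average repeated requests, mirroring the standalone ORT
# measurement so the two latencies are directly comparable.
client.infer("paraphrase-MiniLM-L6-v2", inputs)
start = time.perf_counter()
n = 50
for _ in range(n):
    client.infer("paraphrase-MiniLM-L6-v2", inputs)
print(f"{(time.perf_counter() - start) / n * 1000:.1f} ms per request")
```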

@farzanehnakhaee70

Is there any update on this issue?

@farzanehnakhaee70

> (quoting @jcwchen's bert-squad measurements and the suggestion to retry with 21.09, above)

Thanks for sharing your experiences. In your table, there is about a 2 ms difference between the GPU inference times for Triton and for ORT directly. Does anybody know why this difference exists?

@hanswang1

Hi, I found that this issue still exists in Triton 22.10.
Does anyone have a solution or workaround?


Mitix-EPI commented Aug 19, 2024

Maybe related #265 (comment) ?

@hanswang1

> Maybe related #265 (comment) ?

Can anyone help with this issue?
