CPU inference is much slower than with ONNX Runtime directly #34
Comments
Will investigate this and report back on my findings.
Hi @artmatsak,
Previously, the Triton ORT backend always set the number of threads used to parallelize execution to 1; this has been fixed recently in #67, and the fix is included in the recent 21.09 release. As you can see there, removing the thread=1 setting brings Triton's CPU performance at least close to standalone ORT on CPU. However, you hit this issue with Triton 21.02, when the Triton ORT backend was still using OpenMP and the thread number had no effect, so things may be quite different now (it no longer uses OpenMP). Could you please try whether the latest 21.09 resolves the CPU performance issue with your BERT model? Let's confirm whether this issue has been resolved in the latest build. Thank you!
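For reference, here is a minimal sketch of the relevant part of a model config on 21.09 or later. It assumes the thread-count parameters exposed by the onnxruntime_backend; the model name and values are placeholders you would adjust for your own deployment.

```
# config.pbtxt (sketch): let ONNX Runtime choose its own intra-op parallelism
# instead of pinning execution to a single thread.
name: "electra_onnx"            # placeholder model name
platform: "onnxruntime_onnx"
instance_group [ { kind: KIND_CPU } ]

# A value of "0" lets ONNX Runtime pick the thread count from the available cores.
parameters { key: "intra_op_thread_count" value: { string_value: "0" } }
parameters { key: "inter_op_thread_count" value: { string_value: "0" } }
```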
I use Triton ORT 21.09-py3 and I have the same problem. When I run perf, CPU usage is about 100%, but QPS does not improve as much as expected.
@GuoGuiRong Are you comparing ORT with Triton-ORT? Can you add more details?
Hi, I'm getting the same issue: running ORT directly is about 3x faster. I have tried the various optimisation parameters suggested in the backend repo, but these seem to make the performance worse.
Is there any update on this issue?
Thanks for sharing your experiences. In your table, there is a 2 ms difference between the model's GPU inference time on Triton and on ORT directly. Does anybody know why this difference exists?
Hi, I found that this still exists on Triton 22.10.
Maybe related to #265 (comment)?
Can anyone help with this issue?
Description
Our Electra-based model takes about 540 ms per inference on CPU with ONNX Runtime (via the mcr.microsoft.com/azureml/onnxruntime:v1.4.0 container). The same model run through Triton r21.02 takes 1000+ ms on average. We've also tried Triton r20.09 with the same result.
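For reference, a minimal sketch of how such a per-inference CPU measurement might look with ONNX Runtime directly; the model path, input names, and shapes below are placeholders, not our actual model's.

```python
# Sketch: time CPU inference with ONNX Runtime directly (CPU-only build).
# "model.onnx", the input names, and the sequence length are placeholders.
import time
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx")  # CPU execution provider in a CPU-only build

feed = {
    "input_ids": np.ones((1, 128), dtype=np.int64),
    "attention_mask": np.ones((1, 128), dtype=np.int64),
}

for _ in range(5):  # warm-up runs
    sess.run(None, feed)

n = 50
start = time.perf_counter()
for _ in range(n):
    sess.run(None, feed)
print(f"avg latency: {(time.perf_counter() - start) / n * 1000:.1f} ms")
```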
Triton Information
21.02
Are you using the Triton container or did you build it yourself?
Container, nvcr.io/nvidia/tritonserver:21.02-py3 and nvcr.io/nvidia/tritonserver:20.09-py3.
To Reproduce
I cannot share the full model, but it's a PyTorch Transformer-based model exported from HuggingFace to ONNX.
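As a rough illustration of that export path (not the exact script used for this model), here is a sketch with torch.onnx.export; the checkpoint name, sequence length, and opset version are placeholders.

```python
# Sketch: export a HuggingFace Transformer to ONNX with torch.onnx.export.
# The checkpoint, input shape, and opset below are placeholders, not the
# values used for the model in this issue.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

ckpt = "google/electra-base-discriminator"  # placeholder checkpoint
model = AutoModelForSequenceClassification.from_pretrained(ckpt).eval()
model.config.return_dict = False  # export plain tuples instead of ModelOutput objects
tok = AutoTokenizer.from_pretrained(ckpt)

dummy = tok("example input", return_tensors="pt",
            padding="max_length", max_length=128)

torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch"}, "attention_mask": {0: "batch"}},
    opset_version=12,
)
```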
Expected behavior
The inference time on CPU in Triton should be about the same as in ONNX Runtime directly.