CPU inference is much slower than with ONNX Runtime directly #34

Open
@artmatsak

Description

Our Electra-based model takes about 540 ms per inference on CPU with ONNX Runtime directly (via the mcr.microsoft.com/azureml/onnxruntime:v1.4.0 container). The same model served through Triton r21.02 takes 1000+ ms on average. We've also tried Triton r20.09 with the same result.
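
For context, here's a minimal sketch of the kind of direct ONNX Runtime CPU benchmark being compared against. The model path, input names, vocabulary size, batch size, and sequence length below are placeholders, not the actual model or measurement script:

```python
import time

import numpy as np
import onnxruntime as ort

# Placeholder model path; the real model is an Electra-based Transformer
# exported from HuggingFace to ONNX.
session = ort.InferenceSession("model.onnx")

batch_size, seq_len = 1, 128
inputs = {
    "input_ids": np.random.randint(0, 30000, (batch_size, seq_len), dtype=np.int64),
    "attention_mask": np.ones((batch_size, seq_len), dtype=np.int64),
    "token_type_ids": np.zeros((batch_size, seq_len), dtype=np.int64),
}

# Warm up, then average the latency over repeated runs on CPU.
session.run(None, inputs)
n_runs = 20
start = time.perf_counter()
for _ in range(n_runs):
    session.run(None, inputs)
print(f"Average latency: {(time.perf_counter() - start) / n_runs * 1000:.1f} ms")
```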

Triton Information
21.02

Are you using the Triton container or did you build it yourself?
Container: nvcr.io/nvidia/tritonserver:21.02-py3 and nvcr.io/nvidia/tritonserver:20.09-py3.

To Reproduce

I cannot share the full model, but it's a PyTorch Transformer-based model exported from HuggingFace to ONNX.
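
Since the model can't be shared, here's a hypothetical sketch of the export path, using the public google/electra-base-discriminator checkpoint as a stand-in for the real model (the input/output names, opset, and dynamic axes are assumptions):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Stand-in checkpoint; the real model is a fine-tuned Electra variant.
model_name = "google/electra-base-discriminator"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# return_dict=False makes the model return plain tuples, which trace cleanly.
model = AutoModelForSequenceClassification.from_pretrained(model_name, return_dict=False)
model.eval()

dummy = tokenizer("example input", return_tensors="pt")
torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"], dummy["token_type_ids"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask", "token_type_ids"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "token_type_ids": {0: "batch", 1: "sequence"},
        "logits": {0: "batch"},
    },
    opset_version=11,
)
```

The model is then served from a standard Triton model repository. The config.pbtxt below is only a guess at a minimal CPU setup matching the export sketch above (model name, dims, and instance count are assumptions, not the real configuration):

```
name: "electra_onnx"
platform: "onnxruntime_onnx"
max_batch_size: 0
input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [ -1, -1 ]
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT64
    dims: [ -1, -1 ]
  },
  {
    name: "token_type_ids"
    data_type: TYPE_INT64
    dims: [ -1, -1 ]
  }
]
output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  }
]
instance_group [
  {
    kind: KIND_CPU
    count: 1
  }
]
```

With the model placed at models/electra_onnx/1/model.onnx, the server can be started with `tritonserver --model-repository=/models`, and latency compared with something like `perf_analyzer -m electra_onnx` (plus appropriate --shape arguments) from the SDK container.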

Expected behavior
The inference time on CPU in Triton should be about the same as with ONNX Runtime directly.
