Description
Our Electra-based model takes about 540 ms per inference on CPU with ONNX Runtime (via the mcr.microsoft.com/azureml/onnxruntime:v1.4.0 container). The same model served through Triton r21.02 takes 1000+ ms on average. We have also tried Triton r20.09 with the same result.
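The baseline number comes from timing repeated `session.run` calls against the exported model. A minimal sketch of that measurement follows; the tensor names, batch size, and sequence length here are illustrative assumptions for a typical Electra export, not the real model's values:

```python
# Minimal sketch of the ONNX Runtime CPU baseline measurement.
# Input names/shapes are assumptions for a typical Electra export.
import time
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx")  # CPU-only container, default provider

batch, seq_len = 1, 128  # placeholder shape, not the production one
feed = {
    "input_ids": np.ones((batch, seq_len), dtype=np.int64),
    "attention_mask": np.ones((batch, seq_len), dtype=np.int64),
    "token_type_ids": np.zeros((batch, seq_len), dtype=np.int64),
}

# Warm up, then average over repeated runs.
for _ in range(5):
    sess.run(None, feed)

n = 50
start = time.perf_counter()
for _ in range(n):
    sess.run(None, feed)
print(f"avg latency: {(time.perf_counter() - start) / n * 1000:.1f} ms")
```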
Triton Information
21.02
Are you using the Triton container or did you build it yourself?
Container, nvcr.io/nvidia/tritonserver:21.02-py3 and nvcr.io/nvidia/tritonserver:20.09-py3.
To Reproduce
I cannot share the full model, but it is a Transformer-based PyTorch model exported from HuggingFace to ONNX; a hedged sketch of the export path is shown below.
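The export was done roughly along these lines (a sketch only; the checkpoint name, opset, and output names are placeholders, since the real model cannot be shared):

```python
# Hedged sketch of the HuggingFace -> ONNX export; names/values are placeholders.
import torch
from transformers import ElectraModel, ElectraTokenizer

name = "google/electra-base-discriminator"  # placeholder checkpoint
tokenizer = ElectraTokenizer.from_pretrained(name)
model = ElectraModel.from_pretrained(name)
model.eval()

enc = tokenizer("example input", return_tensors="pt")

torch.onnx.export(
    model,
    (enc["input_ids"], enc["attention_mask"], enc["token_type_ids"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask", "token_type_ids"],
    output_names=["last_hidden_state"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "token_type_ids": {0: "batch", 1: "sequence"},
        "last_hidden_state": {0: "batch", 1: "sequence"},
    },
    opset_version=12,  # placeholder opset
)
```

The exported model.onnx is then placed in a Triton model repository with the onnxruntime_onnx platform and served on CPU.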
Expected behavior
The inference time on CPU in Triton should be about the same as in ONNX Runtime directly.
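For comparison, the Triton-side latency is measured with the Python HTTP client roughly as follows (a sketch; the model name and tensor names are assumptions matching the export sketch above):

```python
# Sketch of timing the same request against Triton over HTTP.
# Model name and tensor names are assumptions, not the real deployment's values.
import time
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

batch, seq_len = 1, 128  # same placeholder shape as the ORT baseline
inputs = []
for name in ("input_ids", "attention_mask", "token_type_ids"):
    t = httpclient.InferInput(name, [batch, seq_len], "INT64")
    t.set_data_from_numpy(np.ones((batch, seq_len), dtype=np.int64))
    inputs.append(t)

# Warm up, then average over repeated requests.
for _ in range(5):
    client.infer("electra_onnx", inputs)

n = 50
start = time.perf_counter()
for _ in range(n):
    client.infer("electra_onnx", inputs)
print(f"avg latency: {(time.perf_counter() - start) / n * 1000:.1f} ms")
```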