Description
Our Electra-based model takes about 540 ms per inference on CPU when run with ONNX Runtime directly (via the mcr.microsoft.com/azureml/onnxruntime:v1.4.0 container). The same model served by Triton r21.02 takes 1000+ ms on average. We also tried Triton r20.09 and got the same result.
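For reference, the 540 ms figure was measured roughly as below (a minimal sketch; the model path, input names, vocabulary size, and sequence length are placeholders, not the real values):

```python
import time
import numpy as np
import onnxruntime as ort

# Placeholder model path and input shapes -- the real model cannot be shared.
session = ort.InferenceSession("model.onnx")

feed = {
    "input_ids": np.random.randint(0, 30000, size=(1, 128), dtype=np.int64),
    "attention_mask": np.ones((1, 128), dtype=np.int64),
    "token_type_ids": np.zeros((1, 128), dtype=np.int64),
}

# Warm up, then average the latency over repeated runs.
for _ in range(5):
    session.run(None, feed)

n = 50
start = time.perf_counter()
for _ in range(n):
    session.run(None, feed)
print(f"mean latency: {(time.perf_counter() - start) / n * 1000:.1f} ms")
```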
Triton Information
21.02 (also reproduced with 20.09)
Are you using the Triton container or did you build it yourself?
Container, nvcr.io/nvidia/tritonserver:21.02-py3 and nvcr.io/nvidia/tritonserver:20.09-py3.
To Reproduce
I cannot share the full model, but it is a PyTorch Transformer-based (Electra) model from HuggingFace exported to ONNX; a stand-in export sketch is below.
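The export step looks roughly like this (the checkpoint here is a public Electra model standing in for our private one; the input names, sequence length, and opset version are assumptions):

```python
import torch
from transformers import ElectraForSequenceClassification, ElectraTokenizer

# Hypothetical public checkpoint -- stands in for the real (private) model.
ckpt = "google/electra-base-discriminator"
tokenizer = ElectraTokenizer.from_pretrained(ckpt)
model = ElectraForSequenceClassification.from_pretrained(ckpt).eval()

enc = tokenizer("example input", return_tensors="pt",
                padding="max_length", max_length=128)

# Export with the usual three Transformer inputs; opset 12 is an assumption.
torch.onnx.export(
    model,
    (enc["input_ids"], enc["attention_mask"], enc["token_type_ids"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask", "token_type_ids"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch"},
        "attention_mask": {0: "batch"},
        "token_type_ids": {0: "batch"},
        "logits": {0: "batch"},
    },
    opset_version=12,
)
```

The resulting model.onnx is then placed in a standard Triton model repository layout (`<model-repository>/<model-name>/1/model.onnx` with a `config.pbtxt` using the ONNX Runtime backend) and served on CPU.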
Expected behavior
Inference latency on CPU through Triton should be roughly the same as with ONNX Runtime used directly.
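The Triton-side latency was measured the same way, e.g. with the Python HTTP client (again a sketch; the model name "electra_onnx" and the tensor names/shapes are placeholders):

```python
import time
import numpy as np
import tritonclient.http as httpclient

# "electra_onnx" and the tensor names/shapes are placeholders.
client = httpclient.InferenceServerClient(url="localhost:8000")

inputs = []
for name, arr in [
    ("input_ids", np.random.randint(0, 30000, size=(1, 128), dtype=np.int64)),
    ("attention_mask", np.ones((1, 128), dtype=np.int64)),
    ("token_type_ids", np.zeros((1, 128), dtype=np.int64)),
]:
    inp = httpclient.InferInput(name, list(arr.shape), "INT64")
    inp.set_data_from_numpy(arr)
    inputs.append(inp)

# Warm up, then average the latency over repeated requests.
for _ in range(5):
    client.infer("electra_onnx", inputs)

n = 50
start = time.perf_counter()
for _ in range(n):
    client.infer("electra_onnx", inputs)
print(f"mean latency: {(time.perf_counter() - start) / n * 1000:.1f} ms")
```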