
olive: a weird behavior of a model converted to ONNX format #25600

@idruker-cerence

Description

Describe the issue

I have an in-house model in safetensors format. I used the following command to convert it to ONNX format, quantize it, and optimize it:

olive auto-opt --model_name_or_path . --output_path . --device gpu --provider CUDAExecutionProvider --precision int4 --use_model_builder --log_level 1

The model misbehaves when used for prefill or for speculative decoding. As a reminder, in both cases the model is fed a number of tokens and processes them all in parallel in one iteration. The time to process N tokens in parallel is expected to be roughly the same for any (small) N; that is the principle on which the speed-up from speculative decoding is based.

However, with the model produced by the above command, the inference time depends strongly on N. Here is an example with typical figures:

N = 1, time = 29ms
N = 2, time = 100ms
N = 3, time = 195ms
N = 4, time = 195ms
N = 5, time = 195ms

The time grows with N and then "saturates" at N = 3. This behavior makes the speculative decoding optimization impractical.
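For reference, the measurement above can be reproduced with a small timing harness like the one below. This is a sketch: `run_model` is a hypothetical stand-in for the real ONNX Runtime prefill call (which I cannot share); here it just sleeps in proportion to min(N, 3) to mimic the reported behavior.

```python
import time

def run_model(tokens):
    # Hypothetical stand-in for the real ONNX Runtime session call.
    # Replace this with the actual prefill invocation. The sleep
    # mimics the reported behavior: time grows with N, then saturates.
    time.sleep(0.001 * min(len(tokens), 3))

def time_prefill(n_tokens, repeats=5):
    """Best-of-`repeats` wall-clock time (seconds) to process
    `n_tokens` tokens in one parallel iteration."""
    tokens = list(range(n_tokens))
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run_model(tokens)
        best = min(best, time.perf_counter() - start)
    return best

timings = {n: time_prefill(n) for n in range(1, 6)}
for n, t in timings.items():
    print(f"N = {n}, time = {t * 1000:.1f}ms")
```

With a correctly quantized model, the printed times should be nearly flat across N; with the model produced by the command above, they grow until N = 3.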

I believe there is a bug in Olive. Evidence for this claim:

  • the same model, when converted to ONNX without quantization, behaves as expected: the time to process N tokens in parallel is constant and does not depend on N
  • the ONNX models from Hugging Face behave as expected too.

Is there anything I am missing?

To reproduce

The conversion command:

olive auto-opt --model_name_or_path . --output_path . --device gpu --provider CUDAExecutionProvider --precision int4 --use_model_builder --log_level 1

I am not allowed to provide the model.

Urgency

No response

Platform

Linux

OS Version

Linux + CUDA

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.20.0

ONNX Runtime API

C++

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 12.2
