Describe the issue
I have an in-house model in safetensors format. I used the following command to convert it to ONNX format, quantize it, and optimize it:
olive auto-opt --model_name_or_path . --output_path . --device gpu --provider CUDAExecutionProvider --precision int4 --use_model_builder --log_level 1
The model misbehaves when used in the prefill phase or for speculative decoding. As a reminder, in both cases the model is fed a number of tokens and processes them all in parallel in one iteration. The time for a model to process N tokens in parallel is expected to be roughly the same for any (small) N; that is the principle on which the speed-up from speculative decoding is based.
However, with the model produced by the above command, the inference time depends strongly on N. Here are typical figures:
N = 1, time = 29ms
N = 2, time = 100ms
N = 3, time = 195ms
N = 4, time = 195ms
N = 5, time = 195ms
The time grows with N and then "saturates" at N = 3. This behavior makes the speculative decoding optimization impractical.
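For reference, this kind of measurement can be sketched with the ONNX Runtime Python API as below (the actual measurements were taken through the C++ API). This is a minimal sketch, assuming a decoder exported by the model builder with inputs named input_ids, attention_mask, position_ids, and past_key_values.* — these names, shapes, and the empty-KV-cache handling are assumptions that should be checked against sess.get_inputs() for the real graph:

```python
# Minimal latency-vs-N sketch (Python shown for brevity; the original
# measurement used the C++ API). Input names below are assumptions based on
# typical model-builder exports -- verify with sess.get_inputs().
import time
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
input_names = {i.name for i in sess.get_inputs()}

def empty_past_feeds():
    # First-iteration run: give every past-KV input a zero-length sequence
    # dimension (assumed to be the symbolic dim whose name contains "sequence").
    feeds = {}
    for inp in sess.get_inputs():
        if not inp.name.startswith("past"):
            continue
        dims = [d if isinstance(d, int) else (0 if "sequence" in str(d) else 1)
                for d in inp.shape]
        dtype = np.float16 if "float16" in inp.type else np.float32
        feeds[inp.name] = np.zeros(dims, dtype=dtype)
    return feeds

def time_n_tokens(n, repeats=20):
    # Feed n tokens in a single forward pass, as prefill or the verification
    # step of speculative decoding would.
    feeds = empty_past_feeds()
    feeds["input_ids"] = np.ones((1, n), dtype=np.int64)
    feeds["attention_mask"] = np.ones((1, n), dtype=np.int64)
    if "position_ids" in input_names:
        feeds["position_ids"] = np.arange(n, dtype=np.int64).reshape(1, n)
    sess.run(None, feeds)                      # warm-up
    start = time.perf_counter()
    for _ in range(repeats):
        sess.run(None, feeds)
    return (time.perf_counter() - start) / repeats * 1000.0  # ms

for n in range(1, 6):
    print(f"N = {n}: {time_n_tokens(n):.1f} ms")
```

With a correctly behaving model, the printed times stay essentially flat over this range of N.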
I believe there is a bug in Olive. Evidence for this claim:
- the same model, when converted to ONNX without quantization, behaves as expected, i.e. the time to process N tokens in parallel is constant and does not depend on N
- the ONNX models from Hugging Face behave as expected too.
Is there anything else that I am missing?
To reproduce
The conversion command:
olive auto-opt --model_name_or_path . --output_path . --device gpu --provider CUDAExecutionProvider --precision int4 --use_model_builder --log_level 1
I am not allowed to provide the model.
Urgency
No response
Platform
Linux
OS Version
Linux + CUDA
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
1.20.0
ONNX Runtime API
C++
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
CUDA 12.2