How can I quantize the ONNX export of NLLB so that the result is as small as NLLB_cache_initializer.onnx?
I used this command, which generated two kinds of files:
optimum-cli export onnx --model {MODEL_ID} --task 'text2text-generation-with-past' --no-post-process {ONNX_BASE_DIR}
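As a quick sanity check, here is a minimal sketch (assuming optimum[onnxruntime] is installed) that loads the export directory back and lists the files produced; ONNX_BASE_DIR is a placeholder for the output directory passed to optimum-cli above:

# Sanity check: load the export and list the generated component files.
import os
from optimum.onnxruntime import ORTModelForSeq2SeqLM

ONNX_BASE_DIR = "nllb_onnx"  # placeholder; use the same directory as above
model = ORTModelForSeq2SeqLM.from_pretrained(ONNX_BASE_DIR)
print(os.listdir(ONNX_BASE_DIR))  # typically encoder_model.onnx, decoder_model.onnx, decoder_with_past_model.onnx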
I understand this situation. I used the following code to quantize decoder_model.onnx directly, but I could not get a file as small as NLLB_cache_initializer.onnx (the part of the decoder-without-past that generates the encoder-side KV cache, which was split out into NLLB_cache_initializer.onnx).
## Code
# Dynamic int8 quantization of the exported decoder with onnxruntime.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="/Users/zhangboxu/workspace/nllb/nllb_onnx/decoder/decoder_model.onnx",
    model_output="/Users/zhangboxu/workspace/nllb/nllb_onnx/decoder/decoder_model_int8.onnx",
    weight_type=QuantType.QInt8,  # weights stored as signed int8
)
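Since quantizing the whole decoder keeps all of its weights, one option for getting a file comparable to NLLB_cache_initializer.onnx would be to split the cache-initializer subgraph out yourself before quantizing. Below is a rough sketch using onnx.utils.extract_model; the tensor names (encoder_hidden_states, present.*.encoder.key/value) and the layer count are assumptions based on typical optimum export naming, so inspect decoder_model.onnx (e.g., in Netron) and adjust them first.

# Sketch only: extract the subgraph that maps encoder_hidden_states to the
# encoder-side KV cache, then dynamically quantize just that piece.
# Tensor names and num_layers are assumptions; verify against your graph.
import onnx
from onnxruntime.quantization import quantize_dynamic, QuantType

DECODER = "/Users/zhangboxu/workspace/nllb/nllb_onnx/decoder/decoder_model.onnx"
INITIALIZER = "/Users/zhangboxu/workspace/nllb/nllb_onnx/decoder/cache_initializer.onnx"  # hypothetical output path

num_layers = 12  # assumption: decoder layer count of the NLLB variant in use
kv_outputs = [
    f"present.{i}.encoder.{kind}"
    for i in range(num_layers)
    for kind in ("key", "value")
]

# Keep only the nodes between encoder_hidden_states and the encoder KV outputs.
onnx.utils.extract_model(
    DECODER,
    INITIALIZER,
    input_names=["encoder_hidden_states"],
    output_names=kv_outputs,
)

# Quantize just the extracted initializer, which should be far smaller
# than quantizing decoder_model.onnx as a whole.
quantize_dynamic(
    model_input=INITIALIZER,
    model_output=INITIALIZER.replace(".onnx", "_int8.onnx"),
    weight_type=QuantType.QInt8,
)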