🐛 [Bug] Difficulties Quantizing FP16 Models to INT8 Using torch_tensorrt (MLP, CNN, Attention, LSTM, Transformer) #3494
Comments
Thanks for filing such a detailed bug report. We haven't had a chance to dig into it yet, but generally speaking, using the DataLoaderCalibrator from the TorchScript frontend is deprecated, since it relies on deprecated TRT APIs. The new workflow uses the TensorRT Model Optimizer toolkit: https://pytorch.org/TensorRT/tutorials/_rendered_examples/dynamo/vgg16_ptq.html
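For reference, here is a minimal sketch of that Model Optimizer workflow, adapted from the linked vgg16_ptq tutorial; `model`, `calib_loader`, and the input shape are placeholders for your own model and calibration data:

```python
# Sketch of the modelopt PTQ + torch_tensorrt dynamo workflow (placeholders:
# model, calib_loader, and the example input shape).
import torch
import torch_tensorrt
import modelopt.torch.quantization as mtq
from modelopt.torch.quantization.utils import export_torch_mode

def calibrate_loop(model):
    # Feed a few representative batches so modelopt can collect
    # activation statistics for the INT8 scales.
    with torch.no_grad():
        for batch, _ in calib_loader:
            model(batch.cuda())

# Insert fake-quant nodes and calibrate them.
quant_cfg = mtq.INT8_DEFAULT_CFG
mtq.quantize(model, quant_cfg, forward_loop=calibrate_loop)

# Export and compile the calibrated model with the dynamo frontend.
example_input = torch.randn(8, 3, 224, 224).cuda()
with torch.no_grad(), export_torch_mode():
    exp_program = torch.export.export(model, (example_input,))
    trt_model = torch_tensorrt.dynamo.compile(
        exp_program,
        inputs=[example_input],
        enabled_precisions={torch.int8},
        min_block_size=1,
    )
```

The key difference from the TorchScript-frontend calibrator flow is that calibration happens inside modelopt before export, so torch_tensorrt only needs to lower the already fake-quantized graph.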
Hi @narendasan, thank you for your reply! Recently, I've done some further exploration and gained a better understanding of the issue I previously raised, so I'd like to update this issue here (and also clarify some of my earlier misunderstandings about torch_tensorrt).

First, regarding the model optimization workflow you mentioned using modelopt: I did follow the documentation and tried it out. However, in that workflow, after optimizing the model with modelopt, it still needs to be compiled with torch_tensorrt, and during compilation I ran into the same error described in my original post. In short, simply replacing the original calibration process with modelopt does not resolve all of the errors I'm facing; I suspect some of them are actually caused by torch_tensorrt.compile itself.

Next, I want to clarify a mistake I made in my original post. Specifically, in the error log I shared, some of the errors are:
After manually performing model quantization with the TensorRT Python API, I found that this error can partly be attributed to a mismatch between the batch size of the calibration dataset and the opt_shape profile used when building the engine. For example, in the demo code I set the calibration dataset shape to (1, 1024, 512), while the model's opt_shape was (256, 1024, 512). I believe this is an important detail that should be stated explicitly in the documentation, because, to be honest, the error message is quite ambiguous.

Lastly, in case other users run into similar issues, I'd like to share a workaround that has worked for me. As mentioned above, I now use the TensorRT Python API directly to perform model quantization. The full workflow is: export the model to ONNX with PyTorch, load the ONNX model with TensorRT, perform calibration and quantization, and finally build the inference engine. With this approach, I was able to successfully quantize all five of the toy models shown in the sample code to INT8 while maintaining very good accuracy. One potential issue is that ONNX serialization has a 2 GB size limit, so larger models may need a different approach.
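For completeness, here is a rough sketch of that ONNX + TensorRT Python API workaround. The input name, the shapes, `model`, `example_input`, and `calib_batches` (an iterable of NumPy float32 arrays with the calibration batch size) are placeholders; and, as pointed out in the next comment, the IInt8EntropyCalibrator2 path is itself deprecated in recent TensorRT releases:

```python
# Sketch: export to ONNX, then calibrate and build an INT8 engine directly
# with the TensorRT Python API. Names and shapes are illustrative.
import numpy as np
import tensorrt as trt
import torch

# 1) Export the trained model to ONNX with a dynamic batch dimension.
torch.onnx.export(
    model, example_input, "model.onnx",
    input_names=["input"], output_names=["output"],
    dynamic_axes={"input": {0: "batch"}},
)

class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, batches, cache_file="calib.cache"):
        super().__init__()
        self.batches = iter(batches)
        self.cache_file = cache_file
        self.device_input = None

    def get_batch_size(self):
        # For explicit-batch networks TensorRT ignores this value; the actual
        # calibration batches must match the calibration profile's opt shape.
        return 256

    def get_batch(self, names):
        try:
            batch = next(self.batches)
        except StopIteration:
            return None  # signals that calibration data is exhausted
        # Keep a reference so the device buffer stays alive during calibration.
        self.device_input = torch.as_tensor(batch).float().cuda()
        return [int(self.device_input.data_ptr())]

    def read_calibration_cache(self):
        return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)

# 2) Parse the ONNX model and build an INT8 engine.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    assert parser.parse(f.read()), parser.get_error(0)

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)
config.set_flag(trt.BuilderFlag.FP16)
config.int8_calibrator = EntropyCalibrator(calib_batches)

profile = builder.create_optimization_profile()
profile.set_shape("input", (1, 1024, 512), (256, 1024, 512), (256, 1024, 512))
config.add_optimization_profile(profile)
# Calibration runs against this profile, so the calibration batches must
# match its opt shape (this was the source of the batch-size mismatch above).
config.set_calibration_profile(profile)

engine = builder.build_serialized_network(network, config)
with open("model_int8.engine", "wb") as f:
    f.write(engine)
```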
@NisFu-gh Also, I looked at your example; it still uses the calibrator, which is already deprecated by TensorRT.
Bug Description
I am trying to quantize already trained FP16 models to INT8 precision using torch_tensorrt and accelerate inference with TensorRT engines. However, during this process I encountered several different issues, either inside torch_tensorrt or TensorRT itself (I am not entirely sure which).
In most cases, the models fail to pass the quantize and/or compile process.
To Reproduce
Here is the minimal reproducible code:
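(The original attachment is not reproduced here. As a stand-in, the sketch below illustrates the kind of TorchScript-frontend PTQ call discussed in this thread, with `model`, `calib_loader`, and the shapes as placeholders taken from the comments above.)

```python
# Illustrative sketch only, not the original reproducer: INT8 PTQ via the
# TorchScript frontend with a DataLoaderCalibrator.
import torch
import torch_tensorrt
import torch_tensorrt.ptq

calibrator = torch_tensorrt.ptq.DataLoaderCalibrator(
    calib_loader,                      # DataLoader of calibration samples
    cache_file="./calibration.cache",
    use_cache=False,
    algo_type=torch_tensorrt.ptq.CalibrationAlgo.ENTROPY_CALIBRATION_2,
    device=torch.device("cuda:0"),
)

trt_mod = torch_tensorrt.compile(
    model,                             # trained FP16 model (nn.Module)
    ir="ts",                           # TorchScript frontend
    inputs=[
        torch_tensorrt.Input(
            min_shape=(1, 1024, 512),
            opt_shape=(256, 1024, 512),
            max_shape=(256, 1024, 512),
            dtype=torch.half,
        )
    ],
    enabled_precisions={torch.half, torch.int8},
    calibrator=calibrator,
)
```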
Expected Behavior
Successfully compile the FP16 models into INT8 TensorRT engines while maintaining reasonable inference accuracy and performance.
Actual Behavior
In most cases, the compilation fails or the resulting models cannot run correctly. Below is a summary of the results I tested:
The corresponding error log is attached as error_log.txt (due to the length limit, I had to upload it as a file).
Environment
How you installed PyTorch (conda, pip, libtorch, source): pip
Questions
Am I using torch_tensorrt incorrectly?
Are there any important documentation notes or best practices regarding compilation and quantization that I might have missed?
What is the correct way (or official recommendation) to do this task, specifically: given an FP16 model, build an INT8 quantized version and run inference with the TensorRT backend?
Any help would be greatly appreciated! Thank you in advance!
Additional context
N/A