Is there a way of inferring on models with int8 MoQ? #2412
Unanswered
EdouardVilain-Git asked this question in Q&A
Hi all, thanks for the awesome work.
I am working with DeepSpeed to apply MoQ to BERT-like transformer architectures (XLM-RoBERTa, to be precise). I have been able to train the model on 2 GPUs with the following configuration:
{ "train_batch_size": 32, "steps_per_print": 50, "gradient_accumulation_steps": 1, "zero_optimization": { "stage": 0 }, "fp16": { "enabled": true }, "compression_training": { "weight_quantization": { "shared_parameters": { "enabled": true, "quantizer_kernel": false, "schedule_offset": 0, "quantize_groups": 64, "quantize_verbose": true, "quantization_type": "symmetric", "quantize_weight_in_forward": false, "rounding": "nearest", "fp16_mixed_quantize": { "enabled": false, "quantize_change_ratio": 0.1 } }, "different_groups": { "wq1": { "params": { "start_bits": 16, "target_bits": 8, "quantization_period": 350 }, "modules": ["attention.self", "intermediate", "word_embeddings", "output.dense", "pooler.dense", "category_embeddings"] } } }, "activation_quantization": { "shared_parameters": { "enabled": true, "quantization_type": "symmetric", "range_calibration": "dynamic", "schedule_offset": 0 }, "different_groups": { "aq1": { "params": { "bits": 8 }, "modules": ["attention.self", "intermediate", "output.dense"] } } } } }I am now trying to infer using:
```python
engine = deepspeed.init_inference(
    deepspeed_trainer.model,
    dtype=torch.int8,
    quantization_setting=(False, 64),
    replace_with_kernel_inject=True
)
```

Several issues have already been opened about the error this causes (#2301), but it seems that int8 inference is not yet supported by DeepSpeed because the int8 kernels have not been released.
Nevertheless, I would like to find a way to use this quantized model for inference. I have thought about running on CPU, but that would require bypassing the DeepSpeed InferenceEngine. Is there any way of loading the quantized model into a plain PyTorch model to enable int8 CPU inference?
I guess a simple way of doing so would be to run post-training dynamic quantization on the already-quantized weights using Torch's quantization module, but that is far from elegant. Hoping to find something a bit better!
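For reference, here is a minimal sketch of that workaround, assuming `deepspeed_trainer.model` is the DeepSpeedEngine wrapping the XLM-RoBERTa module and `inputs` is a placeholder for a tokenized batch. It simply re-quantizes the fp16 weights with Torch's dynamic quantization and does not reuse the scales learned by MoQ, which is exactly why it feels inelegant:

```python
import torch

# Assumption: deepspeed_trainer.model is a DeepSpeedEngine; its .module attribute
# holds the underlying nn.Module that was trained with MoQ.
base_model = deepspeed_trainer.model
if hasattr(base_model, "module"):
    base_model = base_model.module

# Cast back to fp32 on CPU, since torch dynamic quantization expects float CPU modules.
base_model = base_model.float().cpu().eval()

# Post-training dynamic quantization of the Linear layers to int8.
quantized_model = torch.quantization.quantize_dynamic(
    base_model,
    {torch.nn.Linear},
    dtype=torch.qint8,
)

# int8 CPU inference ("inputs" is a placeholder, e.g. tokenizer output).
with torch.no_grad():
    outputs = quantized_model(**inputs)
```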