Is there a way of inferring on models with int8 MoQ? #2412
Unanswered
EdouardVilain-Git asked this question in Q&A
Hi all, thanks for the awesome work.
I am working with DeepSpeed to apply MoQ to BERT-like transformer architectures (XLM-RoBERTa, to be precise). I have been able to train the model on 2 GPUs with the following configuration:
{ "train_batch_size": 32, "steps_per_print": 50, "gradient_accumulation_steps": 1, "zero_optimization": { "stage": 0 }, "fp16": { "enabled": true }, "compression_training": { "weight_quantization": { "shared_parameters": { "enabled": true, "quantizer_kernel": false, "schedule_offset": 0, "quantize_groups": 64, "quantize_verbose": true, "quantization_type": "symmetric", "quantize_weight_in_forward": false, "rounding": "nearest", "fp16_mixed_quantize": { "enabled": false, "quantize_change_ratio": 0.1 } }, "different_groups": { "wq1": { "params": { "start_bits": 16, "target_bits": 8, "quantization_period": 350 }, "modules": ["attention.self", "intermediate", "word_embeddings", "output.dense", "pooler.dense", "category_embeddings"] } } }, "activation_quantization": { "shared_parameters": { "enabled": true, "quantization_type": "symmetric", "range_calibration": "dynamic", "schedule_offset": 0 }, "different_groups": { "aq1": { "params": { "bits": 8 }, "modules": ["attention.self", "intermediate", "output.dense"] } } } } }I am now trying to infer using:
```python
engine = deepspeed.init_inference(
    deepspeed_trainer.model,
    dtype=torch.int8,
    quantization_setting=(False, 64),
    replace_with_kernel_inject=True
)
```

Several issues have already been opened about the error this causes (#2301), but it seems that int8 inference is not yet supported by DeepSpeed because the int8 kernels have not been released.
Nevertheless, I would like to find a way to use this quantized model for inference. I have thought about running on CPU, but that would require bypassing the DeepSpeed InferenceEngine. Is there any way of loading the quantized model into a plain PyTorch model to enable int8 CPU inference?
I guess a simple way of doing so would be to run post-training dynamic quantization on the already-quantized weights using Torch's quantization module, but that is far from elegant. Hoping to find something a bit better!
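For reference, here is a minimal sketch of that workaround, assuming `deepspeed_trainer.model` is the DeepSpeedEngine wrapping the XLM-RoBERTa module and `inputs` is a placeholder for a tokenized batch. It simply re-quantizes the fp16 weights with Torch's dynamic quantization and does not reuse the scales learned by MoQ, which is exactly why it feels inelegant:

```python
import torch

# Assumption: deepspeed_trainer.model is a DeepSpeedEngine; its .module attribute
# holds the underlying nn.Module that was trained with MoQ.
base_model = deepspeed_trainer.model
if hasattr(base_model, "module"):
    base_model = base_model.module

# Cast back to fp32 on CPU, since torch dynamic quantization expects float CPU modules.
base_model = base_model.float().cpu().eval()

# Post-training dynamic quantization of the Linear layers to int8.
quantized_model = torch.quantization.quantize_dynamic(
    base_model,
    {torch.nn.Linear},
    dtype=torch.qint8,
)

# int8 CPU inference ("inputs" is a placeholder, e.g. tokenizer output).
with torch.no_grad():
    outputs = quantized_model(**inputs)
```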