Tutorial / seminar on quantization of DL models.
- Quantization is essential for reducing the size and computational requirements of AI models, especially for deployment on edge devices with limited resources.
- Large neural networks require high memory and processing power, which makes them unsuitable for real-time inference on edge devices like smartphones, IoT devices, and embedded systems.
- Quantization involves representing weights and activations of a neural network using lower precision (e.g., 8-bit integers instead of 32-bit floating-point numbers).
- This conversion to integer data types can reduce the memory footprint by up to 4x (e.g., FP32 → INT8).
- It speeds up inference, since integer operations are faster than floating-point operations, reducing compute time.
- Due to quantization, we expect some loss in accuracy as it introduces approximation in the model weights.
- Not all hardware supports all quantization formats (e.g., INT8 vs. FP16).
- There are two ways to map floating-point values to integers (a minimal sketch follows this list):
- Symmetric quantization:
  - The range of floating-point values is mapped symmetrically around zero.
  - The same scale factor is used for both positive and negative values.
  - Simple and efficient; works best if the data is evenly distributed around zero.
  - Attached notebook for reference.
- Asymmetric quantization:
  - The range of floating-point values is mapped using a scale and a zero-point offset, so the mapping need not be centered at zero.
  - More precise; better for real-world data that isn't evenly spread around zero.
  - Attached notebook for reference.
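Below is a minimal NumPy sketch of both schemes; the function names and the 8-bit ranges are illustrative assumptions, not taken from the attached notebooks.

```python
import numpy as np

def symmetric_quantize(x, num_bits=8):
    """One scale for the whole tensor, range centered at zero."""
    qmax = 2 ** (num_bits - 1) - 1            # 127 for INT8
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def asymmetric_quantize(x, num_bits=8):
    """Map [x.min(), x.max()] onto the full unsigned range via scale + zero-point."""
    qmin, qmax = 0, 2 ** num_bits - 1         # 0..255 for UINT8
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

# Skewed data (not centered at zero) favors the asymmetric scheme
x = np.random.randn(1000).astype(np.float32) + 2.0
q_s, s = symmetric_quantize(x)
q_a, s_a, zp = asymmetric_quantize(x)
err_sym = np.abs(q_s.astype(np.float32) * s - x).max()
err_asym = np.abs((q_a.astype(np.float32) - zp) * s_a - x).max()
print(f"max abs error  symmetric: {err_sym:.4f}  asymmetric: {err_asym:.4f}")
```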
To show how a floating-point number is represented in binary, let's convert 85.125 into its 32-bit (IEEE 754 single-precision) representation.
Step 1: Convert 85 to binary: 1010101
Step 2: Convert 0.125 to binary: 001
Step 3: Combine both parts: 85.125 -> 1010101.001
Step 4: Normalize by shifting the . so that only a single 1 remains to its left, and multiply by 2^n, where n is the number of digits shifted. Therefore 85.125 -> 1010101.001 -> 1.010101001 x 2^6
Step 5: Add the bias 127 to the exponent and convert the result to binary: 127 + 6 = 133 = 10000101
Step 6: Assemble the three fields:
- Now we have everything. The first bit of the 32 bits represents the sign of the number; ours is positive, so it is 0.
- The next 8 bits represent the exponent, so we use the binary of 133, which is 10000101.
- The remaining 23 bits hold the digits after the . from Step 4, i.e., 010101001, which is called the mantissa. We pad it with trailing zeros.
Step 7: Therefore 85.125 in 32-bit format is 0 10000101 01010100100000000000000 (sign | exponent | mantissa).
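As a quick sanity check, this bit layout can be reproduced in Python with only the standard library (the helper name float_to_bin32 is just an illustrative choice):

```python
import struct

def float_to_bin32(x: float) -> str:
    # Pack as a big-endian 32-bit float, then reinterpret the bytes as an integer
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = f"{bits:032b}"
    return f"{s[0]} {s[1:9]} {s[9:]}"      # sign | exponent | mantissa

print(float_to_bin32(85.125))
# 0 10000101 01010100100000000000000
```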
Quantization can be achieved by the following primary techniques: post-training quantization (PTQ) and quantization-aware training (QAT).
- In post-training quantization (PTQ), we quantize the model after the actual training is done, i.e., we quantize the pretrained model.
- We make a copy of the actual architecture and add a quantize layer at the start and a dequantize layer at the end.
- We also add observers in between, which collect statistics about the min/max values seen in each layer.
- We run calibration inference on this instrumented architecture and compute the parameters, such as scale and zero-point, that are used to quantize the model.
- Observation: the size of the model is reduced by up to 4x, as FP32 is converted to INT8.
- A speedup is also observed, since computations are generally faster on integers than on floating-point values.
- Some degradation in accuracy is expected due to the lower-precision conversion.
- There are two types of post-training quantization (a minimal dynamic example follows this list):
- a) Static
- b) Dynamic
- More details in tutorial notebooks.
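Here is a minimal sketch of the dynamic variant, assuming a recent PyTorch where torch.ao.quantization.quantize_dynamic is available; the toy model is purely illustrative:

```python
import torch
import torch.nn as nn

# A toy two-layer network standing in for a pretrained model (illustrative only)
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()

# Dynamic PTQ: Linear weights are converted to INT8 up front;
# activations are quantized on the fly at inference time.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(quantized(x).shape)   # torch.Size([1, 10])
```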
- In quantization-aware training (QAT), we train the model with quantization simulated in the loop, so that the weights adjust to the quantization error and the loss converges accordingly (a small sketch follows this list).
- We insert fake quantize/dequantize operations between the layers of the network.
- For backpropagation, we approximate the gradient (commonly with a straight-through estimator), since the rounding in quantization is non-differentiable.
- QAT is more robust than PTQ, as the model learns to compensate for quantization error during the actual training.
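The following is a minimal sketch of the fake-quantize operation with a straight-through gradient; the class name FakeQuantSTE and the fixed scale are illustrative assumptions, not PyTorch's built-in QAT API.

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Simulated INT8 quantize->dequantize with a straight-through gradient."""

    @staticmethod
    def forward(ctx, x, scale):
        # Quantize onto the INT8 grid, then immediately dequantize back to float,
        # so downstream layers see the rounding error during training.
        q = torch.clamp(torch.round(x / scale), -128, 127)
        return q * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: treat the rounding as identity and pass
        # the incoming gradient unchanged (no gradient for the scale here).
        return grad_output, None

x = torch.randn(4, requires_grad=True)
y = FakeQuantSTE.apply(x, 0.1)
y.sum().backward()
print(x.grad)   # all ones: the gradient flowed straight through the rounding
```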
| Topic Name | PyTorch Tutorials |
|---|---|
| Post Training Dynamic Quantization | PTDQ_Tutorial.ipynb |
| Post Training Static Quantization | PTSQ_Tutorial.ipynb |
| Quantization Aware Training | QAT_Tutorial.ipynb |
| Quantization using Bits and Bytes | bitsandbytes.ipynb |
Articles
- How to Convert a Number from Decimal to IEEE 754 Floating Point Representation
- TensorFlow Post-Training Quantization
- TensorFlow Quantization-Aware Training