
Commit 8497ea7

Quant doc updates (#12240)
Initial draft of quantization doc updates (#10603)
1 parent 97047c0 commit 8497ea7

5 files changed: +104 -31 lines changed

docs/source/backend-template.md

Lines changed: 2 additions & 0 deletions
@@ -32,6 +32,8 @@ What quantization schemes does this backend support? Consider including the foll
 - Symmetric vs asymmetric weights?
 - Per-tensor, per-channel, group/blockwise?

+If using a PT2E quantizer, document how to initialize the quantizer and all relevant configs and options.
+
 Include a code snippet demonstrating how to perform quantization for this backend. Document, or link to, a description of the parameters that the user can specify.

 ## Runtime Integration

docs/source/backends-coreml.md

Lines changed: 9 additions & 7 deletions
@@ -86,12 +86,13 @@ To quantize a PyTorch model for the CoreML backend, use the `CoreMLQuantizer`. `

 ### 8-bit Quantization using the PT2E Flow

+Quantization with the CoreML backend requires exporting the model for iOS17 or later.
 To perform 8-bit quantization with the PT2E flow, perform the following steps:

 1) Define [coremltools.optimize.torch.quantization.LinearQuantizerConfig](https://apple.github.io/coremltools/source/coremltools.optimize.torch.quantization.html#coremltools.optimize.torch.quantization.LinearQuantizerConfig) and use it to create an instance of a `CoreMLQuantizer`.
 2) Use `torch.export.export_for_training` to export a graph module that will be prepared for quantization.
 3) Call `prepare_pt2e` to prepare the model for quantization.
-4) For static quantization, run the prepared model with representative samples to calibrate the quantized tensor activation ranges.
+4) Run the prepared model with representative samples to calibrate the quantized tensor activation ranges.
 5) Call `convert_pt2e` to quantize the model.
 6) Export and lower the model using the standard flow.

@@ -152,7 +153,9 @@ et_program = to_edge_transform_and_lower(
 ).to_executorch()
 ```

-The above does static quantization (activations and weights are quantized). Quantizing activations requires calibrating the model on representative data. You can also do weight-only quantization, which does not require calibration data, by specifying the activation_dtype to be torch.float32:
+The above does static quantization (activations and weights are quantized).
+
+You can see a full description of available quantization configs in the [coremltools documentation](https://apple.github.io/coremltools/source/coremltools.optimize.torch.quantization.html#coremltools.optimize.torch.quantization.LinearQuantizerConfig). For example, the config below will perform weight-only quantization:

 ```
 weight_only_8bit_config = ct.optimize.torch.quantization.LinearQuantizerConfig(
@@ -164,13 +167,12 @@ weight_only_8bit_config = ct.optimize.torch.quantization.LinearQuantizerConfig(
 )
 )
 quantizer = CoreMLQuantizer(weight_only_8bit_config)
-prepared_model = prepare_pt2e(training_gm, quantizer)
-quantized_model = convert_pt2e(prepared_model)
 ```

-Note that static quantization requires exporting the model for iOS17 or later.
+Quantizing activations requires calibrating the model on representative data. Also note that PT2E currently requires passing at least one calibration sample before calling `convert_pt2e`, even for data-free weight-only quantization.
+
+See [PyTorch 2 Export Post Training Quantization](https://docs.pytorch.org/ao/main/tutorials_source/pt2e_quant_ptq.html) for more information.

-See [PyTorch 2 Export Post Training Quantization](https://pytorch.org/tutorials/prototype/pt2e_quant_ptq.html) for more information.

 ----

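Illustrative sketch (editor addition, not part of this commit): a minimal end-to-end version of the CoreML PT2E steps above, reusing the `weight_only_8bit_config` and `quantizer` names from the snippet. `model` and `sample_inputs` are placeholders, and the `CoreMLQuantizer` import path is an assumption that may vary by release:

```python
import torch
from executorch.backends.apple.coreml.quantizer import CoreMLQuantizer  # assumed import path
from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e

quantizer = CoreMLQuantizer(weight_only_8bit_config)  # config defined in the snippet above

# Export a training-time graph module and prepare it with the CoreML quantizer
training_gm = torch.export.export_for_training(model, sample_inputs).module()
prepared_model = prepare_pt2e(training_gm, quantizer)

# PT2E currently needs at least one sample before convert_pt2e, even for weight-only quantization
prepared_model(*sample_inputs)

quantized_model = convert_pt2e(prepared_model)
```

From here the quantized model is exported and lowered with the standard CoreML flow.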
@@ -220,7 +222,7 @@ This happens because the model is in FP16, but CoreML interprets some of the arg
 2. coremltools/converters/mil/backend/mil/load.py", line 499, in export
    raise RuntimeError("BlobWriter not loaded")

-If you're using Python 3.13, try reducing your python version to Python 3.12. coremltools does not support Python 3.13, see this [issue](https://github.com/apple/coremltools/issues/2487).
+If you're using Python 3.13, try reducing your python version to Python 3.12. coremltools does not support Python 3.13, see this [issue](https://github.com/apple/coremltools/issues/2487).

 ### At runtime
 1. [ETCoreMLModelCompiler.mm:55] [Core ML] Failed to compile model, error = Error Domain=com.apple.mlassetio Code=1 "Failed to parse the model specification. Error: Unable to parse ML Program: at unknown location: Unknown opset 'CoreML7'." UserInfo={NSLocalizedDescription=Failed to par$

docs/source/backends-xnnpack.md

Lines changed: 37 additions & 1 deletion
@@ -117,7 +117,43 @@ et_program = to_edge_transform_and_lower( # (6)
 ).to_executorch()
 ```

-See [PyTorch 2 Export Post Training Quantization](https://pytorch.org/tutorials/prototype/pt2e_quant_ptq.html) for more information.
+See [PyTorch 2 Export Post Training Quantization](https://docs.pytorch.org/ao/main/tutorials_source/pt2e_quant_ptq.html) for more information.
+
+### LLM quantization with quantize_
+
+The XNNPACK backend also supports quantizing models with the [torchao](https://github.com/pytorch/ao) quantize_ API. This is most commonly used for LLMs, which require more advanced quantization. Since quantize_ is not backend-aware, it is important to use a config that is compatible with CPU/XNNPACK:
+
+* Quantize embeddings with IntxWeightOnlyConfig (with weight_dtype torch.int2, torch.int4, or torch.int8, using PerGroup or PerAxis granularity)
+* Quantize linear layers with Int8DynamicActivationIntxWeightConfig (with weight_dtype=torch.int4, using PerGroup or PerAxis granularity)
+
+Below is a simple example; a more detailed tutorial, including accuracy evaluation on popular LLM benchmarks, can be found in the [torchao documentation](https://docs.pytorch.org/ao/main/serving.html#mobile-deployment-with-executorch).
+
+```python
+import torch
+
+from torchao.quantization.granularity import PerGroup, PerAxis
+from torchao.quantization.quant_api import (
+    IntxWeightOnlyConfig,
+    Int8DynamicActivationIntxWeightConfig,
+    quantize_,
+)
+
+# Quantize embeddings with 8-bit weights, per channel
+embedding_config = IntxWeightOnlyConfig(
+    weight_dtype=torch.int8,
+    granularity=PerAxis(0),
+)
+quantize_(
+    eager_model,
+    embedding_config,
+    lambda m, fqn: isinstance(m, torch.nn.Embedding),
+)
+
+# Quantize linear layers with 8-bit dynamic activations and 4-bit weights
+linear_config = Int8DynamicActivationIntxWeightConfig(
+    weight_dtype=torch.int4,
+    weight_granularity=PerGroup(32),
+)
+quantize_(eager_model, linear_config)
+```

 ----

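Illustrative follow-up (editor addition, not part of this commit): after quantize_, the eager model is exported and lowered through the standard XNNPACK flow shown earlier on this page. `eager_model` and `sample_inputs` are placeholders, and the partitioner import path is an assumption:

```python
import torch
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner  # assumed path
from executorch.exir import to_edge_transform_and_lower

# Export the quantized eager model, then delegate supported ops to XNNPACK
exported = torch.export.export(eager_model, sample_inputs)
et_program = to_edge_transform_and_lower(
    exported,
    partitioner=[XnnpackPartitioner()],
).to_executorch()

# Serialize the program for on-device execution
with open("model.pte", "wb") as f:
    f.write(et_program.buffer)
```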
docs/source/index.md

Lines changed: 1 addition & 2 deletions
@@ -39,6 +39,7 @@ ExecuTorch provides support for:
 - [Runtime Integration](using-executorch-runtime-integration)
 - [Troubleshooting](using-executorch-troubleshooting)
 - [Building from Source](using-executorch-building-from-source)
+- [Quantization](quantization-overview)
 - [FAQs](using-executorch-faqs)
 #### Examples
 - [Android Demo Apps](https://github.com/pytorch-labs/executorch-examples/tree/main/dl3/android/DeepLabV3Demo#executorch-android-demo-app)
@@ -80,8 +81,6 @@ ExecuTorch provides support for:
 - [Runtime Python API Reference](runtime-python-api-reference)
 - [API Life Cycle](api-life-cycle)
 - [Javadoc](https://pytorch.org/executorch/main/javadoc/)
-#### Quantization
-- [Overview](quantization-overview)
 #### Kernel Library
 - [Overview](kernel-library-overview)
 - [Custom ATen Kernel](kernel-library-custom-aten-kernel)

docs/source/quantization-overview.md

Lines changed: 55 additions & 21 deletions
@@ -1,38 +1,72 @@
 # Quantization Overview
-Quantization is a process that reduces the precision of computations and lowers memory footprint in the model. To learn more, please visit the [ExecuTorch concepts page](concepts.md#quantization). This is particularly useful for edge devices including wearables, embedded devices and microcontrollers, which typically have limited resources such as processing power, memory, and battery life. By using quantization, we can make our models more efficient and enable them to run effectively on these devices.

-In terms of flow, quantization happens early in the ExecuTorch stack:
+Quantization is a technique that reduces the precision of numbers used in a model’s computations and stored weights—typically from 32-bit floats to 8-bit integers. This reduces the model’s memory footprint, speeds up inference, and lowers power consumption, often with minimal loss in accuracy.

-![ExecuTorch Entry Points](_static/img/executorch-entry-points.png)
+Quantization is especially important for deploying models on edge devices such as wearables, embedded systems, and microcontrollers, which often have limited compute, memory, and battery capacity. By quantizing models, we can make them significantly more efficient and suitable for these resource-constrained environments.

-A more detailed workflow can be found in the [ExecuTorch tutorial](https://pytorch.org/executorch/main/tutorials/export-to-executorch-tutorial).

-Quantization is usually tied to execution backends that have quantized operators implemented. Thus each backend is opinionated about how the model should be quantized, expressed in a backend specific ``Quantizer`` class. ``Quantizer`` provides API for modeling users in terms of how they want their model to be quantized and also passes on the user intention to quantization workflow.
+# Quantization in ExecuTorch
+ExecuTorch uses [torchao](https://github.com/pytorch/ao/tree/main/torchao) as its quantization library. This integration allows ExecuTorch to leverage PyTorch-native tools for preparing, calibrating, and converting quantized models.

-Backend developers will need to implement their own ``Quantizer`` to express how different operators or operator patterns are quantized in their backend. This is accomplished via [Annotation API](https://pytorch.org/tutorials/prototype/pt2e_quantizer.html) provided by quantization workflow. Since ``Quantizer`` is also user facing, it will expose specific APIs for modeling users to configure how they want the model to be quantized. Each backend should provide their own API documentation for their ``Quantizer``.

-Modeling users will use the ``Quantizer`` specific to their target backend to quantize their model, e.g. ``XNNPACKQuantizer``.
+Quantization in ExecuTorch is backend-specific. Each backend defines how models should be quantized based on its hardware capabilities. Most ExecuTorch backends use the torchao [PT2E quantization](https://docs.pytorch.org/ao/main/tutorials_source/pt2e_quant_ptq.html) flow, which works on models exported with torch.export and enables quantization that is tailored for each backend.

-For an example quantization flow with ``XNNPACKQuantizer``, more documentation and tutorials, please see ``Performing Quantization`` section in [ExecuTorch tutorial](https://pytorch.org/executorch/main/tutorials/export-to-executorch-tutorial).
+The PT2E quantization workflow has three main steps:

-## Source Quantization: Int8DynActInt4WeightQuantizer
+1. Configure a backend-specific quantizer.
+2. Prepare, calibrate, convert, and evaluate the quantized model in PyTorch.
+3. Lower the model to the target backend.

-In addition to export based quantization (described above), ExecuTorch wants to highlight source based quantizations, accomplished via [torchao](https://github.com/pytorch/ao). Unlike export based quantization, source based quantization directly modifies the model prior to export. One specific example is `Int8DynActInt4WeightQuantizer`.
+## 1. Configure a Backend-Specific Quantizer

-This scheme represents 4-bit weight quantization with 8-bit dynamic quantization of activation during inference.
+Each backend provides its own quantizer (e.g., XNNPACKQuantizer, CoreMLQuantizer) that defines how quantization should be applied to a model in a way that is compatible with the target hardware.
+These quantizers usually support configs that allow users to specify quantization options such as:

-Imported with ``from torchao.quantization.quant_api import Int8DynActInt4WeightQuantizer``, this class uses a quantization instance constructed with a specified dtype precision and groupsize, to mutate a provided ``nn.Module``.
+* Precision (e.g., 8-bit or 4-bit)
+* Quantization type (e.g., dynamic, static, or weight-only quantization)
+* Granularity (e.g., per-tensor, per-channel)

-```
-# Source Quant
-from torchao.quantization.quant_api import Int8DynActInt4WeightQuantizer
+Not all quantization options are supported by all backends. Consult the backend-specific guides for the supported quantization modes and configurations, and for how to initialize the backend-specific PT2E quantizer:
+
+* [XNNPACK quantization](backends-xnnpack.md#quantization)
+* [CoreML quantization](backends-coreml.md#quantization)
+
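Illustrative aside (editor addition, not part of this commit): configuring a backend-specific quantizer typically looks like the following for XNNPACK. The import path and helper names are assumptions that may differ between releases; check the XNNPACK backend guide linked above:

```python
from executorch.backends.xnnpack.quantizer.xnnpack_quantizer import (  # assumed path
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

# 8-bit symmetric quantization with per-channel weights, one of several supported configs
quantizer = XNNPACKQuantizer()
quantizer.set_global(get_symmetric_quantization_config(is_per_channel=True))
```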
+## 2. Quantize and evaluate the model
+
+After the backend-specific quantizer is defined, the PT2E quantization flow is the same for all backends. A generic example is provided below, but specific examples are given in the backend documentation:
+
+```python
+import torch
+from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e
+
+training_gm = torch.export.export(model, sample_inputs).module()

-model = Int8DynActInt4WeightQuantizer(precision=torch_dtype, groupsize=group_size).quantize(model)
+# Prepare the model for quantization using the backend-specific quantizer instance
+prepared_model = prepare_pt2e(training_gm, quantizer)

-# Export to ExecuTorch
-from executorch.exir import to_edge
-from torch.export import export

-exported_model = export(model, ...)
-et_program = to_edge(exported_model, ...).to_executorch(...)
+# Calibrate the model on representative data
+for sample in calibration_data:
+    prepared_model(sample)
+
+# Convert the calibrated model to a quantized model
+quantized_model = convert_pt2e(prepared_model)
+```
+
+The `quantized_model` is a PyTorch model like any other, and can be evaluated on different tasks for accuracy.
+Task-specific benchmarks are the recommended way to evaluate your quantized model, but as a crude alternative you can compare its outputs with those of the original model using generic error metrics like SQNR:
+
+```python
+from torchao.quantization.utils import compute_error
+out_reference = model(sample)
+out_quantized = quantized_model(sample)
+sqnr = compute_error(out_reference, out_quantized)  # SQNR; higher means closer to the reference
 ```
+
+Note that on-device numerics can differ from those in PyTorch even for unquantized models, and accuracy evaluation can also be done with the ExecuTorch pybindings or on device.
+
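Illustrative aside (editor addition, not part of this commit): evaluation through the Python pybindings might look roughly like this. The `executorch.runtime` API surface shown here is an assumption; consult the Runtime Python API Reference for the exact interface:

```python
import torch
from executorch.runtime import Runtime  # assumed pybindings API

runtime = Runtime.get()
program = runtime.load_program("model.pte")  # the lowered, quantized program
method = program.load_method("forward")

# Compare the runtime output against the eager PyTorch output on the same sample
et_out = method.execute([sample])[0]
ref_out = model(sample)
print(torch.max(torch.abs(et_out - ref_out)))
```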
+## 3. Lower the model
+
+The final step is to lower the `quantized_model` to the desired backend, just as you would an unquantized model. See the [backend-specific pages](backends-overview.md) for lowering information.

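Illustrative sketch (editor addition, not part of this commit): lowering the quantized model follows the same export-and-lower flow as any other model. XNNPACK is used here purely as an example backend, and the partitioner import path is an assumption:

```python
import torch
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner  # example backend, assumed path
from executorch.exir import to_edge_transform_and_lower

# Export the quantized model and delegate supported subgraphs to the backend
et_program = to_edge_transform_and_lower(
    torch.export.export(quantized_model, sample_inputs),
    partitioner=[XnnpackPartitioner()],  # swap in the partitioner for your target backend
).to_executorch()
```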