
Commit 8497ea7

Quant doc updates (#12240)
Initial draft of quantization doc updates (#10603)
1 parent 97047c0 commit 8497ea7

5 files changed: +104 -31 lines changed

docs/source/backend-template.md

Lines changed: 2 additions & 0 deletions
@@ -32,6 +32,8 @@ What quantization schemes does this backend support? Consider including the foll
 - Symmetric vs asymmetric weights?
 - Per-tensor, per-channel, group/blockwise?

+If using a PT2E quantizer, document how to initialize the quantizer and all relevant configs and options.
+
 Include a code snippet demonstrating how to perform quantization for this backend. Document, or link to, a description of the parameters that the user can specify.

 ## Runtime Integration

docs/source/backends-coreml.md

Lines changed: 9 additions & 7 deletions
@@ -86,12 +86,13 @@ To quantize a PyTorch model for the CoreML backend, use the `CoreMLQuantizer`. `

 ### 8-bit Quantization using the PT2E Flow

+Quantization with the CoreML backend requires exporting the model for iOS17 or later.
 To perform 8-bit quantization with the PT2E flow, perform the following steps:

 1) Define [coremltools.optimize.torch.quantization.LinearQuantizerConfig](https://apple.github.io/coremltools/source/coremltools.optimize.torch.quantization.html#coremltools.optimize.torch.quantization.LinearQuantizerConfig) and use it to create an instance of a `CoreMLQuantizer`.
 2) Use `torch.export.export_for_training` to export a graph module that will be prepared for quantization.
 3) Call `prepare_pt2e` to prepare the model for quantization.
-4) For static quantization, run the prepared model with representative samples to calibrate the quantized tensor activation ranges.
+4) Run the prepared model with representative samples to calibrate the quantized tensor activation ranges.
 5) Call `convert_pt2e` to quantize the model.
 6) Export and lower the model using the standard flow.

@@ -152,7 +153,9 @@ et_program = to_edge_transform_and_lower(
 ).to_executorch()
 ```

-The above does static quantization (activations and weights are quantized). Quantizing activations requires calibrating the model on representative data. You can also do weight-only quantization, which does not require calibration data, by specifying the activation_dtype to be torch.float32:
+The above does static quantization (activations and weights are quantized).
+
+You can see a full description of available quantization configs in the [coremltools documentation](https://apple.github.io/coremltools/source/coremltools.optimize.torch.quantization.html#coremltools.optimize.torch.quantization.LinearQuantizerConfig). For example, the config below will perform weight-only quantization:

 ```
 weight_only_8bit_config = ct.optimize.torch.quantization.LinearQuantizerConfig(
@@ -164,13 +167,12 @@ weight_only_8bit_config = ct.optimize.torch.quantization.LinearQuantizerConfig(
 )
 )
 quantizer = CoreMLQuantizer(weight_only_8bit_config)
-prepared_model = prepare_pt2e(training_gm, quantizer)
-quantized_model = convert_pt2e(prepared_model)
 ```

-Note that static quantization requires exporting the model for iOS17 or later.
+Quantizing activations requires calibrating the model on representative data. Also note that PT2E currently requires passing at least one calibration sample before calling `convert_pt2e`, even for data-free weight-only quantization.
+
+See [PyTorch 2 Export Post Training Quantization](https://docs.pytorch.org/ao/main/tutorials_source/pt2e_quant_ptq.html) for more information.

-See [PyTorch 2 Export Post Training Quantization](https://pytorch.org/tutorials/prototype/pt2e_quant_ptq.html) for more information.

 ----

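Illustrative sketch (editor addition, not part of this commit): a minimal end-to-end version of the CoreML PT2E steps above, reusing the `weight_only_8bit_config` and `quantizer` names from the snippet. `model` and `sample_inputs` are placeholders, and the `CoreMLQuantizer` import path is an assumption that may vary by release:

```python
import torch
from executorch.backends.apple.coreml.quantizer import CoreMLQuantizer  # assumed import path
from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e

quantizer = CoreMLQuantizer(weight_only_8bit_config)  # config defined in the snippet above

# Export a training-time graph module and prepare it with the CoreML quantizer
training_gm = torch.export.export_for_training(model, sample_inputs).module()
prepared_model = prepare_pt2e(training_gm, quantizer)

# PT2E currently needs at least one sample before convert_pt2e, even for weight-only quantization
prepared_model(*sample_inputs)

quantized_model = convert_pt2e(prepared_model)
```

From here the quantized model is exported and lowered with the standard CoreML flow.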
@@ -220,7 +222,7 @@ This happens because the model is in FP16, but CoreML interprets some of the arg
 2. coremltools/converters/mil/backend/mil/load.py", line 499, in export
    raise RuntimeError("BlobWriter not loaded")

-If you're using Python 3.13, try reducing your python version to Python 3.12. coremltools does not support Python 3.13, see this [issue](https://github.com/apple/coremltools/issues/2487).
+If you're using Python 3.13, try reducing your python version to Python 3.12. coremltools does not support Python 3.13, see this [issue](https://github.com/apple/coremltools/issues/2487).

 ### At runtime
 1. [ETCoreMLModelCompiler.mm:55] [Core ML] Failed to compile model, error = Error Domain=com.apple.mlassetio Code=1 "Failed to parse the model specification. Error: Unable to parse ML Program: at unknown location: Unknown opset 'CoreML7'." UserInfo={NSLocalizedDescription=Failed to par$

docs/source/backends-xnnpack.md

Lines changed: 37 additions & 1 deletion
@@ -117,7 +117,43 @@ et_program = to_edge_transform_and_lower( # (6)
 ).to_executorch()
 ```

-See [PyTorch 2 Export Post Training Quantization](https://pytorch.org/tutorials/prototype/pt2e_quant_ptq.html) for more information.
+See [PyTorch 2 Export Post Training Quantization](https://docs.pytorch.org/ao/main/tutorials_source/pt2e_quant_ptq.html) for more information.
+
+### LLM quantization with quantize_
+
+The XNNPACK backend also supports quantizing models with the [torchao](https://github.com/pytorch/ao) quantize_ API. This is most commonly used for LLMs, which require more advanced quantization. Since quantize_ is not backend-aware, it is important to use a config that is compatible with CPU/XNNPACK:
+
+* Quantize embeddings with IntxWeightOnlyConfig (with weight_dtype torch.int2, torch.int4, or torch.int8, using PerGroup or PerAxis granularity)
+* Quantize linear layers with Int8DynamicActivationIntxWeightConfig (with weight_dtype=torch.int4, using PerGroup or PerAxis granularity)
+
+Below is a simple example; a more detailed tutorial, including accuracy evaluation on popular LLM benchmarks, can be found in the [torchao documentation](https://docs.pytorch.org/ao/main/serving.html#mobile-deployment-with-executorch).
+
+```python
+import torch
+
+from torchao.quantization.granularity import PerGroup, PerAxis
+from torchao.quantization.quant_api import (
+    IntxWeightOnlyConfig,
+    Int8DynamicActivationIntxWeightConfig,
+    quantize_,
+)
+
+# Quantize embeddings with 8-bit weights, per channel
+embedding_config = IntxWeightOnlyConfig(
+    weight_dtype=torch.int8,
+    granularity=PerAxis(0),
+)
+quantize_(
+    eager_model,
+    embedding_config,
+    lambda m, fqn: isinstance(m, torch.nn.Embedding),
+)
+
+# Quantize linear layers with 8-bit dynamic activations and 4-bit weights
+linear_config = Int8DynamicActivationIntxWeightConfig(
+    weight_dtype=torch.int4,
+    weight_granularity=PerGroup(32),
+)
+quantize_(eager_model, linear_config)
+```

 ----

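Illustrative follow-up (editor addition, not part of this commit): after quantize_, the eager model is exported and lowered through the standard XNNPACK flow shown earlier on this page. `eager_model` and `sample_inputs` are placeholders, and the partitioner import path is an assumption:

```python
import torch
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner  # assumed path
from executorch.exir import to_edge_transform_and_lower

# Export the quantized eager model, then delegate supported ops to XNNPACK
exported = torch.export.export(eager_model, sample_inputs)
et_program = to_edge_transform_and_lower(
    exported,
    partitioner=[XnnpackPartitioner()],
).to_executorch()

# Serialize the program for on-device execution
with open("model.pte", "wb") as f:
    f.write(et_program.buffer)
```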
docs/source/index.md

Lines changed: 1 addition & 2 deletions
@@ -39,6 +39,7 @@ ExecuTorch provides support for:
 - [Runtime Integration](using-executorch-runtime-integration)
 - [Troubleshooting](using-executorch-troubleshooting)
 - [Building from Source](using-executorch-building-from-source)
+- [Quantization](quantization-overview)
 - [FAQs](using-executorch-faqs)
 #### Examples
 - [Android Demo Apps](https://github.com/pytorch-labs/executorch-examples/tree/main/dl3/android/DeepLabV3Demo#executorch-android-demo-app)
@@ -80,8 +81,6 @@ ExecuTorch provides support for:
 - [Runtime Python API Reference](runtime-python-api-reference)
 - [API Life Cycle](api-life-cycle)
 - [Javadoc](https://pytorch.org/executorch/main/javadoc/)
-#### Quantization
-- [Overview](quantization-overview)
 #### Kernel Library
 - [Overview](kernel-library-overview)
 - [Custom ATen Kernel](kernel-library-custom-aten-kernel)

docs/source/quantization-overview.md

Lines changed: 55 additions & 21 deletions
@@ -1,38 +1,72 @@
 # Quantization Overview
-Quantization is a process that reduces the precision of computations and lowers memory footprint in the model. To learn more, please visit the [ExecuTorch concepts page](concepts.md#quantization). This is particularly useful for edge devices including wearables, embedded devices and microcontrollers, which typically have limited resources such as processing power, memory, and battery life. By using quantization, we can make our models more efficient and enable them to run effectively on these devices.

-In terms of flow, quantization happens early in the ExecuTorch stack:
+Quantization is a technique that reduces the precision of numbers used in a model’s computations and stored weights—typically from 32-bit floats to 8-bit integers. This reduces the model’s memory footprint, speeds up inference, and lowers power consumption, often with minimal loss in accuracy.

-![ExecuTorch Entry Points](_static/img/executorch-entry-points.png)
+Quantization is especially important for deploying models on edge devices such as wearables, embedded systems, and microcontrollers, which often have limited compute, memory, and battery capacity. By quantizing models, we can make them significantly more efficient and suitable for these resource-constrained environments.

-A more detailed workflow can be found in the [ExecuTorch tutorial](https://pytorch.org/executorch/main/tutorials/export-to-executorch-tutorial).

-Quantization is usually tied to execution backends that have quantized operators implemented. Thus each backend is opinionated about how the model should be quantized, expressed in a backend specific ``Quantizer`` class. ``Quantizer`` provides API for modeling users in terms of how they want their model to be quantized and also passes on the user intention to quantization workflow.
+# Quantization in ExecuTorch
+ExecuTorch uses [torchao](https://github.com/pytorch/ao/tree/main/torchao) as its quantization library. This integration allows ExecuTorch to leverage PyTorch-native tools for preparing, calibrating, and converting quantized models.

-Backend developers will need to implement their own ``Quantizer`` to express how different operators or operator patterns are quantized in their backend. This is accomplished via [Annotation API](https://pytorch.org/tutorials/prototype/pt2e_quantizer.html) provided by quantization workflow. Since ``Quantizer`` is also user facing, it will expose specific APIs for modeling users to configure how they want the model to be quantized. Each backend should provide their own API documentation for their ``Quantizer``.

-Modeling users will use the ``Quantizer`` specific to their target backend to quantize their model, e.g. ``XNNPACKQuantizer``.
+Quantization in ExecuTorch is backend-specific. Each backend defines how models should be quantized based on its hardware capabilities. Most ExecuTorch backends use the torchao [PT2E quantization](https://docs.pytorch.org/ao/main/tutorials_source/pt2e_quant_ptq.html) flow, which works on models exported with torch.export and enables quantization that is tailored for each backend.

-For an example quantization flow with ``XNNPACKQuantizer``, more documentation and tutorials, please see ``Performing Quantization`` section in [ExecuTorch tutorial](https://pytorch.org/executorch/main/tutorials/export-to-executorch-tutorial).
+The PT2E quantization workflow has three main steps:

-## Source Quantization: Int8DynActInt4WeightQuantizer
+1. Configure a backend-specific quantizer.
+2. Prepare, calibrate, convert, and evaluate the quantized model in PyTorch.
+3. Lower the model to the target backend.

-In addition to export based quantization (described above), ExecuTorch wants to highlight source based quantizations, accomplished via [torchao](https://github.com/pytorch/ao). Unlike export based quantization, source based quantization directly modifies the model prior to export. One specific example is `Int8DynActInt4WeightQuantizer`.
+## 1. Configure a Backend-Specific Quantizer

-This scheme represents 4-bit weight quantization with 8-bit dynamic quantization of activation during inference.
+Each backend provides its own quantizer (e.g., XNNPACKQuantizer, CoreMLQuantizer) that defines how quantization should be applied to a model in a way that is compatible with the target hardware.
+These quantizers usually support configs that allow users to specify quantization options such as:

-Imported with ``from torchao.quantization.quant_api import Int8DynActInt4WeightQuantizer``, this class uses a quantization instance constructed with a specified dtype precision and groupsize, to mutate a provided ``nn.Module``.
+* Precision (e.g., 8-bit or 4-bit)
+* Quantization type (e.g., dynamic, static, or weight-only quantization)
+* Granularity (e.g., per-tensor, per-channel)

-```
-# Source Quant
-from torchao.quantization.quant_api import Int8DynActInt4WeightQuantizer
+Not all quantization options are supported by all backends. Consult the backend-specific guides for the supported quantization modes and configurations, and for how to initialize the backend-specific PT2E quantizer:
+
+* [XNNPACK quantization](backends-xnnpack.md#quantization)
+* [CoreML quantization](backends-coreml.md#quantization)
+
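Illustrative aside (editor addition, not part of this commit): configuring a backend-specific quantizer typically looks like the following for XNNPACK. The import path and helper names are assumptions that may differ between releases; check the XNNPACK backend guide linked above:

```python
from executorch.backends.xnnpack.quantizer.xnnpack_quantizer import (  # assumed path
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

# 8-bit symmetric quantization with per-channel weights, one of several supported configs
quantizer = XNNPACKQuantizer()
quantizer.set_global(get_symmetric_quantization_config(is_per_channel=True))
```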
+## 2. Quantize and evaluate the model
+
+After the backend-specific quantizer is defined, the PT2E quantization flow is the same for all backends. A generic example is provided below, but specific examples are given in the backend documentation:
+
+```python
+import torch
+from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e
+
+training_gm = torch.export.export(model, sample_inputs).module()

-model = Int8DynActInt4WeightQuantizer(precision=torch_dtype, groupsize=group_size).quantize(model)
+# Prepare the model for quantization using the backend-specific quantizer instance
+prepared_model = prepare_pt2e(training_gm, quantizer)

-# Export to ExecuTorch
-from executorch.exir import to_edge
-from torch.export import export

-exported_model = export(model, ...)
-et_program = to_edge(exported_model, ...).to_executorch(...)
+# Calibrate the model on representative data
+for sample in calibration_data:
+    prepared_model(sample)
+
+# Convert the calibrated model to a quantized model
+quantized_model = convert_pt2e(prepared_model)
+```
+
+The `quantized_model` is a PyTorch model like any other, and can be evaluated on different tasks for accuracy.
+Task-specific benchmarks are the recommended way to evaluate your quantized model, but as a crude alternative you can compare its outputs with those of the original model using generic error metrics like SQNR:
+
+```python
+from torchao.quantization.utils import compute_error
+out_reference = model(sample)
+out_quantized = quantized_model(sample)
+sqnr = compute_error(out_reference, out_quantized)  # SQNR; higher means closer to the reference
 ```
+
+Note that on-device numerics can differ from those in PyTorch even for unquantized models, and accuracy evaluation can also be done with the ExecuTorch pybindings or on device.
+
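Illustrative aside (editor addition, not part of this commit): evaluation through the Python pybindings might look roughly like this. The `executorch.runtime` API surface shown here is an assumption; consult the Runtime Python API Reference for the exact interface:

```python
import torch
from executorch.runtime import Runtime  # assumed pybindings API

runtime = Runtime.get()
program = runtime.load_program("model.pte")  # the lowered, quantized program
method = program.load_method("forward")

# Compare the runtime output against the eager PyTorch output on the same sample
et_out = method.execute([sample])[0]
ref_out = model(sample)
print(torch.max(torch.abs(et_out - ref_out)))
```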
+## 3. Lower the model
+
+The final step is to lower the `quantized_model` to the desired backend, just as you would an unquantized model. See the [backend-specific pages](backends-overview.md) for lowering information.

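Illustrative sketch (editor addition, not part of this commit): lowering the quantized model follows the same export-and-lower flow as any other model. XNNPACK is used here purely as an example backend, and the partitioner import path is an assumption:

```python
import torch
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner  # example backend, assumed path
from executorch.exir import to_edge_transform_and_lower

# Export the quantized model and delegate supported subgraphs to the backend
et_program = to_edge_transform_and_lower(
    torch.export.export(quantized_model, sample_inputs),
    partitioner=[XnnpackPartitioner()],  # swap in the partitioner for your target backend
).to_executorch()
```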