Commit d2842e5

Update QAT docs, highlight axolotl integration (#2266)
* updating docs
* updating docs
* updating docs
* updating qat readme
1 parent a581609 commit d2842e5

File tree

2 files changed: +26, -3 lines changed


README.md

Lines changed: 1 addition & 0 deletions
@@ -213,6 +213,7 @@ We're also fortunate to be integrated into some of the leading open-source libra
4. [TorchTune](https://pytorch.org/torchtune/main/tutorials/qlora_finetune.html?highlight=qlora) for our QLoRA and QAT recipes
5. VLLM for LLM serving: [usage](https://docs.vllm.ai/en/latest/features/quantization/torchao.html)
6. SGLang for LLM serving: [usage](https://docs.sglang.ai/backend/server_arguments.html#server-arguments) and the major [PR](https://github.com/sgl-project/sglang/pull/1341).
+7. Axolotl for [QAT](https://docs.axolotl.ai/docs/qat.html) and [PTQ](https://docs.axolotl.ai/docs/quantize.html)

## Videos
* [Keynote talk at GPU MODE IRL](https://youtu.be/FH5wiwOyPX4?si=VZK22hHz25GRzBG1&t=1009)

torchao/quantization/qat/README.md

Lines changed: 25 additions & 3 deletions
@@ -115,11 +115,20 @@ To fake quantize embedding in addition to linear, you can additionally call
the following with a filter function during the prepare step:

```
-from torchao.quantization.quant_api import _is_linear
+# first apply linear transformation to the model as above
+activation_config = FakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False)
+weight_config = FakeQuantizeConfig(torch.int4, group_size=32)
+quantize_(
+    model,
+    IntXQuantizationAwareTrainingConfig(activation_config, weight_config),
+)
+
+# then apply weight-only transformation to embedding layers
+# activation fake quantization is not supported for embedding layers
quantize_(
    m,
-    IntXQuantizationAwareTrainingConfig(weight_config=weight_config),
-    filter_fn=lambda m, _: isinstance(m, torch.nn.Embedding) or _is_linear(m),
+    IntXQuantizationAwareTrainingConfig(weight_config=weight_config),
+    filter_fn=lambda m, _: isinstance(m, torch.nn.Embedding)
)
```
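
For convenience, the prepare-step snippet added above can also be run end to end outside the diff context. The sketch below is illustrative only: the toy `nn.Sequential` model is a stand-in for the real network, and the import paths (`quantize_` from `torchao.quantization`; `FakeQuantizeConfig` and `IntXQuantizationAwareTrainingConfig` from `torchao.quantization.qat`) are assumptions here, so check the prepare example earlier in this README for the exact ones.

```python
# Minimal sketch of the prepare step shown in the diff above.
# The toy model and import paths are assumptions for illustration only.
import torch
from torchao.quantization import quantize_
from torchao.quantization.qat import (
    FakeQuantizeConfig,
    IntXQuantizationAwareTrainingConfig,
)

# toy stand-in for the real model
model = torch.nn.Sequential(
    torch.nn.Embedding(1024, 128),
    torch.nn.Linear(128, 128),
)

# 1) linear layers: fake quantize int8 per-token activations and
#    int4 grouped weights (quantize_ targets linear layers by default)
activation_config = FakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False)
weight_config = FakeQuantizeConfig(torch.int4, group_size=32)
quantize_(
    model,
    IntXQuantizationAwareTrainingConfig(activation_config, weight_config),
)

# 2) embedding layers: weight-only fake quantization
#    (activation fake quantization is not supported for embeddings)
quantize_(
    model,
    IntXQuantizationAwareTrainingConfig(weight_config=weight_config),
    filter_fn=lambda m, _: isinstance(m, torch.nn.Embedding),
)
```

After fine-tuning, the fake-quantized modules are swapped for actually quantized ones using the convert step described in this README.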

@@ -193,6 +202,19 @@ tune run --nnodes 1 --nproc_per_node 4 qat_lora_finetune_distributed --config ll

For more detail, please refer to [this QAT tutorial](https://pytorch.org/torchtune/main/tutorials/qat_finetune.html).

+## Axolotl integration
+
+[Axolotl](https://github.com/axolotl-ai-cloud) uses torchao to support quantization-aware fine-tuning. You can use the following commands to fine-tune and then quantize a Llama-3.2-3B model:
+
+```bash
+axolotl train examples/llama-3/3b-qat-fsdp2.yaml
+# once training is complete, perform the quantization step
+axolotl quantize examples/llama-3/3b-qat-fsdp2.yaml
+# you should now have a quantized model saved in ./outputs/qat_out/quatized
+```
+
+Please see the [QAT documentation](https://docs.axolotl.ai/docs/qat.html) in Axolotl for more details.
+
## Evaluation Results

Evaluation was performed on 6-8 A100 GPUs (80GB each) using the torchtune QAT
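
For comparison with the one-command Axolotl flow added above, the quantize-after-QAT step can also be expressed directly in torchao. The sketch below is not what `axolotl quantize` runs internally (Axolotl derives the exact settings from its config file); it only illustrates the convert flow on a toy model, and the names `FromIntXQuantizationAwareTrainingConfig` and `int8_dynamic_activation_int4_weight` are assumed torchao APIs matching the int8-activation / int4-weight settings used earlier.

```python
# Illustrative sketch only (see assumptions above): convert a QAT-prepared
# model into an actually quantized one after fine-tuning.
import torch
from torchao.quantization import int8_dynamic_activation_int4_weight, quantize_
from torchao.quantization.qat import (
    FakeQuantizeConfig,
    FromIntXQuantizationAwareTrainingConfig,
    IntXQuantizationAwareTrainingConfig,
)

# toy model prepared for QAT, as in the prepare example earlier
model = torch.nn.Sequential(torch.nn.Linear(128, 128))
activation_config = FakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False)
weight_config = FakeQuantizeConfig(torch.int4, group_size=32)
quantize_(
    model,
    IntXQuantizationAwareTrainingConfig(activation_config, weight_config),
)

# ... fine-tune the model here ...

# convert: swap fake-quantized modules back to their plain counterparts,
# then apply real int8-activation / int4-weight post-training quantization
quantize_(model, FromIntXQuantizationAwareTrainingConfig())
quantize_(model, int8_dynamic_activation_int4_weight(group_size=32))
```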
