Commit d2842e5

Update QAT docs, highlight axolotl integration (#2266)
* updating docs
* updating docs
* updating docs
* updating qat readme
1 parent a581609 commit d2842e5

File tree

2 files changed: +26, -3 lines changed


README.md

Lines changed: 1 addition & 0 deletions
@@ -213,6 +213,7 @@ We're also fortunate to be integrated into some of the leading open-source libra
4. [TorchTune](https://pytorch.org/torchtune/main/tutorials/qlora_finetune.html?highlight=qlora) for our QLoRA and QAT recipes
5. VLLM for LLM serving: [usage](https://docs.vllm.ai/en/latest/features/quantization/torchao.html)
6. SGLang for LLM serving: [usage](https://docs.sglang.ai/backend/server_arguments.html#server-arguments) and the major [PR](https://github.com/sgl-project/sglang/pull/1341).
+7. Axolotl for [QAT](https://docs.axolotl.ai/docs/qat.html) and [PTQ](https://docs.axolotl.ai/docs/quantize.html)

## Videos
* [Keynote talk at GPU MODE IRL](https://youtu.be/FH5wiwOyPX4?si=VZK22hHz25GRzBG1&t=1009)

torchao/quantization/qat/README.md

Lines changed: 25 additions & 3 deletions
@@ -115,11 +115,20 @@ To fake quantize embedding in addition to linear, you can additionally call
the following with a filter function during the prepare step:

```
-from torchao.quantization.quant_api import _is_linear
+# first apply linear transformation to the model as above
+activation_config = FakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False)
+weight_config = FakeQuantizeConfig(torch.int4, group_size=32)
+quantize_(
+    model,
+    IntXQuantizationAwareTrainingConfig(activation_config, weight_config),
+)
+
+# then apply weight-only transformation to embedding layers
+# activation fake quantization is not supported for embedding layers
quantize_(
    m,
-    IntXQuantizationAwareTrainingConfig(weight_config=weight_config),
-    filter_fn=lambda m, _: isinstance(m, torch.nn.Embedding) or _is_linear(m),
+    IntXQuantizationAwareTrainingConfig(weight_config=weight_config),
+    filter_fn=lambda m, _: isinstance(m, torch.nn.Embedding)
)
```
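
For convenience, the prepare-step snippet added above can also be run end to end outside the diff context. The sketch below is illustrative only: the toy `nn.Sequential` model is a stand-in for the real network, and the import paths (`quantize_` from `torchao.quantization`; `FakeQuantizeConfig` and `IntXQuantizationAwareTrainingConfig` from `torchao.quantization.qat`) are assumptions here, so check the prepare example earlier in this README for the exact ones.

```python
# Minimal sketch of the prepare step shown in the diff above.
# The toy model and import paths are assumptions for illustration only.
import torch
from torchao.quantization import quantize_
from torchao.quantization.qat import (
    FakeQuantizeConfig,
    IntXQuantizationAwareTrainingConfig,
)

# toy stand-in for the real model
model = torch.nn.Sequential(
    torch.nn.Embedding(1024, 128),
    torch.nn.Linear(128, 128),
)

# 1) linear layers: fake quantize int8 per-token activations and
#    int4 grouped weights (quantize_ targets linear layers by default)
activation_config = FakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False)
weight_config = FakeQuantizeConfig(torch.int4, group_size=32)
quantize_(
    model,
    IntXQuantizationAwareTrainingConfig(activation_config, weight_config),
)

# 2) embedding layers: weight-only fake quantization
#    (activation fake quantization is not supported for embeddings)
quantize_(
    model,
    IntXQuantizationAwareTrainingConfig(weight_config=weight_config),
    filter_fn=lambda m, _: isinstance(m, torch.nn.Embedding),
)
```

After fine-tuning, the fake-quantized modules are swapped for actually quantized ones using the convert step described in this README.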

@@ -193,6 +202,19 @@ tune run --nnodes 1 --nproc_per_node 4 qat_lora_finetune_distributed --config ll

For more detail, please refer to [this QAT tutorial](https://pytorch.org/torchtune/main/tutorials/qat_finetune.html).

+## Axolotl integration
+
+[Axolotl](https://github.com/axolotl-ai-cloud) uses torchao to support quantization-aware fine-tuning. You can use the following commands to fine-tune and then quantize a Llama-3.2-3B model:
+
+```bash
+axolotl train examples/llama-3/3b-qat-fsdp2.yaml
+# once training is complete, perform the quantization step
+axolotl quantize examples/llama-3/3b-qat-fsdp2.yaml
+# you should now have a quantized model saved in ./outputs/qat_out/quatized
+```
+
+Please see the [QAT documentation](https://docs.axolotl.ai/docs/qat.html) in Axolotl for more details.
+
## Evaluation Results

Evaluation was performed on 6-8 A100 GPUs (80GB each) using the torchtune QAT
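
For comparison with the one-command Axolotl flow added above, the quantize-after-QAT step can also be expressed directly in torchao. The sketch below is not what `axolotl quantize` runs internally (Axolotl derives the exact settings from its config file); it only illustrates the convert flow on a toy model, and the names `FromIntXQuantizationAwareTrainingConfig` and `int8_dynamic_activation_int4_weight` are assumed torchao APIs matching the int8-activation / int4-weight settings used earlier.

```python
# Illustrative sketch only (see assumptions above): convert a QAT-prepared
# model into an actually quantized one after fine-tuning.
import torch
from torchao.quantization import int8_dynamic_activation_int4_weight, quantize_
from torchao.quantization.qat import (
    FakeQuantizeConfig,
    FromIntXQuantizationAwareTrainingConfig,
    IntXQuantizationAwareTrainingConfig,
)

# toy model prepared for QAT, as in the prepare example earlier
model = torch.nn.Sequential(torch.nn.Linear(128, 128))
activation_config = FakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False)
weight_config = FakeQuantizeConfig(torch.int4, group_size=32)
quantize_(
    model,
    IntXQuantizationAwareTrainingConfig(activation_config, weight_config),
)

# ... fine-tune the model here ...

# convert: swap fake-quantized modules back to their plain counterparts,
# then apply real int8-activation / int4-weight post-training quantization
quantize_(model, FromIntXQuantizationAwareTrainingConfig())
quantize_(model, int8_dynamic_activation_int4_weight(group_size=32))
```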
