### Post Training Quantization
Quantizing and sparsifying your models is a one-liner that should work on any model with an `nn.Linear`, including your favorite HuggingFace model.

There are 2 methods of post-training quantization, shown in the code snippets below:

1. Using torchao APIs directly.
2. Loading a huggingface model with a quantization config.

#### Quantizing for inference with torchao APIs
```python
import torch
from torchao.quantization.quant_api import (
    quantize_,
    Int4WeightOnlyConfig,
)

# A toy model; any model with nn.Linear layers works. The int4
# weight-only kernels target bfloat16 weights on CUDA.
m = torch.nn.Sequential(torch.nn.Linear(32, 32)).to(torch.bfloat16).cuda()

quantize_(m, Int4WeightOnlyConfig())
```
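
As a quick sanity check, assuming the toy `m` from the snippet above, the quantized module still runs like any other `nn.Module`:

```python
x = torch.randn(1, 32, dtype=torch.bfloat16, device="cuda")
with torch.no_grad():
    print(m(x).shape)  # torch.Size([1, 32])
```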
You can find more comprehensive usage instructions for quantization [here](torchao/quantization/) and for sparsity [here](/torchao/_models/sam/README.md).
#### Quantizing for inference with huggingface configs
See [docs](https://huggingface.co/docs/transformers/main/en/quantization/torchao) for more details.
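
A minimal sketch of that flow, assuming a recent `transformers` release with the torchao integration (the model id below is a placeholder, substitute your own):

```python
from transformers import AutoModelForCausalLM, TorchAoConfig
from torchao.quantization import Int4WeightOnlyConfig

# Placeholder checkpoint; any causal LM you have access to works.
model_id = "meta-llama/Llama-3.1-8B-Instruct"

# Wrap a torchao config in transformers' TorchAoConfig and load as usual.
quantization_config = TorchAoConfig(quant_type=Int4WeightOnlyConfig(group_size=128))
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    quantization_config=quantization_config,
)
```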
For inference, we have the following options (the first two are sketched in code below):

1. Quantize only the weights: works best for memory-bound models
2. Quantize the weights and activations: works best for compute-bound models
3. Quantize the activations and weights and sparsify the weights
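
A rough sketch of the first two options, assuming the config names currently exported from `torchao.quantization` (the toy models stand in for your own); for the third, see the sparsity README linked above:

```python
import torch
from torchao.quantization import (
    quantize_,
    Int8WeightOnlyConfig,
    Int8DynamicActivationInt8WeightConfig,
)

model_wo = torch.nn.Sequential(torch.nn.Linear(64, 64))
model_dq = torch.nn.Sequential(torch.nn.Linear(64, 64))

# Option 1: weight-only quantization, best when memory bound.
quantize_(model_wo, Int8WeightOnlyConfig())

# Option 2: quantize weights and dynamically quantize activations,
# best when compute bound.
quantize_(model_dq, Int8DynamicActivationInt8WeightConfig())
```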
For gpt-fast, `Int4WeightOnlyConfig()` is the best option at bs=1 as it **doubles the tok/s and reduces the VRAM requirements by about 65%** over a torch.compiled baseline.

If you don't have enough VRAM to quantize your entire model on GPU, and you find CPU quantization too slow, you can use the device argument like so: `quantize_(model, Int8WeightOnlyConfig(), device="cuda")`, which will send and quantize each layer individually to your GPU.

If you're unsure which option to pick, you can let torchao decide with `autoquant`, which profiles candidate quantization techniques on each layer and applies the fastest one:

```python
model = torchao.autoquant(torch.compile(model, mode='max-autotune'))
```
We also provide a developer-facing API so you can implement your own quantization algorithms; please use the excellent [HQQ](https://github.com/pytorch/ao/tree/main/torchao/prototype/hqq) algorithm as a motivating example.
### Evaluation
You can also use the EleutherAI [LM evaluation harness](https://github.com/EleutherAI/lm-evaluation-harness) to directly evaluate models quantized with post-training quantization, by following these steps:
1. Quantize your model with a [post training quantization strategy](#post-training-quantization).
2. Save your model to disk or upload to huggingface hub ([instructions](https://huggingface.co/docs/transformers/main/en/quantization/torchao?torchao=manual#serialization)).
3. Run the evaluation harness on the saved or uploaded model, as sketched below.
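
A sketch of steps 2 and 3 together, assuming the quantized `model` loaded earlier and lm-eval's Python API (the hub repo id is a placeholder):

```python
import lm_eval

# Step 2 (sketch): upload the quantized model; torchao checkpoints
# currently need safe_serialization=False (see the instructions above).
model.push_to_hub("your-username/model-int4wo", safe_serialization=False)

# Step 3: point the harness at the uploaded checkpoint.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=your-username/model-int4wo",
    tasks=["wikitext"],
)
print(results["results"])
```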
And for an end-to-end minimal training recipe of pretraining with float8, you can check out [torchtitan](https://github.com/pytorch/torchtitan/blob/main/docs/float8.md).