
Commit 37d161e

[BE] Remove hf_eval.py and add documentation on using lm-eval (#2045)
1 parent 7936d0d commit 37d161e

2 files changed (+33 / -259 lines)

README.md

Lines changed: 33 additions & 6 deletions
@@ -21,13 +21,13 @@ torchao just works with `torch.compile()` and `FSDP2` over most PyTorch models o

### Post Training Quantization

-Quantizing and Sparsifying your models is a 1 liner that should work on any model with an `nn.Linear` including your favorite HuggingFace model. You can find a more comprehensive usage instructions [here](torchao/quantization/), sparsity [here](/torchao/_models/sam/README.md) and a HuggingFace inference example [here](scripts/hf_eval.py)
+Quantizing and Sparsifying your models is a 1 liner that should work on any model with an `nn.Linear` including your favorite HuggingFace model.

-For inference, we have the option of
-1. Quantize only the weights: works best for memory bound models
-2. Quantize the weights and activations: works best for compute bound models
-2. Quantize the activations and weights and sparsify the weight
+There are 2 methods of post-training quantization, shown in the code snippets below:
+1. Using torchao APIs directly.
+2. Loading a huggingface model with a quantization config.

+#### Quantizing for inference with torchao APIs
```python
from torchao.quantization.quant_api import (
    quantize_,
@@ -38,6 +38,17 @@ from torchao.quantization.quant_api import (
quantize_(m, Int4WeightOnlyConfig())
```
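As a rough end-to-end sketch of this first method applied to a HuggingFace model (the model id, dtype, and device below are illustrative choices, not prescribed by the README):

```python
# Sketch: int4 weight-only quantization of a HuggingFace model via torchao APIs.
# The model id below is a placeholder; any model built from nn.Linear layers should work.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from torchao.quantization.quant_api import quantize_, Int4WeightOnlyConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Swap every nn.Linear weight for an int4 weight-only representation, in place.
quantize_(model, Int4WeightOnlyConfig())

# Optionally compile for better inference throughput.
model = torch.compile(model, mode="max-autotune")
```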

+You can find more comprehensive usage instructions for quantization [here](torchao/quantization/) and for sparsity [here](/torchao/_models/sam/README.md).
+
+#### Quantizing for inference with huggingface configs
+
+See [docs](https://huggingface.co/docs/transformers/main/en/quantization/torchao) for more details; a minimal sketch follows the list below.
+
+For inference, we have the option of
+1. Quantize only the weights: works best for memory bound models
+2. Quantize the weights and activations: works best for compute bound models
+3. Quantize the activations and weights and sparsify the weights
+
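And a rough sketch of the second method, loading with a quantization config through transformers (the model id and group size are placeholders; see the linked docs for the supported quant types):

```python
# Sketch: let transformers apply torchao quantization at load time via TorchAoConfig.
# Model id and group_size are illustrative values.
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder
quantization_config = TorchAoConfig("int4_weight_only", group_size=128)

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```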
For gpt-fast `Int4WeightOnlyConfig()` is the best option at bs=1 as it **2x the tok/s and reduces the VRAM requirements by about 65%** over a torch.compiled baseline.

If you don't have enough VRAM to quantize your entire model on GPU and you find CPU quantization to be too slow then you can use the device argument like so `quantize_(model, Int8WeightOnlyConfig(), device="cuda")` which will send and quantize each layer individually to your GPU.
@@ -50,6 +61,22 @@ model = torchao.autoquant(torch.compile(model, mode='max-autotune'))

We also provide a developer facing API so you can implement your own quantization algorithms so please use the excellent [HQQ](https://github.com/pytorch/ao/tree/main/torchao/prototype/hqq) algorithm as a motivating example.

+### Evaluation
+
+You can also use the EleutherAI [LM evaluation harness](https://github.com/EleutherAI/lm-evaluation-harness) to directly evaluate models quantized with post training quantization, by following these steps:
+
+1. Quantize your model with a [post training quantization strategy](#post-training-quantization).
+2. Save your model to disk or upload it to the huggingface hub ([instructions](https://huggingface.co/docs/transformers/main/en/quantization/torchao?torchao=manual#serialization)); see the sketch at the end of this section.
+3. [Install](https://github.com/EleutherAI/lm-evaluation-harness?tab=readme-ov-file#install) lm-eval.
+4. Run an evaluation. Example:
+
+```bash
+lm_eval --model hf --model_args pretrained=${HF_USER}/${MODEL_ID} --tasks hellaswag --device cuda:0 --batch_size 8
+```
+
+Check out the lm-eval [usage docs](https://github.com/EleutherAI/lm-evaluation-harness?tab=readme-ov-file#basic-usage) for more details.
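A rough sketch of step 2, reusing `quantized_model` and `tokenizer` from the huggingface-config example above; the repo id is a placeholder, and per the linked serialization docs torchao checkpoints are saved without safetensors:

```python
# Sketch: persist a torchao-quantized model so lm-eval can load it by name.
# "your-hf-user/llama-3.1-8b-int4wo" is a placeholder repo id.
repo_id = "your-hf-user/llama-3.1-8b-int4wo"

# torchao quantization relies on tensor subclasses, so safetensors is disabled here.
quantized_model.push_to_hub(repo_id, safe_serialization=False)
tokenizer.push_to_hub(repo_id)

# Alternatively, keep the checkpoint local:
# quantized_model.save_pretrained("llama-3.1-8b-int4wo", safe_serialization=False)
```

Once pushed, the `lm_eval` command above can point at the checkpoint via `pretrained=your-hf-user/llama-3.1-8b-int4wo`.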
### KV Cache Quantization

We've added kv cache quantization and other features in order to enable long context length (and necessarily memory efficient) inference.
@@ -99,7 +126,7 @@ from torchao.float8 import convert_to_float8_training
convert_to_float8_training(m, module_filter_fn=...)
```
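The `module_filter_fn=...` argument is elided above; as an illustrative sketch (the toy model and skipped module name are made up for this example), the filter receives each module and its fully-qualified name and returns whether to convert it:

```python
# Sketch: convert Linear layers to float8 training, skipping selected modules by name.
import torch
from torchao.float8 import convert_to_float8_training

# Toy model: two Linear layers with dimensions divisible by 16,
# which float8 matmul kernels typically require.
m = torch.nn.Sequential(torch.nn.Linear(2048, 4096), torch.nn.Linear(4096, 128))

def module_filter_fn(mod: torch.nn.Module, fqn: str) -> bool:
    # Skip the final projection ("1" is its fully-qualified name in this Sequential).
    return fqn != "1"

convert_to_float8_training(m, module_filter_fn=module_filter_fn)
```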

-And for an end-to-end minimal training recipe of pretraining with float8, you can check out [torchtitan](https://github.com/pytorch/torchtitan/blob/main/docs/float8.md).
+And for an end-to-end minimal training recipe of pretraining with float8, you can check out [torchtitan](https://github.com/pytorch/torchtitan/blob/main/docs/float8.md).

#### Blog posts about float8 training

scripts/hf_eval.py

Lines changed: 0 additions & 253 deletions
This file was deleted.
