### Post Training Quantization
Quantizing and sparsifying your models is a one-liner that should work on any model with an `nn.Linear`, including your favorite HuggingFace model.

There are 2 methods of post-training quantization, shown in the code snippets below:

1. Using torchao APIs directly.
2. Loading a huggingface model with a quantization config.

#### Quantizing for inference with torchao APIs
```python
import torch
from torchao.quantization.quant_api import (
    quantize_,
    Int4WeightOnlyConfig,
)

# A toy model; any model with nn.Linear layers works. The int4
# weight-only kernels target bfloat16 weights on CUDA.
m = torch.nn.Sequential(torch.nn.Linear(32, 32)).to(torch.bfloat16).cuda()

quantize_(m, Int4WeightOnlyConfig())
```
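
As a quick sanity check, assuming the toy `m` from the snippet above, the quantized module still runs like any other `nn.Module`:

```python
x = torch.randn(1, 32, dtype=torch.bfloat16, device="cuda")
with torch.no_grad():
    print(m(x).shape)  # torch.Size([1, 32])
```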
You can find more comprehensive usage instructions for quantization [here](torchao/quantization/) and for sparsity [here](/torchao/_models/sam/README.md).
#### Quantizing for inference with huggingface configs
See [docs](https://huggingface.co/docs/transformers/main/en/quantization/torchao) for more details.
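
A minimal sketch of that flow, assuming a recent `transformers` release with the torchao integration (the model id below is a placeholder, substitute your own):

```python
from transformers import AutoModelForCausalLM, TorchAoConfig
from torchao.quantization import Int4WeightOnlyConfig

# Placeholder checkpoint; any causal LM you have access to works.
model_id = "meta-llama/Llama-3.1-8B-Instruct"

# Wrap a torchao config in transformers' TorchAoConfig and load as usual.
quantization_config = TorchAoConfig(quant_type=Int4WeightOnlyConfig(group_size=128))
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    quantization_config=quantization_config,
)
```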
For inference, we have the following options (the first two are sketched in code below):

1. Quantize only the weights: works best for memory-bound models
2. Quantize the weights and activations: works best for compute-bound models
3. Quantize the activations and weights and sparsify the weights
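
A rough sketch of the first two options, assuming the config names currently exported from `torchao.quantization` (the toy models stand in for your own); for the third, see the sparsity README linked above:

```python
import torch
from torchao.quantization import (
    quantize_,
    Int8WeightOnlyConfig,
    Int8DynamicActivationInt8WeightConfig,
)

model_wo = torch.nn.Sequential(torch.nn.Linear(64, 64))
model_dq = torch.nn.Sequential(torch.nn.Linear(64, 64))

# Option 1: weight-only quantization, best when memory bound.
quantize_(model_wo, Int8WeightOnlyConfig())

# Option 2: quantize weights and dynamically quantize activations,
# best when compute bound.
quantize_(model_dq, Int8DynamicActivationInt8WeightConfig())
```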
For gpt-fast, `Int4WeightOnlyConfig()` is the best option at bs=1 as it **doubles the tok/s and reduces the VRAM requirements by about 65%** over a torch.compiled baseline.

If you don't have enough VRAM to quantize your entire model on GPU, and you find CPU quantization too slow, you can use the device argument like so: `quantize_(model, Int8WeightOnlyConfig(), device="cuda")`, which will send and quantize each layer individually to your GPU.

If you're unsure which option to pick, you can let torchao decide with `autoquant`, which profiles candidate quantization techniques on each layer and applies the fastest one:

```python
model = torchao.autoquant(torch.compile(model, mode='max-autotune'))
```
We also provide a developer-facing API so you can implement your own quantization algorithms; please use the excellent [HQQ](https://github.com/pytorch/ao/tree/main/torchao/prototype/hqq) algorithm as a motivating example.
### Evaluation
You can also use the EleutherAI [LM evaluation harness](https://github.com/EleutherAI/lm-evaluation-harness) to directly evaluate models quantized with post-training quantization, by following these steps:
1. Quantize your model with a [post training quantization strategy](#post-training-quantization).
2. Save your model to disk or upload to huggingface hub ([instructions](https://huggingface.co/docs/transformers/main/en/quantization/torchao?torchao=manual#serialization)).
3. Run the evaluation harness on the saved or uploaded model, as sketched below.
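
A sketch of steps 2 and 3 together, assuming the quantized `model` loaded earlier and lm-eval's Python API (the hub repo id is a placeholder):

```python
import lm_eval

# Step 2 (sketch): upload the quantized model; torchao checkpoints
# currently need safe_serialization=False (see the instructions above).
model.push_to_hub("your-username/model-int4wo", safe_serialization=False)

# Step 3: point the harness at the uploaded checkpoint.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=your-username/model-int4wo",
    tasks=["wikitext"],
)
print(results["results"])
```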
And for an end-to-end minimal training recipe of pretraining with float8, you can check out [torchtitan](https://github.com/pytorch/torchtitan/blob/main/docs/float8.md).