
Commit 1553bd4

committed
Update on "Add GPTQQuantizer"

Summary: Implement GPTQQuantizer with the unified quantizer API
Test Plan: python test/quantization/test_quant_api.py
Reviewers:
Subscribers:
Tasks:
Tags:
[ghstack-poisoned]

2 parents 215d07b + a224003, commit 1553bd4

File tree

7 files changed: +176 -91 lines changed


README.md

Lines changed: 46 additions & 73 deletions
@@ -1,8 +1,21 @@
-# torchao
+# torchao: PyTorch Architecture Optimization
 
-**Note: This repository is currently under heavy development - if you have suggestions on the API or use-cases you'd like to be covered, please open an github issue or reach out. We'd love to hear about how you're using the APIs.**
+**Note: This repository is currently under heavy development - if you have suggestions on the API or use-cases you'd like to be covered, please open a GitHub issue**
+
+The `torchao` package allows you to quantize and prune your models using native PyTorch.
+
+The repo hosts both
+1. Lower precision [dtypes](./torchao/dtypes) such as nf4, uint4
+2. Quantization [algorithms](./torchao/quantization) such as dynamic quant, smoothquant
+3. Sparsity [algorithms](./torchao/sparsity) such as Wanda
+
+## Success stories
+Our kernels have been used to achieve SOTA inference performance on
+
+1. Image segmentation models with [sam-fast](pytorch.org/blog/accelerating-generative-ai)
+2. Language models with [gpt-fast](pytorch.org/blog/accelerating-generative-ai-2)
+3. Diffusion models with [sd-fast](pytorch.org/blog/accelerating-generative-ai-3)
 
-The torchao package contains apis and workflows used to apply AO techniques like quantization and pruning to models using only native pytorch.
 
 ## Installation
 

@@ -18,43 +31,23 @@ pip install torchao
 ```Shell
 git clone https://github.com/pytorch-labs/ao
 cd ao
-python setup.py install
-```
-
-Verify Installation:
-
-```Shell
-pip list | grep torchao
-```
-
-Expected Output
-```Shell
-torchao 0.0.1 <install dir>
+pip install -e .
 ```
 
-## Usage
+## Examples
 
-Relevant APIs can be found in torchao.quantization.quant_api
-
-Note: While these techniques are designed to improve model performance, in some cases the opposite can occur.
-This is because quantization adds additional overhead to the model that is hopefully made up for by faster matmuls (dynamic quantization) or loading weights faster (weight-only quantization). If your matmuls are small enough or your non-quantized perf isn't bottlenecked by weight load time, these techniques may reduce performance.
-
-The following apis use quantized [tensor subclasses](https://pytorch.org/docs/stable/notes/extending.html#subclassing-torch-tensor). By taking a linear op/module and replacing the original weight with a q-tensor subclass, we're able to convert it into a quantized version of the op. Upon replacement, these q-tensor subclasses quantize the original weight and override the dispatch for linear ops to instead use the subclass' _quantized_op method.
-
-This tensor subclass method of quantization is preferred over older module swap based methods because it doesn't modify the graph and is generally more composable and flexible.
+Typically quantization algorithms will have different schemes for how the activation and weights are quantized, so A16W8, for instance, means the activations are quantized to 16 bits whereas the weights are quantized to 8 bits. Trying out different quantization schemes in `torchao` is generally a one-line change.
 
 ### A8W8 Dynamic Quantization
 
-The `change_linear_weights_to_int8_dqtensors` function converts the linear weights in a model to a quantized tensor subclass `Int8DynamicallyQuantizedLinearWeight`. In practice this
-converts the floating point linear matmul of the original linear op to a dynamically quantized linear matmul.
-
-Example
-
 ```Python
 import torch
 from torchao.quantization import quant_api
 
-# some user model and example input
+# Fuse the int8*int8 -> int32 matmul and subsequent mul op avoiding materialization of the int32 intermediary tensor
+torch._inductor.config.force_fuse_int_mm_with_mul = True
+
+# Plug in your model and example input
 model = torch.nn.Sequential(torch.nn.Linear(32, 64)).cuda().to(torch.bfloat16)
 input = torch.randn(32,32, dtype=torch.bfloat16, device='cuda')
 
@@ -66,78 +59,54 @@ model = torch.compile(model, mode='max-autotune')
 model(input)
 ```
 
-This technique works best when the torch._inductor.config.force_fuse_int_mm_with_mul option is enabled. This allows fusion of the int8*int8 -> int32 matmul and subsequent mul op, thereby avoiding materialization of the int32 intermediary tensor.
-
-
 ### A16W8 WeightOnly Quantization
 
-The `change_linear_weights_to_int8_woqtensors` function converts the linear weights in a model to a quantized tensor subclass `Int8WeightOnlyQuantizedLinearWeight`. In practice this
-converts the floating point linear matmul of the original linear op to a weight only quantized linear matmul
-
-Example
-
-```Python
-# some user model and example input
-...
-
-# convert linear modules to quantized linear modules
+```python
 quant_api.change_linear_weights_to_int8_woqtensors(model)
-
-# compile the model to improve performance
-...
 ```
 
 This technique works best when the torch._inductor.config.use_mixed_mm option is enabled. This avoids dequantizing the weight tensor before the matmul, instead fusing the dequantization into the matmul, thereby avoiding materialization of a large floating point weight tensor.
 
 
 ### A16W4 WeightOnly Quantization
 
-The `change_linear_weights_to_int4_woqtensors` function converts the linear weights in a model to a quantized tensor subclass `Int4WeightOnlyQuantizedLinearWeight`. In practice this
-converts the floating point linear matmul of the original linear op to a weight only quantized linear matmul
-
-Example
-
-```Python
-# some user model and example input
-...
-
-# convert linear modules to quantized linear modules
+```python
 quant_api.change_linear_weights_to_int4_woqtensors(model)
-
-# compile the model to improve performance
-...
 ```
 
-The quantization error incurred by applying int4 quantization to your model can be fairly significant, so using external techniques like GPTQ may be necessary to obtain a usable model.
-
-## Other APIs
+Note: The quantization error incurred by applying int4 quantization to your model can be fairly significant, so using external techniques like GPTQ may be necessary to obtain a usable model.
 
-### Module Swap APIs
-
-The `apply_dynamic_quant` and `apply_weight_only_int8_quant` apis can be used in the same formula as above to achieve dynamic and weight-only quantization using module swaps instead of quantized tensor subclasses.
 
 ### A8W8 Dynamic Quantization with Smoothquant
 
-We've also implemented a version of [smoothquant](https://arxiv.org/abs/2211.10438) with the same GEMM format as above.
-Due to requiring calibration, the API is slightly more complicated and currently only exists with a module swap api.
+We've also implemented a version of [smoothquant](https://arxiv.org/abs/2211.10438) with the same GEMM format as above. Due to requiring calibration, the API is more complicated.
 
 Example
 
 ```Python
 import torch
 from torchao.quantization.smoothquant import swap_linear_with_smooth_fq_linear, smooth_fq_linear_to_inference
 
-# some user model
+# Fuse the int8*int8 -> int32 matmul and subsequent mul op avoiding materialization of the int32 intermediary tensor
+torch._inductor.config.force_fuse_int_mm_with_mul = True
+
+# plug in your model
 model = get_model()
 
 # convert linear modules to smoothquant
 # linear module in calibration mode
 swap_linear_with_smooth_fq_linear(model)
 
-# calibration
-for i in range(calibration_amount):
-    input = get_input()
-    model(input)
+# Create a data loader for calibration
+calibration_data = get_calibration_data()
+calibration_dataset = MyDataset(calibration_data)
+calibration_loader = DataLoader(calibration_dataset, batch_size=32, shuffle=True)
+
+# Calibrate the model
+model.train()
+for batch in calibration_loader:
+    inputs = batch
+    model(inputs)
 
 # set it to inference mode
 smooth_fq_linear_to_inference(model)
@@ -147,7 +116,11 @@ model = torch.compile(model, mode='max-autotune')
 model(input)
 ```
 
-like the other dynamic quantization apis, the torch._inductor.config.force_fuse_int_mm_with_mul option may significantly improve performance if enabled.
+## Sharp edges
+
+1. While these techniques are designed to improve model performance, in some cases the opposite can occur. This is because quantization adds additional overhead to the model that is hopefully made up for by faster matmuls (dynamic quantization) or loading weights faster (weight-only quantization). If your matmuls are small enough or your non-quantized perf isn't bottlenecked by weight load time, these techniques may reduce performance.
+2. Use the PyTorch nightlies so you can leverage [tensor subclasses](https://pytorch.org/docs/stable/notes/extending.html#subclassing-torch-tensor), which are preferred over older module swap based methods because they don't modify the graph and are generally more composable and flexible.
+
 
 ## License
 
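For reference, here is a minimal end-to-end sketch of the A16W8 weight-only path the updated README describes, assembled only from APIs the README itself names (`quant_api.change_linear_weights_to_int8_woqtensors`, `torch._inductor.config.use_mixed_mm`, `torch.compile`); the tiny `nn.Sequential` model and input are stand-ins, not part of this commit.

```python
import torch
from torchao.quantization import quant_api

# Fuse the dequantization into the matmul instead of materializing a large
# floating point weight tensor (per the use_mixed_mm note above).
torch._inductor.config.use_mixed_mm = True

# Stand-in model and example input; any model with nn.Linear layers works.
model = torch.nn.Sequential(torch.nn.Linear(32, 64)).cuda().to(torch.bfloat16)
input = torch.randn(32, 32, dtype=torch.bfloat16, device='cuda')

# Swap the linear weights for the int8 weight-only tensor subclass.
quant_api.change_linear_weights_to_int8_woqtensors(model)

# Compile to pick up the fused kernels, then run as usual.
model = torch.compile(model, mode='max-autotune')
model(input)
```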
test/dtypes/test_uint4.py

Lines changed: 2 additions & 2 deletions
@@ -18,7 +18,7 @@
     compute_error,
 )
 from torchao.quantization.quant_api import (
-    replace_with_custom_fn_if_matches_filter,
+    _replace_with_custom_fn_if_matches_filter,
 )
 from torch.ao.quantization.observer import ObserverBase
 from torch import nn
@@ -36,7 +36,7 @@ def fn(mod):
         mod.weight = torch.nn.Parameter(PerChannelSymmetricWeightUInt4Tensor.from_float(mod.weight), requires_grad=False)
         return mod
 
-    replace_with_custom_fn_if_matches_filter(
+    _replace_with_custom_fn_if_matches_filter(
         model,
         lambda mod: fn(mod),
         lambda mod, fqn: isinstance(mod, torch.nn.Linear),
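For context on the helper being renamed above, here is a small self-contained sketch of the same call pattern the test uses; the bfloat16 weight cast is a deliberately trivial stand-in for the `PerChannelSymmetricWeightUInt4Tensor.from_float` conversion in the test body.

```python
import torch
from torchao.quantization.quant_api import _replace_with_custom_fn_if_matches_filter

# Toy model; anything with nn.Linear submodules works.
model = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 8))

def fn(mod):
    # Stand-in transformation: the uint4 test swaps in
    # PerChannelSymmetricWeightUInt4Tensor.from_float(mod.weight) here instead.
    mod.weight = torch.nn.Parameter(mod.weight.to(torch.bfloat16), requires_grad=False)
    return mod

# Walk the module tree and apply `fn` wherever the filter matches.
_replace_with_custom_fn_if_matches_filter(
    model,
    lambda mod: fn(mod),
    lambda mod, fqn: isinstance(mod, torch.nn.Linear),
)

assert model[0].weight.dtype == torch.bfloat16
```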

test/modules/test_nf4_linear.py

Lines changed: 68 additions & 13 deletions
@@ -4,9 +4,10 @@
 import torch
 from torch import nn
 from torch.testing._internal.common_utils import TestCase
-from torchao.dtypes.nf4tensor import linear_nf4, NF4Tensor
+from torchao.dtypes.nf4tensor import linear_nf4, NF4Tensor, to_nf4
 import torch.nn.functional as F
-
+import io
+from collections import OrderedDict
 
 bnb_available = False
 
@@ -44,11 +45,19 @@ def _build_bnb_linear(input_weight, device):
 
 
 class TestNF4Linear(TestCase):
+    class TestMod(nn.Module):
+        def __init__(self, tensor, block_size, scaler_block_size):
+            super().__init__()
+            self.param = torch.nn.Parameter(to_nf4(tensor, block_size, scaler_block_size))
+
+    def save_state_dict_to_buffer(self, state_dict: OrderedDict):
+        buffer = io.BytesIO()
+        torch.save(state_dict, buffer)
+        buffer.seek(0)
+        return buffer
 
     def test_register_nf4_as_param(self):
-        nf4_tensor = NF4Tensor.from_tensor(
-            inpt_tensor=torch.randn(512, 512, dtype=torch.bfloat16)
-        )
+        nf4_tensor = to_nf4(torch.randn(512, 512, dtype=torch.bfloat16))
 
         # Would raise if nn.Parameter registration fails, such as no detach()
         # impl when calling __torch_dispatch__
@@ -58,18 +67,14 @@ def test_register_nf4_as_param(self):
     def test_output_bf16(self):
         # Test to ensure W4 A16 produces A16
         inp = torch.randn(2, 512, dtype=torch.bfloat16, requires_grad=True)
-        nf4_tensor = NF4Tensor.from_tensor(
-            inpt_tensor=torch.randn(512, 512, dtype=torch.bfloat16)
-        )
+        nf4_tensor = to_nf4(torch.randn(512, 512, dtype=torch.bfloat16))
         out = linear_nf4(input=inp, weight=nf4_tensor)
         assert out.dtype == torch.bfloat16
 
     def test_backward_bf16(self):
         # Test to ensure backward pass gives activation a bf16 gradient and no gradient
         # to the linear's weight, as it is frozen.
-        nf4_tensor = NF4Tensor.from_tensor(
-            inpt_tensor=torch.randn(512, 512, dtype=torch.bfloat16)
-        )
+        nf4_tensor = to_nf4(torch.randn(512, 512, dtype=torch.bfloat16))
         inp = torch.randn(2, 512, dtype=torch.bfloat16, requires_grad=True)
         linear_nf4(inp, nf4_tensor).sum().backward()
         assert inp.grad is not None and inp.grad.dtype == torch.bfloat16
@@ -83,7 +88,7 @@ def test_reconstruction_qlora_vs_bnb(self):
         device = "cuda"
         embed_dim = 512
         input_weight = _build_input_weight(embed_dim, device)
-        nf4_weight = NF4Tensor.from_tensor(input_weight)
+        nf4_weight = to_nf4(input_weight)
         bnb_linear = _build_bnb_linear(input_weight, device)
         bnb_reconstruction = bnb_linear(
             torch.eye(embed_dim, embed_dim, dtype=torch.bfloat16, device=device)
@@ -107,7 +112,7 @@ def test_nf4_bnb_linear(self):
         dim = 512
         device = "cuda"
         input_weight = _build_input_weight(dim, device)
-        nf4_weight = NF4Tensor.from_tensor(input_weight)
+        nf4_weight = to_nf4(input_weight)
         bnb_linear = _build_bnb_linear(input_weight, device)
 
         inp = torch.randn(2, 512, dtype=torch.bfloat16, device="cuda")
@@ -121,6 +126,56 @@ def test_nf4_bnb_linear(self):
         assert err_native < 0.5 * dim
         assert err_bnb < 0.5 * dim
 
+    @unittest.skipIf(not torch.cuda.is_available(), "Need cuda for test")
+    def test_load_from_bfloat16(self):
+        """Tests loading to and from different module state dicts"""
+        inpt_tensor = torch.rand(64, device='cuda', dtype=torch.bfloat16)
+        base_mod = self.TestMod(inpt_tensor, 32, 2)
+
+        bf16_dummy_dict = {"param": inpt_tensor}
+        base_mod.load_state_dict(bf16_dummy_dict)
+
+        assert base_mod.param.block_size == 32
+        assert base_mod.param.scaler_block_size == 2
+
+    @unittest.skipIf(not torch.cuda.is_available(), "Need cuda for test")
+    def test_load_from_nf4_same_meta(self):
+        """Tests loading to and from different module state dicts"""
+        inpt_tensor = torch.rand(64, device='cuda', dtype=torch.bfloat16)
+        base_mod = self.TestMod(inpt_tensor, 32, 2)
+        state_dict = base_mod.state_dict()
+        saved_state_dict = self.save_state_dict_to_buffer(state_dict)
+
+        other_mod = self.TestMod(inpt_tensor, 32, 2)
+        other_mod.load_state_dict(torch.load(saved_state_dict))
+        assert other_mod.param.block_size == 32
+        assert other_mod.param.scaler_block_size == 2
+
+    @unittest.skipIf(not torch.cuda.is_available(), "Need cuda for test")
+    def test_load_from_nf4_diff_meta(self):
+        """Tests loading to and from different module state dicts"""
+        inpt_tensor = torch.rand(128, device='cuda', dtype=torch.bfloat16)
+        base_mod = self.TestMod(inpt_tensor, 32, 2)
+        state_dict = base_mod.state_dict()
+        saved_state_dict = self.save_state_dict_to_buffer(state_dict)
+
+        other_mod = self.TestMod(inpt_tensor, 64, 1)
+        other_mod.load_state_dict(torch.load(saved_state_dict))
+        assert other_mod.param.block_size == 64
+        assert other_mod.param.scaler_block_size == 1
+
+    def test_to_copy(self):
+        inpt_tensor = torch.rand(128, device='cpu')
+        inpt_tensor_nf4 = to_nf4(inpt_tensor, 32, 2)
+        inpt_tensor_bfloat16 = inpt_tensor_nf4.to(torch.bfloat16)
+        torch.testing.assert_allclose(inpt_tensor, inpt_tensor_bfloat16, atol=0.13, rtol=0.13)
+
+        if torch.cuda.is_available():
+            inpt_tensor = torch.rand(128, device='cuda')
+            inpt_tensor_nf4 = to_nf4(inpt_tensor, 32, 2)
+            inpt_tensor_bfloat16 = inpt_tensor_nf4.to(torch.bfloat16)
+            torch.testing.assert_allclose(inpt_tensor, inpt_tensor_bfloat16, atol=0.13, rtol=0.13)
+
 
 if __name__ == "__main__":
     unittest.main()
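Taken together, the new tests sketch the intended `to_nf4` workflow; the snippet below simply restates the calls exercised above in one place (shapes, dtypes, and assertions copied from the tests).

```python
import torch
from torch import nn
from torchao.dtypes import to_nf4
from torchao.dtypes.nf4tensor import linear_nf4

# Quantize a bf16 weight to NF4, as test_register_nf4_as_param does.
nf4_tensor = to_nf4(torch.randn(512, 512, dtype=torch.bfloat16))

# It registers cleanly as a frozen nn.Parameter (this is what TestMod relies on).
param = nn.Parameter(nf4_tensor, requires_grad=False)

# W4 A16: a bf16 activation against the NF4 weight stays bf16 ...
inp = torch.randn(2, 512, dtype=torch.bfloat16, requires_grad=True)
out = linear_nf4(input=inp, weight=nf4_tensor)
assert out.dtype == torch.bfloat16

# ... and the backward pass only reaches the activation, since the weight is frozen.
out.sum().backward()
assert inp.grad is not None and inp.grad.dtype == torch.bfloat16
```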

torchao/__init__.py

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+from . import dtypes
+
+__all__ = [
+    "dtypes"
+]

torchao/dtypes/__init__.py

Lines changed: 3 additions & 0 deletions
@@ -1,5 +1,8 @@
+from .nf4tensor import NF4Tensor, to_nf4
 from .uint4 import UInt4Tensor
 
 __all__ = [
+    "NF4Tensor",
+    "to_nf4",
     "UInt4Tensor"
 ]
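With the two `__init__.py` additions above, the new dtypes are importable from the package namespace; a quick sanity check, as a sketch assuming the torchao build from this commit is installed:

```python
import torch
import torchao
from torchao.dtypes import NF4Tensor, UInt4Tensor, to_nf4  # all three are now exported

# `dtypes` is re-exported at the package root by torchao/__init__.py ...
assert "dtypes" in torchao.__all__

# ... and the NF4 helpers round-trip a tensor, mirroring the tests above.
t = torch.randn(512, 512, dtype=torch.bfloat16)
assert isinstance(to_nf4(t), NF4Tensor)
```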

0 commit comments
