Description
Hi,
I cannot load a W8A8 quantized model with quantization_status: "frozen" using Transformers' AutoModelForCausalLM.from_pretrained(). Models with quantization_status: "compressed" load successfully, but FROZEN models fail. I know vLLM would succeed in this case; I'm just wondering whether Transformers (or some other loading path) can handle it as well.
Error:
RuntimeError: Error(s) in loading state_dict for Linear:
While copying the parameter named "weight", whose dimensions in the model are torch.Size([3072, 8192]) and whose dimensions in the checkpoint are torch.Size([3072, 8192]), an exception occurred : ('Only Tensors of floating point and complex dtype can require gradients',).
Full Traceback:
Traceback (most recent call last):
File "/home/xinyiade/quan/test.py", line 36, in test_frozen_quantization_model
model = AutoModelForCausalLM.from_pretrained(model_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda/lib/python3.12/site-packages/transformers/models/auto/auto_factory.py", line 571, in from_pretrained
return model_class.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda/lib/python3.12/site-packages/transformers/modeling_utils.py", line 309, in _wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda/lib/python3.12/site-packages/transformers/modeling_utils.py", line 4574, in from_pretrained
) = cls._load_pretrained_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda/lib/python3.12/site-packages/transformers/modeling_utils.py", line 5031, in _load_pretrained_model
disk_offload_index, cpu_offload_index = _load_state_dict_into_meta_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda/lib/python3.12/site-packages/transformers/modeling_utils.py", line 843, in _load_state_dict_into_meta_model
_load_parameter_into_model(model, param_name, param.to(param_device))
File "/opt/anaconda/lib/python3.12/site-packages/transformers/modeling_utils.py", line 731, in _load_parameter_into_model
module.load_state_dict({param_type: tensor}, strict=False, assign=True)
File "/opt/anaconda/lib/python3.12/site-packages/torch/nn/modules/module.py", line 2593, in load_state_dict
raise RuntimeError(
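For what it's worth, the underlying PyTorch error reproduces in isolation: wrapping an integer tensor in an nn.Parameter with requires_grad=True (the default) raises the same message. I assume that is what happens here, since the frozen checkpoint presumably stores the quantized weight as an integer tensor and load_state_dict(..., assign=True) tries to turn it into a regular Parameter. A minimal sketch:

import torch

# An integer tensor cannot become a Parameter with requires_grad=True (the default):
torch.nn.Parameter(torch.zeros(4, 4, dtype=torch.int8))
# RuntimeError: Only Tensors of floating point and complex dtype can require gradients

# It only works with gradients disabled:
torch.nn.Parameter(torch.zeros(4, 4, dtype=torch.int8), requires_grad=False)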
Configuration Comparison:
Working COMPRESSED model:
{
  "quantization_config": {
    "quantization_status": "compressed",
    "config_groups": {
      "group_0": {
        "input_activations": {
          "dynamic": true,
          "num_bits": 8,
          "observer": null,
          "strategy": "token",
          "symmetric": true,
          "type": "int"
        }
      }
    }
  }
}
Failing FROZEN model:
{
  "quantization_config": {
    "quantization_status": "frozen",
    "config_groups": {
      "group_0": {
        "input_activations": {
          "dynamic": true,
          "num_bits": 8,
          "observer": "memoryless",
          "strategy": "token",
          "symmetric": true,
          "type": "int"
        }
      }
    }
  }
}
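In the excerpts above, the fields that differ are quantization_status ("compressed" vs "frozen") and observer (null vs "memoryless"). A small snippet to double-check this directly from the checkpoints (assuming both are local directories with a standard config.json; the paths are placeholders):

import json
from pathlib import Path

def quant_cfg(model_dir):
    # Read the quantization_config block straight from config.json.
    with open(Path(model_dir) / "config.json") as f:
        return json.load(f)["quantization_config"]

compressed = quant_cfg("path/to/compressed/w8a8/model")
frozen = quant_cfg("path/to/frozen/w8a8/model")

act = lambda cfg: cfg["config_groups"]["group_0"]["input_activations"]
print("status:  ", compressed["quantization_status"], "vs", frozen["quantization_status"])
print("observer:", act(compressed)["observer"], "vs", act(frozen)["observer"])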
Reproduction Code:
from transformers import AutoModelForCausalLM
import torch

# This fails with FROZEN models
model = AutoModelForCausalLM.from_pretrained(
    "path/to/frozen/w8a8/model",
    torch_dtype=torch.float16,
)
Issue 2: Where is A8 activation quantization implemented in the codebase?
Problem:
I'm trying to understand where and when activation quantization to INT8 actually happens in compressed-tensors. Looking at the code, a few things confuse me.
My Observations:
- In CompressedLinear.forward() from compressed_tensors/linear/compressed_linear.py:

  def forward(self, input: Tensor) -> Tensor:
      if self.quantization_status == QuantizationStatus.COMPRESSED:
          weight_data = self.compressor.decompress_module(self)
          # ...
          self.quantization_status = QuantizationStatus.FROZEN
      return linear(input, self.weight, self.bias)  # No activation quantization here?
Question: This only handles weight decompression; I don't see any activation quantization. Where does the A8 (INT8 activation) quantization actually happen?
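For reference, here is my mental model of what "dynamic, per-token, symmetric, 8-bit" activation quantization would do numerically. This is only a conceptual sketch based on the config below, not the library's actual code:

import torch

def fake_quant_per_token_int8(x: torch.Tensor) -> torch.Tensor:
    # One symmetric scale per token (per row of the activation matrix),
    # computed on the fly from the current tensor ("dynamic": no stored scales).
    scale = x.abs().amax(dim=-1, keepdim=True) / 127.0
    scale = scale.clamp(min=1e-8)                       # guard against all-zero rows
    q = torch.clamp(torch.round(x / scale), -128, 127)  # values in the INT8 range
    return q * scale                                    # dequantize back to floating point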
- The configuration specifies dynamic activation quantization:

  "input_activations": {
    "dynamic": true,
    "num_bits": 8,
    "observer": "memoryless",
    "type": "int"
  }
Question: With this configuration, when and where are activations actually quantized to INT8?
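If my sketch above is roughly right, "quantized to INT8" may only mean the values get snapped to at most 255 distinct levels per token while the tensor itself stays in floating point (quantize-dequantize). A quick check of what the sketch implies; whether compressed-tensors does the same in eager mode, or defers real INT8 to kernels, is exactly what I'm asking:

x = torch.randn(2, 8)            # toy activations: 2 tokens, hidden size 8
y = fake_quant_per_token_int8(x)
print(y.dtype)                   # still torch.float32: no int8 tensor is materialized
print((x - y).abs().max())       # small rounding error from the per-token quantization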
Specific Questions:
- Is activation quantization implemented in the forward pass? If so, in which file/function?
- Does wrap_module_forward_quantized() handle activation quantization? I see this function in compressed_tensors/quantization/lifecycle/forward.py, but I'm not sure whether it is actually used.
- Where is the "memoryless" observer implemented? How does it perform dynamic quantization?
- In a FROZEN W8A8 model, are activations actually converted to INT8 dtype during inference, or does the quantization happen at a lower level (CUDA kernels)?
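For the last question, this is how I have been trying to verify it empirically: a forward pre-hook on every Linear that records the dtype of the incoming activation (assuming the model loads at all, which is the problem above; model and tokenizer are whatever you loaded):

import torch

seen_dtypes = {}

def record_input_dtype(name):
    def hook(module, args):
        # args is the tuple of positional inputs; args[0] is the activation tensor.
        seen_dtypes[name] = args[0].dtype
    return hook

for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        module.register_forward_pre_hook(record_input_dtype(name))

with torch.no_grad():
    model(**tokenizer("hello", return_tensors="pt"))  # any dummy input

print(set(seen_dtypes.values()))  # torch.int8 would mean real INT8 activations;
                                  # fp16/fp32 would suggest fake-quant or kernel-level handling

(Caveat: if quantization happens inside a wrapped forward(), a pre-hook only sees the tensor before it is quantized, so this check may not be conclusive; that loops back to the wrap_module_forward_quantized() question above.)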
Code Locations I've Checked:
- compressed_tensors/linear/compressed_linear.py - Only weight handling
- compressed_tensors/quantization/lifecycle/forward.py - Has forward_quantize() but not sure if it's used
- compressed_tensors/quantization/observers/ - Haven't found memoryless implementation
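For the observer, this is how I searched the installed package for "memoryless", in case I am simply looking in the wrong directory (assumes a Linux/macOS environment with grep available):

import pathlib
import subprocess

import compressed_tensors

pkg_dir = pathlib.Path(compressed_tensors.__file__).parent
# Search the installed package source for the observer name.
result = subprocess.run(["grep", "-rn", "memoryless", str(pkg_dir)],
                        capture_output=True, text=True)
print(result.stdout or "no matches found")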
Could you point me to:
- The specific file/function where activation quantization to INT8 happens
- How to verify that activations are actually being quantized during inference
- The difference in activation handling between COMPRESSED and FROZEN models
- How to successfully load and run FROZEN W8A8 models
Any guidance on the intended workflow for FROZEN models and pointers to the activation quantization implementation would be greatly appreciated!
Thank you for your time and for this excellent quantization library!