
W8A8 quantization FROZEN model loading error + A8 activation quantization implementation #378

Closed
@AdelineXinyi

Description

Hi,

I cannot load a W8A8 quantized model with quantization_status: "frozen" using Transformers AutoModelForCausalLM.from_pretrained(). Models with quantization_status: "compressed" load successfully, but FROZEN models fail. I know vLLM succeeds in this case; I'm just wondering whether Transformers (or some other loading path) can handle it as well.

Error:
RuntimeError: Error(s) in loading state_dict for Linear:
While copying the parameter named "weight", whose dimensions in the model are torch.Size([3072, 8192]) and whose dimensions in the checkpoint are torch.Size([3072, 8192]), an exception occurred : ('Only Tensors of floating point and complex dtype can require gradients',).

Full Traceback:
Traceback (most recent call last):
File "/home/xinyiade/quan/test.py", line 36, in test_frozen_quantization_model
model = AutoModelForCausalLM.from_pretrained(model_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda/lib/python3.12/site-packages/transformers/models/auto/auto_factory.py", line 571, in from_pretrained
return model_class.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda/lib/python3.12/site-packages/transformers/modeling_utils.py", line 309, in _wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda/lib/python3.12/site-packages/transformers/modeling_utils.py", line 4574, in from_pretrained
) = cls._load_pretrained_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda/lib/python3.12/site-packages/transformers/modeling_utils.py", line 5031, in _load_pretrained_model
disk_offload_index, cpu_offload_index = _load_state_dict_into_meta_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda/lib/python3.12/site-packages/transformers/modeling_utils.py", line 843, in _load_state_dict_into_meta_model
_load_parameter_into_model(model, param_name, param.to(param_device))
File "/opt/anaconda/lib/python3.12/site-packages/transformers/modeling_utils.py", line 731, in _load_parameter_into_model
module.load_state_dict({param_type: tensor}, strict=False, assign=True)
File "/opt/anaconda/lib/python3.12/site-packages/torch/nn/modules/module.py", line 2593, in load_state_dict
raise RuntimeError(

Configuration Comparison:

Working COMPRESSED model:
{
  "quantization_config": {
    "quantization_status": "compressed",
    "config_groups": {
      "group_0": {
        "input_activations": {
          "dynamic": true,
          "num_bits": 8,
          "observer": null,
          "strategy": "token",
          "symmetric": true,
          "type": "int"
        }
      }
    }
  }
}

Failing FROZEN model:
{
  "quantization_config": {
    "quantization_status": "frozen",
    "config_groups": {
      "group_0": {
        "input_activations": {
          "dynamic": true,
          "num_bits": 8,
          "observer": "memoryless",
          "strategy": "token",
          "symmetric": true,
          "type": "int"
        }
      }
    }
  }
}

Reproduction Code:
from transformers import AutoModelForCausalLM
import torch

# This fails with FROZEN models

model = AutoModelForCausalLM.from_pretrained(
    "path/to/frozen/w8a8/model",
    torch_dtype=torch.float16
)
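
For what it's worth, the error message itself seems to come from PyTorch refusing to attach gradients to an integer tensor. Here is a minimal, standalone reproduction of that behavior (just my guess at the root cause, not a claim about what transformers does internally):

import torch

# Creating a Parameter from an int8 tensor with requires_grad=True raises the
# same "Only Tensors of floating point and complex dtype can require gradients"
# error, which is presumably what happens when the frozen int8 weight is
# assigned into a plain nn.Linear that still expects a float parameter.
int8_weight = torch.zeros(3072, 8192, dtype=torch.int8)
try:
    torch.nn.Parameter(int8_weight, requires_grad=True)
except RuntimeError as e:
    print(e)

So my guess is that the frozen checkpoint stores the quantized weight as int8 while the module it is loaded into still carries a float parameter with requires_grad, but I'd appreciate confirmation.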

Issue 2: Where is A8 activation quantization implemented in the codebase?

Problem:
I'm trying to understand where and when activation quantization to INT8 actually happens in compressed-tensors, and reading the code has left me with a few points of confusion.

My Observations:

  1. In CompressedLinear.forward() from compressed_tensors/linear/compressed_linear.py:

    def forward(self, input: Tensor) -> Tensor:
        if self.quantization_status == QuantizationStatus.COMPRESSED:
            weight_data = self.compressor.decompress_module(self)
            # ...
            self.quantization_status = QuantizationStatus.FROZEN
        return linear(input, self.weight, self.bias)  # No activation quantization here?

Question: This only handles weight decompression; I don't see any activation quantization here. Where does the A8 (INT8 activation quantization) actually happen?

  2. Configuration says dynamic activation quantization:
    "input_activations": {
    "dynamic": true,
    "num_bits": 8,
    "observer": "memoryless",
    "type": "int"
    }

Question: With this configuration, when and where are activations actually quantized to INT8? (My current mental model of this is sketched right after this list.)
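
For reference, my mental model of "dynamic", strategy "token", symmetric int8 activation quantization is the sketch below: the scale is derived per token at inference time from the activation itself (so no observer state needs to be stored), values are rounded into the int8 range, and then dequantized around the matmul. This is only my own illustration of the general technique, not code from compressed-tensors:

import torch

def fake_quant_per_token_int8(x: torch.Tensor) -> torch.Tensor:
    # Symmetric, dynamic (per-token) int8 fake-quantization sketch:
    # each token's scale is computed on the fly from its max absolute value,
    # so nothing has to be calibrated or stored ahead of time.
    x32 = x.float()
    scale = x32.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x32 / scale), -128, 127)
    return (q * scale).to(x.dtype)  # dequantized back to the original dtype

x = torch.randn(2, 4, 8, dtype=torch.float16)
print(fake_quant_per_token_int8(x).dtype)  # float16 -- int8 exists only transiently

If forward_quantize() does something along these lines, my remaining question is mainly where it gets attached to the loaded modules.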

Specific Questions:

  1. Is activation quantization implemented in the forward pass? If so, which file/function?

  2. Does wrap_module_forward_quantized() handle activation quantization? I see this function in compressed_tensors/quantization/lifecycle/forward.py, but I'm not sure whether it is actually used; a check I'm considering for this is sketched right after this list.

  3. Where is the "memoryless" observer implemented? How does it perform dynamic quantization?

  4. In a FROZEN W8A8 model, are activations actually converted to INT8 dtype during inference? Or is the quantization happening at a lower level (CUDA kernels)?
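
For question 2, the only check I could think of is the one below. It rests on my assumption that wrapping a module's forward installs an override on the module instance itself; if the wrapping is done some other way, this check proves nothing:

import torch

def forward_is_wrapped(module: torch.nn.Module) -> bool:
    # If a wrapper rebinds `forward` on the instance, the override lives in the
    # instance __dict__; an unwrapped module only has the class-level forward.
    return "forward" in vars(module)

# Usage idea after loading the model:
# for name, m in model.named_modules():
#     if isinstance(m, torch.nn.Linear) and forward_is_wrapped(m):
#         print(name, "has an instance-level forward override")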

Code Locations I've Checked:

  • compressed_tensors/linear/compressed_linear.py - only weight handling
  • compressed_tensors/quantization/lifecycle/forward.py - has forward_quantize(), but I'm not sure whether it is used
  • compressed_tensors/quantization/observers/ - I haven't found the "memoryless" observer implementation here

Could you point me to:

  1. The specific file/function where activation quantization to INT8 happens
  2. How to verify that activations are actually being quantized during inference (my own rough idea for this is sketched after this list)
  3. The difference in activation handling between COMPRESSED and FROZEN models
  4. How to successfully load and run FROZEN W8A8 models
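
For point 2, my only idea so far is a plain PyTorch hook that logs the dtypes of tensors entering each Linear layer; my assumption is that if quantization happens inside fused kernels, the hook would still report fp16 at the module boundary:

import torch

def log_linear_input_dtypes(model: torch.nn.Module) -> None:
    # Print the dtype of every tensor entering a Linear layer, to see whether
    # activations ever show up as torch.int8 at the module boundary or stay in
    # fp16 (with any int8 conversion hidden inside lower-level kernels).
    def hook(module, args):
        dtypes = [a.dtype for a in args if torch.is_tensor(a)]
        print(type(module).__name__, dtypes)
    for _, m in model.named_modules():
        if isinstance(m, torch.nn.Linear):
            m.register_forward_pre_hook(hook)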

Any guidance on the intended workflow for FROZEN models and pointers to the activation quantization implementation would be greatly appreciated!

Thank you for your time and for this excellent quantization library!
