Description
Hi,
I cannot load a W8A8 quantized model with quantization_status: "frozen" using Transformers' AutoModelForCausalLM.from_pretrained(). Models with quantization_status: "compressed" load successfully, but FROZEN models fail. I know vLLM would succeed in this case; I'm just wondering whether Transformers (or some other loading path) can handle it as well.
Error:
RuntimeError: Error(s) in loading state_dict for Linear:
While copying the parameter named "weight", whose dimensions in the model are torch.Size([3072, 8192]) and whose dimensions in the checkpoint are torch.Size([3072, 8192]), an exception occurred : ('Only Tensors of floating point and complex dtype can require gradients',).
Full Traceback:
Traceback (most recent call last):
File "/home/xinyiade/quan/test.py", line 36, in test_frozen_quantization_model
model = AutoModelForCausalLM.from_pretrained(model_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda/lib/python3.12/site-packages/transformers/models/auto/auto_factory.py", line 571, in from_pretrained
return model_class.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda/lib/python3.12/site-packages/transformers/modeling_utils.py", line 309, in _wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda/lib/python3.12/site-packages/transformers/modeling_utils.py", line 4574, in from_pretrained
) = cls._load_pretrained_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda/lib/python3.12/site-packages/transformers/modeling_utils.py", line 5031, in _load_pretrained_model
disk_offload_index, cpu_offload_index = _load_state_dict_into_meta_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda/lib/python3.12/site-packages/transformers/modeling_utils.py", line 843, in _load_state_dict_into_meta_model
_load_parameter_into_model(model, param_name, param.to(param_device))
File "/opt/anaconda/lib/python3.12/site-packages/transformers/modeling_utils.py", line 731, in _load_parameter_into_model
module.load_state_dict({param_type: tensor}, strict=False, assign=True)
File "/opt/anaconda/lib/python3.12/site-packages/torch/nn/modules/module.py", line 2593, in load_state_dict
raise RuntimeError(
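For what it's worth, the underlying PyTorch error reproduces in isolation: wrapping an integer tensor in an nn.Parameter with requires_grad=True (the default) raises the same message. I assume that is what happens here, since the frozen checkpoint presumably stores the quantized weight as an integer tensor and load_state_dict(..., assign=True) tries to turn it into a regular Parameter. A minimal sketch:

import torch

# An integer tensor cannot become a Parameter with requires_grad=True (the default):
torch.nn.Parameter(torch.zeros(4, 4, dtype=torch.int8))
# RuntimeError: Only Tensors of floating point and complex dtype can require gradients

# It only works with gradients disabled:
torch.nn.Parameter(torch.zeros(4, 4, dtype=torch.int8), requires_grad=False)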
Configuration Comparison:
Working COMPRESSED model:
{
  "quantization_config": {
    "quantization_status": "compressed",
    "config_groups": {
      "group_0": {
        "input_activations": {
          "dynamic": true,
          "num_bits": 8,
          "observer": null,
          "strategy": "token",
          "symmetric": true,
          "type": "int"
        }
      }
    }
  }
}
Failing FROZEN model:
{
  "quantization_config": {
    "quantization_status": "frozen",
    "config_groups": {
      "group_0": {
        "input_activations": {
          "dynamic": true,
          "num_bits": 8,
          "observer": "memoryless",
          "strategy": "token",
          "symmetric": true,
          "type": "int"
        }
      }
    }
  }
}
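In the excerpts above, the fields that differ are quantization_status ("compressed" vs "frozen") and observer (null vs "memoryless"). A small snippet to double-check this directly from the checkpoints (assuming both are local directories with a standard config.json; the paths are placeholders):

import json
from pathlib import Path

def quant_cfg(model_dir):
    # Read the quantization_config block straight from config.json.
    with open(Path(model_dir) / "config.json") as f:
        return json.load(f)["quantization_config"]

compressed = quant_cfg("path/to/compressed/w8a8/model")
frozen = quant_cfg("path/to/frozen/w8a8/model")

act = lambda cfg: cfg["config_groups"]["group_0"]["input_activations"]
print("status:  ", compressed["quantization_status"], "vs", frozen["quantization_status"])
print("observer:", act(compressed)["observer"], "vs", act(frozen)["observer"])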
Reproduction Code:
from transformers import AutoModelForCausalLM
import torch

# This fails with FROZEN models
model = AutoModelForCausalLM.from_pretrained(
    "path/to/frozen/w8a8/model",
    torch_dtype=torch.float16,
)
Issue 2: Where is A8 activation quantization implemented in the codebase?
Problem:
I'm trying to understand where and when activation quantization to INT8 actually happens in compressed-tensors. Looking at the code, a few things confuse me.
My Observations:
- In CompressedLinear.forward() from compressed_tensors/linear/compressed_linear.py:

  def forward(self, input: Tensor) -> Tensor:
      if self.quantization_status == QuantizationStatus.COMPRESSED:
          weight_data = self.compressor.decompress_module(self)
          # ...
          self.quantization_status = QuantizationStatus.FROZEN
      return linear(input, self.weight, self.bias)  # No activation quantization here?
Question: This only handles weight decompression; I don't see any activation quantization. Where does the A8 (INT8 activation) quantization actually happen?
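For reference, here is my mental model of what "dynamic, per-token, symmetric, 8-bit" activation quantization would do numerically. This is only a conceptual sketch based on the config below, not the library's actual code:

import torch

def fake_quant_per_token_int8(x: torch.Tensor) -> torch.Tensor:
    # One symmetric scale per token (per row of the activation matrix),
    # computed on the fly from the current tensor ("dynamic": no stored scales).
    scale = x.abs().amax(dim=-1, keepdim=True) / 127.0
    scale = scale.clamp(min=1e-8)                       # guard against all-zero rows
    q = torch.clamp(torch.round(x / scale), -128, 127)  # values in the INT8 range
    return q * scale                                    # dequantize back to floating point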
- The configuration specifies dynamic activation quantization:

  "input_activations": {
    "dynamic": true,
    "num_bits": 8,
    "observer": "memoryless",
    "type": "int"
  }
Question: With this configuration, when and where are activations actually quantized to INT8?
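If my sketch above is roughly right, "quantized to INT8" may only mean the values get snapped to at most 255 distinct levels per token while the tensor itself stays in floating point (quantize-dequantize). A quick check of what the sketch implies; whether compressed-tensors does the same in eager mode, or defers real INT8 to kernels, is exactly what I'm asking:

x = torch.randn(2, 8)            # toy activations: 2 tokens, hidden size 8
y = fake_quant_per_token_int8(x)
print(y.dtype)                   # still torch.float32: no int8 tensor is materialized
print((x - y).abs().max())       # small rounding error from the per-token quantization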
Specific Questions:
- Is activation quantization implemented in the forward pass? If so, in which file/function?
- Does wrap_module_forward_quantized() handle activation quantization? I see this function in compressed_tensors/quantization/lifecycle/forward.py, but I'm not sure whether it is actually used.
- Where is the "memoryless" observer implemented? How does it perform dynamic quantization?
- In a FROZEN W8A8 model, are activations actually converted to INT8 dtype during inference, or does the quantization happen at a lower level (CUDA kernels)?
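For the last question, this is how I have been trying to verify it empirically: a forward pre-hook on every Linear that records the dtype of the incoming activation (assuming the model loads at all, which is the problem above; model and tokenizer are whatever you loaded):

import torch

seen_dtypes = {}

def record_input_dtype(name):
    def hook(module, args):
        # args is the tuple of positional inputs; args[0] is the activation tensor.
        seen_dtypes[name] = args[0].dtype
    return hook

for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        module.register_forward_pre_hook(record_input_dtype(name))

with torch.no_grad():
    model(**tokenizer("hello", return_tensors="pt"))  # any dummy input

print(set(seen_dtypes.values()))  # torch.int8 would mean real INT8 activations;
                                  # fp16/fp32 would suggest fake-quant or kernel-level handling

(Caveat: if quantization happens inside a wrapped forward(), a pre-hook only sees the tensor before it is quantized, so this check may not be conclusive; that loops back to the wrap_module_forward_quantized() question above.)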
Code Locations I've Checked:
- compressed_tensors/linear/compressed_linear.py - Only weight handling
- compressed_tensors/quantization/lifecycle/forward.py - Has forward_quantize() but not sure if it's used
- compressed_tensors/quantization/observers/ - Haven't found memoryless implementation
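For the observer, this is how I searched the installed package for "memoryless", in case I am simply looking in the wrong directory (assumes a Linux/macOS environment with grep available):

import pathlib
import subprocess

import compressed_tensors

pkg_dir = pathlib.Path(compressed_tensors.__file__).parent
# Search the installed package source for the observer name.
result = subprocess.run(["grep", "-rn", "memoryless", str(pkg_dir)],
                        capture_output=True, text=True)
print(result.stdout or "no matches found")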
Could you point me to:
- The specific file/function where activation quantization to INT8 happens
- How to verify that activations are actually being quantized during inference
- The difference in activation handling between COMPRESSED and FROZEN models
- How to successfully load and run FROZEN W8A8 models
Any guidance on the intended workflow for FROZEN models and pointers to the activation quantization implementation would be greatly appreciated!
Thank you for your time and for this excellent quantization library!