Description
I'm trying to load the quantized model [RedHatAI/Qwen2.5-VL-7B-Instruct-quantized.w8a8](https://huggingface.co/RedHatAI/Qwen2.5-VL-7B-Instruct-quantized.w8a8), but I'm hitting a dtype compatibility issue during model initialization. The model appears to have been quantized with llmcompressor using the W8A8 quantization scheme.
Note: I need to load this model without vLLM because I may need to add custom hooks for my research, so I'm looking for a direct loading method using transformers/llmcompressor.
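For context, the kind of hook I want to attach once the model is loaded is just a standard PyTorch forward hook (a minimal sketch; the submodule name is illustrative, not specific to this model):

# Illustrative only: capture the output of one submodule of the loaded model
captured = []

def capture_hook(module, inputs, output):
    captured.append(output)

# Pick any submodule by name; the exact path is model-specific (hypothetical here)
target = dict(model.named_modules())["model.layers.0"]
handle = target.register_forward_hook(capture_hook)
# ... run inference ...
handle.remove()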
Error Message
RuntimeError: expected a floating-point or complex dtype, but got dtype=torch.int8
Full Stack Trace:
File "/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 366, in _init_weights
module.weight.data.normal_(mean=0.0, std=std)
File "/torch/_refs/__init__.py", line 6214, in normal_
return normal(mean, std, self.shape, out=self, generator=generator)
...
RuntimeError: expected a floating-point or complex dtype, but got dtype=torch.int8
Traceback
The error occurs during model weight initialization, where transformers tries to call normal_() on int8 tensors. PyTorch's normal_() only works with floating-point tensors, but the quantized model contains int8 weights. A minimal standalone repro is sketched after the list below.
Specific failure point:
- File: modeling_qwen2_5_vl.py, line 366
- Function: _init_weights()
- Operation: module.weight.data.normal_(mean=0.0, std=std)
- Issue: trying to apply a normal distribution to int8 tensors
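The failure can be reproduced in isolation, independent of the model (a minimal sketch):

import torch

# Stand-in for a quantized weight tensor as stored in the checkpoint
w = torch.zeros(16, 16, dtype=torch.int8)

# Raises a RuntimeError: normal_ is only defined for floating-point/complex dtypes
# (the exact message depends on the dispatch path; the traceback above goes through torch/_refs)
w.normal_(mean=0.0, std=0.02)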
Model Information
Based on the model's config.json (a quick way to inspect this is sketched after the list):
- Quantization method: compressed-tensors
- Format: int-quantized
- Scheme: W8A8 (8-bit weights and activations)
- Base model: Qwen/Qwen2.5-VL-7B-Instruct
- Compression ratio: ~1.2x
- Ignored layers: all visual layers (visual.blocks.*, visual.merger.*) plus lm_head
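For reference, this metadata can be checked without loading any weights (a quick sketch; the exact keys printed depend on the export):

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("RedHatAI/Qwen2.5-VL-7B-Instruct-quantized.w8a8")
print(cfg.quantization_config)  # quant_method, format, ignore list, config_groups, ...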
What I've Tried
1. llmcompressor methods:
# Method 1: TraceableQwen2_5_VLForConditionalGeneration
from llmcompressor.transformers.tracing import TraceableQwen2_5_VLForConditionalGeneration

model_path = "RedHatAI/Qwen2.5-VL-7B-Instruct-quantized.w8a8"
model = TraceableQwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, device_map="auto", torch_dtype="auto", trust_remote_code=True
)

# Method 2: SparseAutoModelForCausalLM
from llmcompressor.transformers import SparseAutoModelForCausalLM

model = SparseAutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto", torch_dtype="auto", trust_remote_code=True
)
2. Standard transformers methods:
# Method 3: Various dtype configurations
import torch
from transformers import AutoModelForCausalLM, Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,  # also tried: torch.float16, "auto", None
    trust_remote_code=True,
    device_map="auto",
)

# Method 4: AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, torch_dtype="auto"
)
All methods fail at the same weight-initialization step, so I wonder: should the model be loaded with _fast_init=False or some other special parameter? (A sketch of what I mean follows.)
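For clarity, this is the kind of call I have in mind (assuming _fast_init is still accepted by from_pretrained in this transformers version, which I'm not sure about):

# Hypothetical: does toggling _fast_init change which weights go through _init_weights?
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
    _fast_init=False,  # unclear whether this kwarg is still honored in transformers 4.52
)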
Additional Observations
- Warning about ignored layers: The loader warns about missing visual layers, but this seems expected since they were ignored during quantization
- Model files exist: The quantized model directory contains the expected .safetensors files and configuration
- Original model works: The base Qwen/Qwen2.5-VL-7B-Instruct loads and works perfectly
Environment
- Python: 3.10
- PyTorch: 2.7.0+cu126
- Transformers: 4.52.4
- LLMCompressor: 0.6.0
- Compressed-tensors: 0.10.2
This model was likely created using llmcompressor's oneshot quantization:
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

recipe = [
    GPTQModifier(
        targets="Linear",
        scheme="W8A8",
        sequential_targets=["Qwen2_5_VLDecoderLayer"],
        ignore=["lm_head", "re:visual.*"],
    ),
]
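If that guess is right, the recipe would have been applied roughly like this (a sketch on my part; the calibration dataset and sample counts below are placeholders, not taken from the model card):

# Hypothetical oneshot invocation; calibration settings are placeholders
oneshot(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    recipe=recipe,
    dataset="open_platypus",          # placeholder calibration dataset
    max_seq_length=2048,
    num_calibration_samples=512,
)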
If this is more of an llmcompressor-specific model loading issue rather than a transformers compatibility issue, please let me know and I'll file this issue in the llmcompressor repository instead.