Description
I'm trying to load the quantized model [RedHatAI/Qwen2.5-VL-7B-Instruct-quantized.w8a8](https://huggingface.co/RedHatAI/Qwen2.5-VL-7B-Instruct-quantized.w8a8), but I'm hitting a dtype compatibility issue during model initialization. The model appears to have been quantized with llmcompressor using the W8A8 quantization scheme.
Note: I need to load this model without vLLM because I may need to add custom hooks for my research, so I'm looking for a direct loading method using transformers/llmcompressor.
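For context, the kind of hook I want to attach once the model is loaded is just a standard PyTorch forward hook (a minimal sketch; the submodule name is illustrative, not specific to this model):

# Illustrative only: capture the output of one submodule of the loaded model
captured = []

def capture_hook(module, inputs, output):
    captured.append(output)

# Pick any submodule by name; the exact path is model-specific (hypothetical here)
target = dict(model.named_modules())["model.layers.0"]
handle = target.register_forward_hook(capture_hook)
# ... run inference ...
handle.remove()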
Error Message
RuntimeError: expected a floating-point or complex dtype, but got dtype=torch.int8
Full Stack Trace:
File "/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 366, in _init_weights
module.weight.data.normal_(mean=0.0, std=std)
File "/torch/_refs/__init__.py", line 6214, in normal_
return normal(mean, std, self.shape, out=self, generator=generator)
...
RuntimeError: expected a floating-point or complex dtype, but got dtype=torch.int8
Traceback
The error occurs during model weight initialization, where transformers tries to call normal_() on int8 tensors. PyTorch's normal_() only works with floating-point tensors, but the quantized model contains int8 weights. A minimal standalone repro is sketched after the list below.
Specific failure point:
- File: modeling_qwen2_5_vl.py, line 366
- Function: _init_weights()
- Operation: module.weight.data.normal_(mean=0.0, std=std)
- Issue: trying to apply a normal distribution to int8 tensors
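The failure can be reproduced in isolation, independent of the model (a minimal sketch):

import torch

# Stand-in for a quantized weight tensor as stored in the checkpoint
w = torch.zeros(16, 16, dtype=torch.int8)

# Raises a RuntimeError: normal_ is only defined for floating-point/complex dtypes
# (the exact message depends on the dispatch path; the traceback above goes through torch/_refs)
w.normal_(mean=0.0, std=0.02)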
Model Information
Based on the model's config.json (a quick way to inspect this is sketched after the list):
- Quantization method: compressed-tensors
- Format: int-quantized
- Scheme: W8A8 (8-bit weights and activations)
- Base model: Qwen/Qwen2.5-VL-7B-Instruct
- Compression ratio: ~1.2x
- Ignored layers: all visual layers (visual.blocks.*, visual.merger.*) plus lm_head
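For reference, this metadata can be checked without loading any weights (a quick sketch; the exact keys printed depend on the export):

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("RedHatAI/Qwen2.5-VL-7B-Instruct-quantized.w8a8")
print(cfg.quantization_config)  # quant_method, format, ignore list, config_groups, ...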
What I've Tried
1. llmcompressor methods:
# Method 1: TraceableQwen2_5_VLForConditionalGeneration
from llmcompressor.transformers.tracing import TraceableQwen2_5_VLForConditionalGeneration

model_path = "RedHatAI/Qwen2.5-VL-7B-Instruct-quantized.w8a8"
model = TraceableQwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, device_map="auto", torch_dtype="auto", trust_remote_code=True
)

# Method 2: SparseAutoModelForCausalLM
from llmcompressor.transformers import SparseAutoModelForCausalLM

model = SparseAutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto", torch_dtype="auto", trust_remote_code=True
)
2. Standard transformers methods:
# Method 3: Various dtype configurations
import torch
from transformers import AutoModelForCausalLM, Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,  # also tried: torch.float16, "auto", None
    trust_remote_code=True,
    device_map="auto",
)

# Method 4: AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, torch_dtype="auto"
)
All methods fail at the same weight-initialization step, so I wonder: should the model be loaded with _fast_init=False or some other special parameter? (A sketch of what I mean follows.)
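For clarity, this is the kind of call I have in mind (assuming _fast_init is still accepted by from_pretrained in this transformers version, which I'm not sure about):

# Hypothetical: does toggling _fast_init change which weights go through _init_weights?
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
    _fast_init=False,  # unclear whether this kwarg is still honored in transformers 4.52
)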
Additional Observations
- Warning about ignored layers: The loader warns about missing visual layers, but this seems expected since they were ignored during quantization
- Model files exist: The quantized model directory contains the expected .safetensors files and configuration
- Original model works: The base Qwen/Qwen2.5-VL-7B-Instruct loads and works perfectly
Environment
- Python: 3.10
- PyTorch: 2.7.0+cu126
- Transformers: 4.52.4
- LLMCompressor: 0.6.0
- Compressed-tensors: 0.10.2
This model was likely created using llmcompressor's oneshot quantization:
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

recipe = [
    GPTQModifier(
        targets="Linear",
        scheme="W8A8",
        sequential_targets=["Qwen2_5_VLDecoderLayer"],
        ignore=["lm_head", "re:visual.*"],
    ),
]
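If that guess is right, the recipe would have been applied roughly like this (a sketch on my part; the calibration dataset and sample counts below are placeholders, not taken from the model card):

# Hypothetical oneshot invocation; calibration settings are placeholders
oneshot(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    recipe=recipe,
    dataset="open_platypus",          # placeholder calibration dataset
    max_seq_length=2048,
    num_calibration_samples=512,
)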
If this is more of an llmcompressor-specific model loading issue rather than a transformers compatibility issue, please let me know and I'll file this issue in the llmcompressor repository instead.