
Quantized model fails to load on Windows/Linux #271

@LukaDarsalia

Backend impacted

The PyTorch implementation

Operating system

Linux

Hardware

GPU with CUDA

Description

When trying to load the quantized Moshi models (`kyutai/moshiko-pytorch-q8` or `kyutai/moshika-pytorch-q8`), both on Windows and in a Docker Compose container, the app crashes during model initialization due to a meta-device issue:

```
RuntimeError: All input tensors need to be on the same GPU, but found some tensors to not be on a GPU:
[(torch.Size([4096, 4096]), device(type='meta'))]
```

The traceback points to `QLinear`:

```python
CB, SCB, _ = bnbF.int8_vectorwise_quant(weight.data.to(torch.float16))
```
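The failure can be reproduced outside of moshi with a bare meta tensor (a minimal sketch, assuming bitsandbytes is installed with CUDA support; the tensor shape mirrors the one in the error):

```python
import torch
import bitsandbytes.functional as bnbF

# A "meta" tensor carries only shape/dtype metadata and has no storage;
# PyTorch uses it to build models lazily before real weights are loaded.
weight = torch.empty(4096, 4096, device="meta")

# bitsandbytes checks that its inputs live on a GPU, so the meta tensor
# trips is_on_gpu() and raises the same RuntimeError as above.
CB, SCB, _ = bnbF.int8_vectorwise_quant(weight.to(torch.float16))
```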

Platforms Tested

  • Windows 10 (Anaconda, CUDA available)
  • Docker Compose (Ubuntu-based container)

Steps to Reproduce

Run `python -m moshi.server --hf-repo kyutai/moshiko-pytorch-q8 --device cuda`.

Suspected Cause

The model is first initialized with fake (meta) tensors, and the actual weights are only loaded afterward. However, `replace_linear_with_qlinear()` is already called during initialization, so it receives these meta tensors and passes them to the bitsandbytes quantization routines, which expect real tensors on a valid device (CPU/GPU), not meta tensors. An ordering that would avoid this is sketched below.
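As a sketch of a possible fix (not moshi's actual loader code; `nn.Linear` stands in for `LMModel` and the state dict is fabricated for illustration), the weights could be materialized with `load_state_dict(..., assign=True)` before any quantization happens:

```python
import torch
from torch import nn

# Build the module on the meta device, as moshi does during init,
# but defer quantization until real weights exist.
with torch.device("meta"):
    model = nn.Linear(4096, 4096)  # stands in for LMModel(...)

# Fabricated weights for illustration; in practice these come from the
# checkpoint downloaded from the Hugging Face repo.
state = {"weight": torch.randn(4096, 4096), "bias": torch.randn(4096)}

# assign=True (PyTorch >= 2.1) swaps the meta parameters for the real
# tensors instead of trying to copy into non-existent storage.
model.load_state_dict(state, assign=True)
model.to("cuda")

# Only at this point would replace_linear_with_qlinear(model) hand
# bitsandbytes real CUDA tensors instead of meta tensors.
```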

Extra information



```
[Info] loading mimi
[Info] mimi loaded
[Info] loading moshi
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/app/moshi/server.py", line 291, in <module>
    main()
  File "/app/moshi/server.py", line 237, in main
    lm = checkpoint_info.get_moshi(device=args.device, dtype=args.dtype, fuse_lora=args.fuse_lora)
  File "/app/moshi/models/loaders.py", line 258, in get_moshi
    model = get_moshi_lm(
  File "/app/moshi/models/loaders.py", line 357, in get_moshi_lm
    model = LMModel(
  File "/app/moshi/models/lm.py", line 132, in __init__
    self.transformer = StreamingTransformer(
  File "/app/moshi/modules/transformer.py", line 861, in __init__
    replace_linear_with_qlinear(self.layers[-1])
  File "/app/moshi/utils/quantize.py", line 62, in replace_linear_with_qlinear
    replace_linear_with_qlinear(child)
  File "/app/moshi/utils/quantize.py", line 62, in replace_linear_with_qlinear
    replace_linear_with_qlinear(child)
  File "/app/moshi/utils/quantize.py", line 52, in replace_linear_with_qlinear
    setattr(module, name, QLinear(child))
  File "/app/moshi/utils/quantize.py", line 25, in __init__
    CB, SCB, _ = bnbF.int8_vectorwise_quant(weight.data.to(torch.float16))  # type: ignore
  File "/usr/local/lib/python3.10/site-packages/bitsandbytes/functional.py", line 2772, in int8_vectorwise_quant
    is_on_gpu([A])
  File "/usr/local/lib/python3.10/site-packages/bitsandbytes/functional.py", line 464, in is_on_gpu
    raise RuntimeError(
RuntimeError: All input tensors need to be on the same GPU, but found some tensors to not be on a GPU:
[(torch.Size([4096, 4096]), device(type='meta'))]
```

Environment

- Operating system version: Debian GNU/Linux 12 (bookworm) (Docker)

If the backend impacted is PyTorch:
- Python version: 3.10
- PyTorch version: 2.6.0+cu124
- CUDA version (run `python -c 'import torch; print(torch.version.cuda)'`): 12.4
- GPU model and memory: NVIDIA GeForce RTX 4070, 12 GB
