
Quantized model fails to load on Windows/Linux #271

@LukaDarsalia

Backend impacted

The PyTorch implementation

Operating system

Linux

Hardware

GPU with CUDA

Description

When trying to load the quantized Moshi models (`kyutai/moshiko-pytorch-q8` or `kyutai/moshika-pytorch-q8`), both on Windows and in a Docker Compose container, the app crashes during model initialization due to a meta-device issue:

```
RuntimeError: All input tensors need to be on the same GPU, but found some tensors to not be on a GPU:
[(torch.Size([4096, 4096]), device(type='meta'))]
```

The traceback points to `QLinear`:

```python
CB, SCB, _ = bnbF.int8_vectorwise_quant(weight.data.to(torch.float16))
```
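The failure can be reproduced outside of moshi with a bare meta tensor (a minimal sketch, assuming bitsandbytes is installed with CUDA support; the tensor shape mirrors the one in the error):

```python
import torch
import bitsandbytes.functional as bnbF

# A "meta" tensor carries only shape/dtype metadata and has no storage;
# PyTorch uses it to build models lazily before real weights are loaded.
weight = torch.empty(4096, 4096, device="meta")

# bitsandbytes checks that its inputs live on a GPU, so the meta tensor
# trips is_on_gpu() and raises the same RuntimeError as above.
CB, SCB, _ = bnbF.int8_vectorwise_quant(weight.to(torch.float16))
```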

Platforms Tested

  • Windows 10 (Anaconda, CUDA available)
  • Docker Compose (Ubuntu-based container)

Steps to Reproduce

Run `python -m moshi.server --hf-repo kyutai/moshiko-pytorch-q8 --device cuda`.

Suspected Cause

The model is first initialized with fake (meta) tensors, and the actual weights are only loaded afterward. However, `replace_linear_with_qlinear()` is already called during initialization, so it receives these meta tensors and passes them to the bitsandbytes quantization routines, which expect real tensors on a valid device (CPU/GPU), not meta tensors. An ordering that would avoid this is sketched below.
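As a sketch of a possible fix (not moshi's actual loader code; `nn.Linear` stands in for `LMModel` and the state dict is fabricated for illustration), the weights could be materialized with `load_state_dict(..., assign=True)` before any quantization happens:

```python
import torch
from torch import nn

# Build the module on the meta device, as moshi does during init,
# but defer quantization until real weights exist.
with torch.device("meta"):
    model = nn.Linear(4096, 4096)  # stands in for LMModel(...)

# Fabricated weights for illustration; in practice these come from the
# checkpoint downloaded from the Hugging Face repo.
state = {"weight": torch.randn(4096, 4096), "bias": torch.randn(4096)}

# assign=True (PyTorch >= 2.1) swaps the meta parameters for the real
# tensors instead of trying to copy into non-existent storage.
model.load_state_dict(state, assign=True)
model.to("cuda")

# Only at this point would replace_linear_with_qlinear(model) hand
# bitsandbytes real CUDA tensors instead of meta tensors.
```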

Extra information



```
[Info] loading mimi
[Info] mimi loaded
[Info] loading moshi
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/app/moshi/server.py", line 291, in <module>
    main()
  File "/app/moshi/server.py", line 237, in main
    lm = checkpoint_info.get_moshi(device=args.device, dtype=args.dtype, fuse_lora=args.fuse_lora)
  File "/app/moshi/models/loaders.py", line 258, in get_moshi
    model = get_moshi_lm(
  File "/app/moshi/models/loaders.py", line 357, in get_moshi_lm
    model = LMModel(
  File "/app/moshi/models/lm.py", line 132, in __init__
    self.transformer = StreamingTransformer(
  File "/app/moshi/modules/transformer.py", line 861, in __init__
    replace_linear_with_qlinear(self.layers[-1])
  File "/app/moshi/utils/quantize.py", line 62, in replace_linear_with_qlinear
    replace_linear_with_qlinear(child)
  File "/app/moshi/utils/quantize.py", line 62, in replace_linear_with_qlinear
    replace_linear_with_qlinear(child)
  File "/app/moshi/utils/quantize.py", line 52, in replace_linear_with_qlinear
    setattr(module, name, QLinear(child))
  File "/app/moshi/utils/quantize.py", line 25, in __init__
    CB, SCB, _ = bnbF.int8_vectorwise_quant(weight.data.to(torch.float16))  # type: ignore
  File "/usr/local/lib/python3.10/site-packages/bitsandbytes/functional.py", line 2772, in int8_vectorwise_quant
    is_on_gpu([A])
  File "/usr/local/lib/python3.10/site-packages/bitsandbytes/functional.py", line 464, in is_on_gpu
    raise RuntimeError(
RuntimeError: All input tensors need to be on the same GPU, but found some tensors to not be on a GPU:
[(torch.Size([4096, 4096]), device(type='meta'))]
```

Environment

- Operating system version: Debian GNU/Linux 12 (bookworm) (Docker)

If the backend impacted is PyTorch:
- Python version: 3.10
- PyTorch version: 2.6.0+cu124
- CUDA version (run `python -c 'import torch; print(torch.version.cuda)'`): 12.4
- GPU model and memory: NVIDIA GeForce RTX 4070, 12 GB
