Backend impacted
The PyTorch implementation
Operating system
Linux
Hardware
GPU with CUDA
Description
When trying to load the quantized Moshi models (kyutai/moshiko-pytorch-q8 or kyutai/moshika-pytorch-q8) on both Windows and Docker Compose, the app crashes during model initialization due to a meta device issue:
RuntimeError: All input tensors need to be on the same GPU, but found some tensors to not be on a GPU:
[(torch.Size([4096, 4096]), device(type='meta'))]
The traceback points to QLinear:
CB, SCB, _ = bnbF.int8_vectorwise_quant(weight.data.to(torch.float16))
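The failure can be reproduced outside the server with a few lines (a minimal sketch, assuming bitsandbytes is installed with a CUDA build; the shapes mirror the error above):

```python
import torch
import bitsandbytes.functional as bnbF

# Build a Linear on the meta device, as PyTorch does for deferred init.
linear = torch.nn.Linear(4096, 4096, bias=False, device="meta")

# .to(torch.float16) does not materialize a meta tensor, so bitsandbytes
# receives a tensor with device(type='meta') and raises the same error.
CB, SCB, _ = bnbF.int8_vectorwise_quant(linear.weight.data.to(torch.float16))
```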
Platforms Tested
- Windows 10 (Anaconda, CUDA available)
- Docker Compose (Ubuntu-based container)
Steps to Reproduce
Run python -m moshi.server --hf-repo kyutai/moshiko-pytorch-q8 --device cuda.
Suspected Cause
The model is first initialized with fake (meta) tensors, and the actual weights are loaded afterwards. However, replace_linear_with_qlinear() is called during this initialization, so it receives the meta tensors and passes them to the bitsandbytes quantization routines, which expect real tensors on an actual device (CPU/GPU), not on meta.
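If that diagnosis is right, one possible direction for a fix is to materialize the module before quantizing. The sketch below is hypothetical, not the project's actual fix; it only uses PyTorch's standard meta-device loading flow (to_empty plus load_state_dict with assign=True), and imports replace_linear_with_qlinear from the path shown in the traceback:

```python
import torch
from moshi.utils.quantize import replace_linear_with_qlinear

def materialize_then_quantize(module: torch.nn.Module,
                              state_dict: dict,
                              device: str = "cuda") -> None:
    """Hypothetical helper: ensure weights are real before quantizing."""
    # Allocate real (uninitialized) storage for every meta parameter.
    module.to_empty(device=device)
    # Fill in the actual weights; assign=True keeps the loaded tensors.
    module.load_state_dict(state_dict, assign=True)
    # Only now hand the module to the quantization pass.
    replace_linear_with_qlinear(module)
```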
Extra information
[Info] loading mimi
[Info] mimi loaded
[Info] loading moshi
Traceback (most recent call last):
File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/app/moshi/server.py", line 291, in <module>
main()
File "/app/moshi/server.py", line 237, in main
lm = checkpoint_info.get_moshi(device=args.device, dtype=args.dtype, fuse_lora=args.fuse_lora)
File "/app/moshi/models/loaders.py", line 258, in get_moshi
model = get_moshi_lm(
File "/app/moshi/models/loaders.py", line 357, in get_moshi_lm
model = LMModel(
File "/app/moshi/models/lm.py", line 132, in __init__
self.transformer = StreamingTransformer(
File "/app/moshi/modules/transformer.py", line 861, in __init__
replace_linear_with_qlinear(self.layers[-1])
File "/app/moshi/utils/quantize.py", line 62, in replace_linear_with_qlinear
replace_linear_with_qlinear(child)
File "/app/moshi/utils/quantize.py", line 62, in replace_linear_with_qlinear
replace_linear_with_qlinear(child)
File "/app/moshi/utils/quantize.py", line 52, in replace_linear_with_qlinear
setattr(module, name, QLinear(child))
File "/app/moshi/utils/quantize.py", line 25, in __init__
CB, SCB, _ = bnbF.int8_vectorwise_quant(weight.data.to(torch.float16)) # type: ignore
File "/usr/local/lib/python3.10/site-packages/bitsandbytes/functional.py", line 2772, in int8_vectorwise_quant
is_on_gpu([A])
File "/usr/local/lib/python3.10/site-packages/bitsandbytes/functional.py", line 464, in is_on_gpu
raise RuntimeError(
RuntimeError: All input tensors need to be on the same GPU, but found some tensors to not be on a GPU:
[(torch.Size([4096, 4096]), device(type='meta'))]
Environment
- Operating system version: Debian GNU/Linux 12 (bookworm) (Docker)
If the backend impacted is PyTorch:
- Python version: 3.10
- PyTorch version: 2.6.0+cu124
- CUDA version (run `python -c 'import torch; print(torch.version.cuda)'`): 12.4
- GPU model and memory: NVIDIA GeForce RTX 4070, 12 GB