Support Qwen VL visual projector's 4-bit quantization, not only fp16 #11408
Closed · samkoesnadi started this conversation in Ideas · Replies: 2 comments, 1 reply
I have been experimenting with llama.cpp's VLM support, currently building on the Qwen2-VL model for my project.

The text decoder runs very fast on my machine because I use 4-bit quantization. The visual projector, however, supports only two precisions: fp16 and fp32. How difficult would it be to implement 4-bit quantization for the visual projector as well?

As a reference, I run the model on a Redmi Note 13 Pro. The visual projector runs at around 3 tokens per second, whereas the text decoder runs at 7 tokens/s, so the gap is significant. Since performance is crucial on mobile, having the visual projector quantized would be very helpful.
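To make the ask concrete, here is a minimal sketch of the per-tensor conversion this would involve, using ggml's generic block-quantization helper. This assumes the current ggml API (`ggml_quantize_chunk`, `ggml_row_size`, `ggml_blck_size`), and the tensor shape below is made up for illustration, not Qwen2-VL's actual projector dimensions:

```cpp
// Sketch: quantize an fp32 weight matrix to Q4_0 with ggml's block
// quantizer, the same primitive llama.cpp uses when writing quantized
// tensors. Build against ggml (e.g. inside the llama.cpp tree).
#include "ggml.h"

#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    // Hypothetical projector-sized tensor: 1280 rows of 5120 floats.
    const int64_t n_per_row = 5120;
    const int64_t nrows     = 1280;

    std::vector<float> src((size_t) n_per_row * nrows);
    for (size_t i = 0; i < src.size(); ++i) {
        src[i] = std::sin((float) i * 0.001f); // dummy weights
    }

    // Q4_0 quantizes in blocks, so the row length must be a multiple
    // of the block size (32 for Q4_0).
    if (n_per_row % ggml_blck_size(GGML_TYPE_Q4_0) != 0) {
        fprintf(stderr, "row size must be a multiple of the Q4_0 block size\n");
        return 1;
    }

    // Destination buffer: ggml_row_size gives the quantized bytes per row.
    const size_t row_bytes = ggml_row_size(GGML_TYPE_Q4_0, n_per_row);
    std::vector<uint8_t> dst(row_bytes * nrows);

    // Quantize all rows in one chunk; the last argument is an optional
    // importance matrix, not used here.
    const size_t written = ggml_quantize_chunk(
        GGML_TYPE_Q4_0, src.data(), dst.data(),
        /*start =*/ 0, nrows, n_per_row, /*imatrix =*/ nullptr);

    printf("fp32: %zu bytes -> Q4_0: %zu bytes (%.1fx smaller)\n",
           src.size() * sizeof(float), written,
           (double) (src.size() * sizeof(float)) / (double) written);
    return 0;
}
```

A projector quantizer would presumably loop over the mmproj tensors and apply this per tensor, keeping small tensors such as norms and biases in fp32, similar to what llama.cpp's main quantizer does for the text model.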
Replies:

- No interest?
- In case anyone needs it: #11644