Support Qwen VL visual projector's 4-bit quantization, not only fp16 #11408
Closed · samkoesnadi started this conversation in Ideas · Replies: 2 comments, 1 reply
I have been experimenting with llama.cpp's VLM support, currently building on the Qwen2-VL model for my project.

The text decoder runs very fast on my machine because I use 4-bit quantization. The visual projector, however, supports only two precisions: fp16 and fp32. How difficult would it be to implement 4-bit quantization for the visual projector as well?

As a reference, I run the model on a Redmi Note 13 Pro. The visual projector runs at around 3 tokens per second, whereas the text decoder runs at 7 tokens/s, so the gap is significant. Since performance is crucial on mobile, having the visual projector quantized would be very helpful.
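To make the ask concrete, here is a minimal sketch of the per-tensor conversion this would involve, using ggml's generic block-quantization helper. This assumes the current ggml API (`ggml_quantize_chunk`, `ggml_row_size`, `ggml_blck_size`), and the tensor shape below is made up for illustration, not Qwen2-VL's actual projector dimensions:

```cpp
// Sketch: quantize an fp32 weight matrix to Q4_0 with ggml's block
// quantizer, the same primitive llama.cpp uses when writing quantized
// tensors. Build against ggml (e.g. inside the llama.cpp tree).
#include "ggml.h"

#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    // Hypothetical projector-sized tensor: 1280 rows of 5120 floats.
    const int64_t n_per_row = 5120;
    const int64_t nrows     = 1280;

    std::vector<float> src((size_t) n_per_row * nrows);
    for (size_t i = 0; i < src.size(); ++i) {
        src[i] = std::sin((float) i * 0.001f); // dummy weights
    }

    // Q4_0 quantizes in blocks, so the row length must be a multiple
    // of the block size (32 for Q4_0).
    if (n_per_row % ggml_blck_size(GGML_TYPE_Q4_0) != 0) {
        fprintf(stderr, "row size must be a multiple of the Q4_0 block size\n");
        return 1;
    }

    // Destination buffer: ggml_row_size gives the quantized bytes per row.
    const size_t row_bytes = ggml_row_size(GGML_TYPE_Q4_0, n_per_row);
    std::vector<uint8_t> dst(row_bytes * nrows);

    // Quantize all rows in one chunk; the last argument is an optional
    // importance matrix, not used here.
    const size_t written = ggml_quantize_chunk(
        GGML_TYPE_Q4_0, src.data(), dst.data(),
        /*start =*/ 0, nrows, n_per_row, /*imatrix =*/ nullptr);

    printf("fp32: %zu bytes -> Q4_0: %zu bytes (%.1fx smaller)\n",
           src.size() * sizeof(float), written,
           (double) (src.size() * sizeof(float)) / (double) written);
    return 0;
}
```

A projector quantizer would presumably loop over the mmproj tensors and apply this per tensor, keeping small tensors such as norms and biases in fp32, similar to what llama.cpp's main quantizer does for the text model.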
Replies:

- No interest?
- In case anyone needs it: #11644