Add `Q2_0` and `Q2_1` quantization support to `ggml`:

- Follow the existing `Q4_0` and `Q4_1` implementations
- Implement reference scalar quantization and dequantization routines (see the sketch after this list)
- I suspect we might have to use `QK == 16` in this case to compensate for further accuracy losses
- Add SIMD support for a specific architecture - investigate the best strategy to perform the `ggml_vec_dot_q2()` computation
- No need to implement `ggml_vec_mad_q2()` - these will be deprecated soon
- Compute perplexity scores
The expected model sizes for 7B and `QK == 16` are:

- `Q2_0` - 3.2 GB

For `QK == 32` we have:

- `Q2_0` - 2.4 GB
- `Q2_1` - 3.2 GB
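As a rough sanity check on these numbers (assuming fp32 scale/min fields, as in `Q4_0`/`Q4_1`): a `Q2_0` block at `QK == 16` is 4 bytes of scale plus 16·2/8 = 4 bytes of quants, i.e. 8 bytes per 16 weights = 4 bits/weight, so ~7B weights come to roughly 3.5e9 bytes ≈ 3.2 GiB. At `QK == 32`, `Q2_0` is (4 + 8)/32 = 3 bits/weight ≈ 2.4 GiB, while `Q2_1` carries an extra fp32 minimum per block: (4 + 4 + 8)/32 = 4 bits/weight ≈ 3.2 GiB.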
Before you send me papers that show 2-bit quantization does not work - no need. I want to have this supported anyway. I have something in mind. The effort needed to add this support is so small that there is no reason not to do it.