Description
Feature request
This is a feature request to add a new 8-bit quantization method called Product Quantization with Residuals (PQ-R) to the bitsandbytes library.
What is PQ-R?
PQ-R is a hybrid quantization algorithm that combines the strengths of K-Means clustering and residual correction. It consistently delivers a much higher signal-to-noise ratio (SNR) and reconstruction quality than standard `int8_linear` methods, making it ideal for users who need to minimize quality loss on CPU or edge devices.
Key Benefits:
- High Fidelity: Achieves ~3x higher SNR than standard INT8 on challenging model layers.
- Practical Performance: It is over 20x faster than pure K-Means, making it a viable and efficient tool for developers.
- Proven: The method has been rigorously benchmarked on the TinyLlama 1.1B model, demonstrating its ability to compress the model to ~1 GB while maintaining high quality.
This feature would give bitsandbytes users a powerful new quantization option that bridges the gap between fast-but-low-quality `int8` and high-quality-but-impractical K-Means.
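To make the idea above concrete, here is a minimal, self-contained sketch of product quantization with an int8 residual pass, built on scikit-learn and NumPy as the issue mentions. This is an illustration only, not the author's proprietary `pqr_core.py`; the function names `pqr_quantize`/`pqr_reconstruct` and all parameter choices are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def pqr_quantize(W, n_subspaces=4, n_centers=256, seed=0):
    """Illustrative PQ-with-residual sketch (not the author's pqr_core.py):
    a per-subspace K-Means codebook plus an absmax-scaled int8 correction
    of the leftover reconstruction error."""
    rows, cols = W.shape
    assert cols % n_subspaces == 0, "columns must split evenly into subspaces"
    d = cols // n_subspaces
    codebooks, codes = [], []
    recon = np.zeros_like(W, dtype=np.float64)
    for s in range(n_subspaces):
        block = W[:, s * d:(s + 1) * d]
        km = KMeans(n_clusters=n_centers, n_init=1, random_state=seed).fit(block)
        codebooks.append(km.cluster_centers_)       # (n_centers, d) per subspace
        codes.append(km.labels_.astype(np.uint8))   # one 8-bit code per row
        recon[:, s * d:(s + 1) * d] = km.cluster_centers_[km.labels_]
    # Residual pass: store the PQ error as absmax-scaled int8.
    residual = W - recon
    scale = float(np.abs(residual).max()) / 127.0
    scale = scale if scale > 0 else 1.0
    q_res = np.clip(np.round(residual / scale), -127, 127).astype(np.int8)
    return codebooks, codes, q_res, scale

def pqr_reconstruct(codebooks, codes, q_res, scale):
    """Look up each subspace codebook, then add back the dequantized residual."""
    blocks = [cb[idx] for cb, idx in zip(codebooks, codes)]
    return np.hstack(blocks) + q_res.astype(np.float64) * scale
```

With a global absmax residual scale, the element-wise reconstruction error is bounded by half the scale, which is where the improvement over PQ alone comes from.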
Motivation
The Problem: There is a significant gap in the current landscape of 8-bit quantization techniques.
- Standard `int8` methods are fast but often result in a severe quality drop (low SNR), especially on layers with complex or outlier-heavy weight distributions.
- High-quality methods like full 256-center K-Means are computationally infeasible for practical use, taking minutes to compress a single layer.
The Proposal: I have developed and rigorously benchmarked a hybrid method, Product Quantization with Residuals (PQ-R), that directly solves this problem. It provides a practical path to achieving near-optimal quality with reasonable performance on commodity CPU hardware.
My benchmarks on various TinyLlama layers show that PQ-R delivers:
- ~3x higher SNR than standard `int8_linear` quantization on challenging layers (e.g., 34.2 dB vs. 25.8 dB).
- Over 20x faster compression than pure K-Means, making it a viable tool for developers.
- Compression of the 1.1B model to ~1 GB while maintaining a high average quality of ~32 dB SNR.
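For context, the dB figures in these benchmarks follow the conventional definition of SNR as the power ratio between the original weights and the reconstruction error; the issue does not spell out its exact metric, so this helper shows the standard formula as an assumption.

```python
import numpy as np

def snr_db(original, reconstructed):
    """Signal-to-noise ratio of a reconstruction, in decibels:
    10 * log10(signal power / error power)."""
    original = np.asarray(original, dtype=np.float64)
    signal = np.sum(original ** 2)
    noise = np.sum((original - reconstructed) ** 2)
    return 10.0 * np.log10(signal / noise)
```

Under this definition, a 0.1% relative per-element error corresponds to roughly 60 dB, so the ~32 dB averages reported above sit well inside the usable range for weight reconstruction.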
This method could provide bitsandbytes users with a powerful new option for high-quality quantization, especially for CPU and edge device deployments.
I've published a full technical write-up on Medium with all the graphs and data:
[Medium]
The full testing suite and detailed analysis are also available on GitHub:
[GitHub]
Your contribution
Absolutely. I would be happy to help with the integration.
The core algorithm is fully implemented in Python (using scikit-learn and NumPy) and has been thoroughly tested, as demonstrated in the linked repository.
While the core `pqr_core.py` implementation is currently proprietary, I am very open to discussing the best way to integrate it into bitsandbytes. This could involve:
- Providing a reference implementation or code snippets for the key parts of the algorithm.
- Collaborating with your team to build an optimized version that fits the library's architecture (e.g., CUDA kernels if desired).
- Discussing licensing options that would work for both the project and myself.
I am confident this method would be a valuable addition to the library and am ready to assist in making it happen.