
A new 8-bit quantization method (PQ-R) with 3x higher SNR for CPU #1759

@AlexSheff

Description

Feature request

This is a feature request to add a new 8-bit quantization method called Product Quantization with Residuals (PQ-R) to the bitsandbytes library.

What is PQ-R?
PQ-R is a hybrid quantization algorithm that combines the strengths of K-Means clustering and residual correction. It consistently delivers a much higher Signal-to-Noise Ratio (SNR) and reconstruction quality compared to standard int8_linear methods, making it ideal for users who need to minimize quality loss on CPU or edge devices.
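To make the idea concrete, here is a minimal sketch of the two-stage scheme described above: a product-quantization stage (per-block K-Means with 256 centroids, so codes fit in 8 bits) followed by a residual-correction stage. The function names, the block size of 8, and the use of a simple int8 linear quantizer for the residual are my assumptions for illustration, not the actual pqr_core.py implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def pqr_quantize(W, block=8, n_centers=256):
    """Sketch of PQ-R: product quantization of sub-vectors + residual correction."""
    rows, cols = W.shape
    assert cols % block == 0, "pad columns to a multiple of the block size"
    blocks = W.reshape(rows, cols // block, block).reshape(-1, block)

    # Stage 1: K-Means codebook over sub-vectors (256 centers -> uint8 codes).
    km = KMeans(n_clusters=n_centers, n_init=4, random_state=0).fit(blocks)
    codes = km.predict(blocks).astype(np.uint8)
    recon = km.cluster_centers_[codes]

    # Stage 2: int8 linear quantization of the residual left by stage 1.
    residual = blocks - recon
    scale = np.abs(residual).max() / 127.0 + 1e-12
    res_q = np.clip(np.round(residual / scale), -127, 127).astype(np.int8)

    return km.cluster_centers_.astype(np.float32), codes, res_q, np.float32(scale), W.shape

def pqr_dequantize(centers, codes, res_q, scale, shape):
    """Reconstruct the weight matrix from codebook, codes, and residual."""
    blocks = centers[codes] + res_q.astype(np.float32) * scale
    return blocks.reshape(shape)
```

In this sketch the stored state is one uint8 code per sub-vector plus an int8 residual per weight and a small shared codebook; the exact storage layout and residual encoding in PQ-R may differ.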

Key Benefits:

  • High Fidelity: Achieves ~3x higher SNR than standard INT8 on challenging model layers.
  • Practical Performance: Over 20x faster than pure K-Means quantization, making it a viable and efficient tool for developers.
  • Proven: The method has been rigorously benchmarked on the TinyLlama 1.1B model, demonstrating its ability to compress the model to ~1 GB while maintaining high quality.

This feature would provide bitsandbytes users with a powerful new quantization option that bridges the gap between fast-but-low-quality int8 and high-quality-but-impractical K-Means.

Motivation

The Problem: There is a significant gap in the current landscape of 8-bit quantization techniques.

  1. Standard int8 methods are fast but often result in a severe quality drop (low SNR), especially on layers with complex or outlier-heavy weight distributions.
  2. High-quality methods like full 256-center K-Means are computationally infeasible for practical use, taking minutes to compress a single layer.

The Proposal: I have developed and rigorously benchmarked a hybrid method, Product Quantization with Residuals (PQ-R), that directly solves this problem. It provides a practical path to achieving near-optimal quality with reasonable performance on commodity CPU hardware.

My benchmarks on various TinyLlama layers show that PQ-R delivers:

  • ~3x higher SNR than standard int8_linear quantization on challenging layers (e.g., 34.2 dB vs 25.8 dB).
  • Over 20x faster compression than pure K-Means, making it a viable tool for developers.
  • It enables compressing a 1.1B model to ~1 GB while maintaining a high average quality of ~32 dB SNR.

This method could provide bitsandbytes users with a powerful new option for high-quality quantization, especially for CPU and edge device deployments.
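For reference, the dB figures above refer to the reconstruction SNR of a quantized layer. The exact evaluation script is in the linked repository; the snippet below is just the standard definition I would use to reproduce such a measurement, not code taken from it.

```python
import numpy as np

def snr_db(original, reconstructed):
    """Signal-to-noise ratio of a reconstructed weight tensor, in decibels."""
    noise = original - reconstructed
    return 10.0 * np.log10(np.sum(original ** 2) / (np.sum(noise ** 2) + 1e-20))
```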

I've published a full technical write-up on Medium with all the graphs and data:
[Medium]

The full testing suite and detailed analysis are also available on GitHub:
[GitHub]

Your contribution

Absolutely. I would be happy to help with the integration.

The core algorithm is fully implemented in Python (using scikit-learn and NumPy) and has been thoroughly tested, as demonstrated in the linked repository.

While the core pqr_core.py implementation is currently proprietary, I am very open to discussing the best way to integrate it into bitsandbytes. This could involve:

  • Providing a reference implementation or code snippets for the key parts of the algorithm.
  • Collaborating with your team to build an optimized version that fits the library's architecture (e.g., CUDA kernels if desired).
  • Discussing licensing options that would work for both the project and myself.

I am confident this method would be a valuable addition to the library and am ready to assist in making it happen.
