
A new 8-bit quantization method (PQ-R) with 3x higher SNR for CPU #1759

@AlexSheff

Description

Feature request

This is a feature request to add a new 8-bit quantization method called Product Quantization with Residuals (PQ-R) to the bitsandbytes library.

What is PQ-R?
PQ-R is a hybrid quantization algorithm that combines the strengths of K-Means clustering and residual correction. It consistently delivers a much higher Signal-to-Noise Ratio (SNR) and reconstruction quality compared to standard int8_linear methods, making it ideal for users who need to minimize quality loss on CPU or edge devices.
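To make the idea concrete, here is a minimal sketch of the two-stage scheme described above: a product-quantization stage (per-block K-Means with 256 centroids, so codes fit in 8 bits) followed by a residual-correction stage. The function names, the block size of 8, and the use of a simple int8 linear quantizer for the residual are my assumptions for illustration, not the actual pqr_core.py implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def pqr_quantize(W, block=8, n_centers=256):
    """Sketch of PQ-R: product quantization of sub-vectors + residual correction."""
    rows, cols = W.shape
    assert cols % block == 0, "pad columns to a multiple of the block size"
    blocks = W.reshape(rows, cols // block, block).reshape(-1, block)

    # Stage 1: K-Means codebook over sub-vectors (256 centers -> uint8 codes).
    km = KMeans(n_clusters=n_centers, n_init=4, random_state=0).fit(blocks)
    codes = km.predict(blocks).astype(np.uint8)
    recon = km.cluster_centers_[codes]

    # Stage 2: int8 linear quantization of the residual left by stage 1.
    residual = blocks - recon
    scale = np.abs(residual).max() / 127.0 + 1e-12
    res_q = np.clip(np.round(residual / scale), -127, 127).astype(np.int8)

    return km.cluster_centers_.astype(np.float32), codes, res_q, np.float32(scale), W.shape

def pqr_dequantize(centers, codes, res_q, scale, shape):
    """Reconstruct the weight matrix from codebook, codes, and residual."""
    blocks = centers[codes] + res_q.astype(np.float32) * scale
    return blocks.reshape(shape)
```

In this sketch the stored state is one uint8 code per sub-vector plus an int8 residual per weight and a small shared codebook; the exact storage layout and residual encoding in PQ-R may differ.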

Key Benefits:

  • High Fidelity: Achieves ~3x higher SNR than standard INT8 on challenging model layers.
  • Practical Performance: Over 20x faster than pure K-Means quantization, making it a viable and efficient tool for developers.
  • Proven: The method has been rigorously benchmarked on the TinyLlama 1.1B model, demonstrating its ability to compress the model to ~1 GB while maintaining high quality.

This feature would provide bitsandbytes users with a powerful new quantization option that bridges the gap between fast-but-low-quality int8 and high-quality-but-impractical K-Means.

Motivation

The Problem: There is a significant gap in the current landscape of 8-bit quantization techniques.

  1. Standard int8 methods are fast but often result in a severe quality drop (low SNR), especially on layers with complex or outlier-heavy weight distributions.
  2. High-quality methods like full 256-center K-Means are computationally infeasible for practical use, taking minutes to compress a single layer.

The Proposal: I have developed and rigorously benchmarked a hybrid method, Product Quantization with Residuals (PQ-R), that directly solves this problem. It provides a practical path to achieving near-optimal quality with reasonable performance on commodity CPU hardware.

My benchmarks on various TinyLlama layers show that PQ-R delivers:

  • ~3x higher SNR than standard int8_linear quantization on challenging layers (e.g., 34.2 dB vs 25.8 dB).
  • Over 20x faster compression than pure K-Means, making it a viable tool for developers.
  • It enables compressing a 1.1B model to ~1 GB while maintaining a high average quality of ~32 dB SNR.

This method could provide bitsandbytes users with a powerful new option for high-quality quantization, especially for CPU and edge device deployments.
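For reference, the dB figures above refer to the reconstruction SNR of a quantized layer. The exact evaluation script is in the linked repository; the snippet below is just the standard definition I would use to reproduce such a measurement, not code taken from it.

```python
import numpy as np

def snr_db(original, reconstructed):
    """Signal-to-noise ratio of a reconstructed weight tensor, in decibels."""
    noise = original - reconstructed
    return 10.0 * np.log10(np.sum(original ** 2) / (np.sum(noise ** 2) + 1e-20))
```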

I've published a full technical write-up on Medium with all the graphs and data:
[Medium]

The full testing suite and detailed analysis are also available on GitHub:
[GitHub]

Your contribution

Absolutely. I would be happy to help with the integration.

The core algorithm is fully implemented in Python (using scikit-learn and NumPy) and has been thoroughly tested, as demonstrated in the linked repository.

While the core pqr_core.py implementation is currently proprietary, I am very open to discussing the best way to integrate it into bitsandbytes. This could involve:

  • Providing a reference implementation or code snippets for the key parts of the algorithm.
  • Collaborating with your team to build an optimized version that fits the library's architecture (e.g., CUDA kernels if desired).
  • Discussing licensing options that would work for both the project and myself.

I am confident this method would be a valuable addition to the library and am ready to assist in making it happen.
