Official code for the ICLR 2025 paper *OSTQuant: Refining Large Language Model Quantization with Orthogonal and Scaling Transformations for Better Distribution Fitting*.
Fig1: Transformation of a batch of data.
Fig2: The overall flow diagram of OSTQuant. The top section of the figure illustrates the global orthogonal transformation.
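As a conceptual sketch (not the official implementation), the core equivalence transformation can be illustrated in a few lines of numpy: an orthogonal matrix `Q` and a per-channel scaling vector `s` are fused into the activations and the following weight matrix so that the layer output is unchanged, while the transformed activations become easier to quantize. The function name `ost_transform` is our own label for this sketch.

```python
import numpy as np

def ost_transform(X, W, Q, s):
    """Apply an orthogonal + scaling equivalence transform (illustrative).

    X: (tokens, d) activations;  W: (d, d_out) weights;
    Q: (d, d) orthogonal matrix; s: (d,) positive per-channel scales.
    The product X @ W is preserved because
    (X Q diag(s)) @ (diag(1/s) Q^T W) = X W.
    """
    X_t = X @ Q * s               # transformed activations
    W_t = (Q.T @ W) / s[:, None]  # compensating transform on the weights
    return X_t, W_t

# Sanity check: the transform is output-preserving.
rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(4, d))
W = rng.normal(size=(d, 3))
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal matrix
s = rng.uniform(0.5, 2.0, size=d)
X_t, W_t = ost_transform(X, W, Q, s)
assert np.allclose(X_t @ W_t, X @ W)
```

Because the transform is exactly output-preserving in floating point, `Q` and `s` can be optimized purely to reshape the activation and weight distributions before quantization.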
Table: Comparison of perplexity on WikiText2 and averaged accuracy on nine zero-shot tasks. Results for SmoothQuant, GPTQ, OmniQuant, AWQ, and QuaRot are obtained with their official code; SpinQuant results for LLaMA-2/3 use the official weights, and LLaMA-1 results use the official code.
| #Bits (W-A-KV) | Method | LLaMA-3 8B Avg.(↑) | Wiki(↓) | LLaMA-2 7B Avg.(↑) | Wiki(↓) | LLaMA-2 13B Avg.(↑) | Wiki(↓) | LLaMA 7B Avg.(↑) | Wiki(↓) | LLaMA 13B Avg.(↑) | Wiki(↓) | LLaMA 30B Avg.(↑) | Wiki(↓) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16-16-16 | FloatingPoint | 68.09 | 6.14 | 65.21 | 5.47 | 67.61 | 4.88 | 64.48 | 5.68 | 66.67 | 5.09 | 70.00 | 4.10 |
| 4-16-16 | RTN | 63.70 | 8.13 | 61.27 | 7.02 | 60.24 | 6.39 | 62.67 | 7.94 | 63.45 | 8.60 | 65.69 | 6.13 |
| | SmoothQuant | 62.79 | 8.12 | 58.88 | 8.03 | 62.03 | 5.86 | 62.24 | 7.46 | 62.69 | 18.75 | 65.69 | 5.80 |
| | GPTQ | 61.03 | 7.43 | 60.86 | 9.84 | 64.71 | 5.79 | 60.15 | 7.93 | 64.36 | 6.58 | 66.95 | 5.26 |
| | OmniQuant | 65.66 | 7.19 | 63.19 | 5.74 | 66.38 | 5.02 | 63.42 | 5.86 | 66.22 | 5.21 | 69.07 | 4.25 |
| | AWQ | 67.03 | 7.36 | 63.89 | 5.83 | 66.25 | 5.07 | 63.30 | 5.97 | 65.58 | 5.28 | 69.44 | 4.28 |
| | QuaRot | 67.27 | 6.53 | 64.30 | 5.62 | 66.95 | 5.00 | 63.40 | 5.83 | 65.91 | 5.20 | 69.73 | 4.27 |
| | SpinQuant | 66.54 | 6.49 | 63.59 | 5.58 | 67.14 | 5.00 | 63.94 | 5.76 | 66.32 | 5.16 | 69.62 | 4.21 |
| | OSTQuant | 67.80 | 6.53 | 64.37 | 5.64 | 67.31 | 4.94 | 64.13 | 5.81 | 66.62 | 5.21 | 69.84 | 4.19 |
| 4-4-16 | RTN | 33.42 | 6e2 | 32.44 | nan | 30.86 | 8e3 | 32.51 | 7e3 | 31.63 | 3e4 | 31.57 | 2e3 |
| | SmoothQuant | 33.04 | 1e3 | 32.13 | nan | 34.26 | 1e3 | 34.42 | 3e2 | 33.29 | 6e2 | 34.64 | 1e3 |
| | GPTQ | 32.98 | 5e2 | 32.72 | nan | 30.11 | 4e3 | 32.12 | 1e3 | 31.51 | 3e3 | 30.88 | 2e3 |
| | QuaRot | 61.69 | 8.02 | 61.87 | 6.05 | 65.13 | 5.35 | 61.76 | 6.22 | 64.46 | 5.50 | 68.14 | 4.57 |
| | SpinQuant | 64.11 | 7.28 | 57.37 | 6.78 | 63.23 | 5.24 | 61.82 | 6.08 | 64.59 | 5.36 | 68.08 | 4.53 |
| | OSTQuant | 65.14 | 7.24 | 63.90 | 5.60 | 66.24 | 5.14 | 62.72 | 6.04 | 65.80 | 5.40 | 68.52 | 4.43 |
| 4-4-4 | RTN | 33.18 | 7e2 | 32.67 | nan | 30.93 | 7e3 | 32.87 | 1e4 | 31.33 | 3e4 | 31.64 | 2e3 |
| | SmoothQuant | 32.96 | 1e3 | 32.12 | nan | 33.36 | 1e3 | 33.32 | 3e2 | 33.28 | 5e2 | 34.65 | 1e3 |
| | GPTQ | 33.71 | 6e2 | 33.52 | nan | 27.85 | 5e3 | 31.80 | 2e3 | 30.63 | 3e3 | 31.07 | 2e3 |
| | OmniQuant | 32.33 | 4e2 | 48.40 | 14.26 | 50.35 | 12.30 | 48.46 | 11.26 | 45.63 | 10.87 | 45.04 | 12.35 |
| | QuaRot | 61.38 | 8.18 | 61.48 | 6.11 | 65.16 | 5.39 | 61.22 | 6.26 | 64.59 | 5.53 | 68.08 | 4.60 |
| | SpinQuant | 64.10 | 7.35 | 62.01 | 5.96 | 64.13 | 5.74 | 61.32 | 6.12 | 64.95 | 5.39 | 68.14 | 4.55 |
| | OSTQuant | 65.37 | 7.29 | 63.18 | 5.91 | 65.41 | 5.25 | 62.55 | 6.07 | 65.43 | 5.40 | 68.20 | 4.42 |
You can reproduce our results with the following commands:

```shell
export CUDA_VISIBLE_DEVICES="0,1,2,3"
sh scripts/w4a16kv16.sh
```
Because personal links could have compromised anonymity during review, the officially optimized orthogonal and smoothing parameters were released as soon as the repository's anonymity was lifted.
We welcome contributions from the research and development community! Whether you're interested in improving the existing features, adding new functionalities, or reporting issues, your input is invaluable.
We plan to extend OSTQuant to FullyQuant, aiming to quantize all activations within a Transformer block. Fig3 below illustrates our design for improving the QSUR of activations across all layers.
If you find OSTQuant useful in your research, please consider citing our paper:
@inproceedings{hu2025ostquant,
title={{OSTQ}uant: Refining Large Language Model Quantization with Orthogonal and Scaling Transformations for Better Distribution Fitting},
author={Xing Hu and Yuan Cheng and Dawei Yang and Zhixuan Chen and Zukang Xu and Jiangyong Yu and Chen Xu and Zhihang Yuan and Zhe Jiang and Sifan Zhou},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=rAcgDBdKnP}
}