
OSTQuant

Official code for the ICLR 2025 paper "OSTQuant: Refining Large Language Model Quantization with Orthogonal and Scaling Transformations for Better Distribution Fitting".

Motivation

Fig. 1: Transformation of a batch of data $X \sim \mathcal{N}(\mu, \Sigma)$ using different methods. Eigenvalues $\lambda_1$ and $\lambda_2$ represent the spread of the distribution along the principal axes after eigenvalue decomposition of $\Sigma$. (a) shows the original distribution, while (b), (c), and (d) illustrate the effects of the Smooth-based, Rotate-based, and our OST-based methods, respectively, on QSUR.
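To make the picture concrete, here is a minimal NumPy sketch (our own illustration, not code from this repository) of the idea behind Fig. 1: an orthogonal transform aligns a correlated Gaussian with its principal axes, and a diagonal scaling then equalizes the spread along those axes, so a uniform quantizer's levels are used more evenly. The "utilization" printed here is a crude ellipsoid-vs-cube proxy, not the paper's exact QSUR definition.

```python
# Sketch of orthogonal + scaling transforms on a correlated Gaussian.
# The utilization measure below is an illustrative proxy, not QSUR itself.
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[4.0, 1.9], [1.9, 1.0]])             # anisotropic covariance
X = rng.multivariate_normal([0.0, 0.0], Sigma, size=10_000)

def utilization(Z):
    """Proxy: volume of the covariance ellipsoid over the volume of the
    per-tensor quantization cube [-absmax, absmax]^d."""
    lam = np.linalg.eigvalsh(np.cov(Z.T))              # lambda_1, lambda_2
    absmax = np.abs(Z).max()
    return np.sqrt(lam).prod() / (2.0 * absmax) ** Z.shape[1]

lam, Q = np.linalg.eigh(np.cov(X.T))                   # principal axes of Sigma
X_rot = X @ Q                                          # orthogonal transform
X_ost = X_rot / np.sqrt(lam)                           # + scaling: equal spread

for name, Z in [("original", X), ("rotate", X_rot), ("OST", X_ost)]:
    print(f"{name:9s} axis stds = {Z.std(axis=0).round(2)}  util ~ {utilization(Z):.4f}")
```

After the combined orthogonal-plus-scaling step the per-axis spreads become equal, which is the property the OST-based method in (d) exploits.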

OSTQuant Diagram

Fig. 2: The overall flow diagram of OSTQuant. The top section of the figure illustrates how the global orthogonal transformation, $R_{res}$, along with the two scaling transformations, $S_{attn}$ and $S_{ffn}$, collaborate within each block to adjust the distributions across the entire network while maintaining computational invariance. The bottom section highlights four equivalent transformation pairs applied to the FFN and Self-Attention layers. Each fully-connected (FC) layer's activation and weight are influenced by one or more of these transformation pairs. During runtime, these transformation pairs are fused with the weights, ensuring minimal runtime overhead.
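The computational-invariance claim can be sanity-checked in a few lines. The following PyTorch sketch (our own naming, not the repository's API) shows why fusing an orthogonal matrix $R$ into one weight and $R^\top$ into the next leaves the output of the pair unchanged, so the transform costs nothing at runtime:

```python
# Invariance check: fusing R / R^T into a pair of adjacent linear layers.
import torch

torch.manual_seed(0)
d = 8
x = torch.randn(4, d)
W1 = torch.randn(d, d)   # first FC weight (computation acts as x @ W1)
W2 = torch.randn(d, d)   # following FC weight

# Random orthogonal R obtained from a QR decomposition.
R, _ = torch.linalg.qr(torch.randn(d, d))

y_ref = (x @ W1) @ W2                  # original two-layer computation
y_ost = (x @ (W1 @ R)) @ (R.T @ W2)    # R fused into W1, R^T into W2

print(torch.allclose(y_ref, y_ost, atol=1e-5))  # True: outputs match
```

Since $R R^\top = I$, the fused weights $W_1 R$ and $R^\top W_2$ compute exactly the same function while the intermediate activation distribution is reshaped, which is what the transformation pairs in Fig. 2 rely on.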

Main Results

Table: Comparison of perplexity on WikiText2 and averaged accuracy on nine zero-shot tasks. Each cell reports 0-shot9 average accuracy (↑) / WikiText2 perplexity (↓). Results for SmoothQuant, GPTQ, OmniQuant, AWQ, and QuaRot are based on the official code; SpinQuant results for LLaMA-2/3 use the official weights, with LLaMA-1 results from the official code.

| #Bits (W-A-KV) | Method | LLaMA-3 8B | LLaMA-2 7B | LLaMA-2 13B | LLaMA 7B | LLaMA 13B | LLaMA 30B |
|---|---|---|---|---|---|---|---|
| 16-16-16 | FloatingPoint | 68.09 / 6.14 | 65.21 / 5.47 | 67.61 / 4.88 | 64.48 / 5.68 | 66.67 / 5.09 | 70.00 / 4.10 |
| 4-16-16 | RTN | 63.70 / 8.13 | 61.27 / 7.02 | 60.24 / 6.39 | 62.67 / 7.94 | 63.45 / 8.60 | 65.69 / 6.13 |
| 4-16-16 | SmoothQuant | 62.79 / 8.12 | 58.88 / 8.03 | 62.03 / 5.86 | 62.24 / 7.46 | 62.69 / 18.75 | 65.69 / 5.80 |
| 4-16-16 | GPTQ | 61.03 / 7.43 | 60.86 / 9.84 | 64.71 / 5.79 | 60.15 / 7.93 | 64.36 / 6.58 | 66.95 / 5.26 |
| 4-16-16 | OmniQuant | 65.66 / 7.19 | 63.19 / 5.74 | 66.38 / 5.02 | 63.42 / 5.86 | 66.22 / 5.21 | 69.07 / 4.25 |
| 4-16-16 | AWQ | 67.03 / 7.36 | 63.89 / 5.83 | 66.25 / 5.07 | 63.30 / 5.97 | 65.58 / 5.28 | 69.44 / 4.28 |
| 4-16-16 | QuaRot | 67.27 / 6.53 | 64.30 / 5.62 | 66.95 / 5.00 | 63.40 / 5.83 | 65.91 / 5.20 | 69.73 / 4.27 |
| 4-16-16 | SpinQuant | 66.54 / 6.49 | 63.59 / 5.58 | 67.14 / 5.00 | 63.94 / 5.76 | 66.32 / 5.16 | 69.62 / 4.21 |
| 4-16-16 | OSTQuant | 67.80 / 6.53 | 64.37 / 5.64 | 67.31 / 4.94 | 64.13 / 5.81 | 66.62 / 5.21 | 69.84 / 4.19 |
| 4-4-16 | RTN | 33.42 / 6e2 | 32.44 / nan | 30.86 / 8e3 | 32.51 / 7e3 | 31.63 / 3e4 | 31.57 / 2e3 |
| 4-4-16 | SmoothQuant | 33.04 / 1e3 | 32.13 / nan | 34.26 / 1e3 | 34.42 / 3e2 | 33.29 / 6e2 | 34.64 / 1e3 |
| 4-4-16 | GPTQ | 32.98 / 5e2 | 32.72 / nan | 30.11 / 4e3 | 32.12 / 1e3 | 31.51 / 3e3 | 30.88 / 2e3 |
| 4-4-16 | QuaRot | 61.69 / 8.02 | 61.87 / 6.05 | 65.13 / 5.35 | 61.76 / 6.22 | 64.46 / 5.50 | 68.14 / 4.57 |
| 4-4-16 | SpinQuant | 64.11 / 7.28 | 57.37 / 6.78 | 63.23 / 5.24 | 61.82 / 6.08 | 64.59 / 5.36 | 68.08 / 4.53 |
| 4-4-16 | OSTQuant | 65.14 / 7.24 | 63.90 / 5.60 | 66.24 / 5.14 | 62.72 / 6.04 | 65.80 / 5.40 | 68.52 / 4.43 |
| 4-4-4 | RTN | 33.18 / 7e2 | 32.67 / nan | 30.93 / 7e3 | 32.87 / 1e4 | 31.33 / 3e4 | 31.64 / 2e3 |
| 4-4-4 | SmoothQuant | 32.96 / 1e3 | 32.12 / nan | 33.36 / 1e3 | 33.32 / 3e2 | 33.28 / 5e2 | 34.65 / 1e3 |
| 4-4-4 | GPTQ | 33.71 / 6e2 | 33.52 / nan | 27.85 / 5e3 | 31.80 / 2e3 | 30.63 / 3e3 | 31.07 / 2e3 |
| 4-4-4 | OmniQuant | 32.33 / 4e2 | 48.40 / 14.26 | 50.35 / 12.30 | 48.46 / 11.26 | 45.63 / 10.87 | 45.04 / 12.35 |
| 4-4-4 | QuaRot | 61.38 / 8.18 | 61.48 / 6.11 | 65.16 / 5.39 | 61.22 / 6.26 | 64.59 / 5.53 | 68.08 / 4.60 |
| 4-4-4 | SpinQuant | 64.10 / 7.35 | 62.01 / 5.96 | 64.13 / 5.74 | 61.32 / 6.12 | 64.95 / 5.39 | 68.14 / 4.55 |
| 4-4-4 | OSTQuant | 65.37 / 7.29 | 63.18 / 5.91 | 65.41 / 5.25 | 62.55 / 6.07 | 65.43 / 5.40 | 68.20 / 4.42 |

Optimize and Evaluate

You can reproduce our results with the following command:

```sh
export CUDA_VISIBLE_DEVICES="0,1,2,3"
sh scripts/w4a16kv16.sh
```
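For reference, the WikiText2 perplexity numbers above follow the standard sliding-window evaluation. The sketch below is our own minimal version using Hugging Face `transformers` and `datasets`, not this repository's evaluation code, and the model id is just an example:

```python
# Standard WikiText-2 perplexity evaluation (illustrative sketch).
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # hypothetical choice for illustration
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

# Tokenize the whole test split as one long stream.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids

seqlen, nlls = 2048, []
for i in range(0, ids.shape[1] // seqlen * seqlen, seqlen):
    batch = ids[:, i : i + seqlen].to(model.device)
    with torch.no_grad():
        # Labels equal to inputs -> HF returns the mean token NLL as .loss
        loss = model(batch, labels=batch).loss
    nlls.append(loss.float() * seqlen)

ppl = torch.exp(torch.stack(nlls).sum() / (len(nlls) * seqlen))
print(f"WikiText-2 ppl: {ppl.item():.2f}")
```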

Optimized Weights

Because personal links could have compromised anonymity during review, the officially optimized orthogonal and scaling parameters were released promptly once the repository was de-anonymized.

Contributing

We welcome contributions from the research and development community! Whether you're interested in improving existing features, adding new functionality, or reporting issues, your input is invaluable.

Future Work

We plan to extend OSTQuant to FullyQuant, aiming to quantize all activations within a Transformer block. Fig. 3 below shows our design for improving the QSUR of activations across all layers.

Fig. 3: OSTQuant for FullyQuant.

Citation

If you find OSTQuant useful in your research, please consider citing our paper:

@inproceedings{hu2025ostquant,
  title={{OSTQ}uant: Refining Large Language Model Quantization with Orthogonal and Scaling Transformations for Better Distribution Fitting},
  author={Xing Hu and Yuan Cheng and Dawei Yang and Zhixuan Chen and Zukang Xu and Jiangyong Yu and Chen Xu and Zhihang Yuan and Zhe Jiang and Sifan Zhou},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=rAcgDBdKnP}
}
