What is the principle of quantization (e.g. Q4_0)? #3546
-
I want to learn the principle behind quantization, but it's a little difficult for me to read the source code, so I would appreciate a formula or description. Thank you!
Replies: 1 comment 1 reply
-
The purpose, of course, is to use less storage space/memory (it can also speed up running the model). The principle is to fit a large range of values into a smaller range while preserving as much accuracy as possible.

Not sure if it'll help you, but I wrote a pretty simple Python implementation of `q8_0` a while back:

#### Mini Q8_0 quantization in Python

```python
import numpy as np

QK8_0 = 32
BLOCK_Q8_0 = np.dtype([('d', '<f2'), ('qs', 'i1', (QK8_0,))])

def quantize_array_q8_0(arr):
    assert arr.size % QK8_0 == 0 and arr.size != 0, f'Bad array size {arr.size}'
    assert arr.dtype == np.float32, f'Bad array type {arr.dtype}'
    n_blocks = arr.size // QK8_0
    blocks = arr.reshape((n_blocks, QK8_0))
    return np.fromiter(map(quantize_block_q8_0, blocks), count=n_blocks, dtype=BLOCK_Q8_0)

def quantize_block_q8_0(blk):
    # d is the block's scale: the largest magnitude in the block maps to 127.
    d = abs(blk).max() / np.float32(127)
    if d == np.float32(0):
        return (np.float16(d), (np.int8(0),) * QK8_0)
    return (np.float16(d), (blk * (np.float32(1) / d)).round())
```

There's a much faster (but possibly harder to understand) version in the repository.
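To see the principle concretely, you can round-trip an array: quantize it, then reconstruct each value as `x ≈ d * qs` and inspect the error. Below is a self-contained sketch of that round trip; it re-declares the same block layout, uses a plain loop instead of `np.fromiter`, and names like `quantize_q8_0`/`dequantize_q8_0` are just for this illustration:

```python
import numpy as np

QK8_0 = 32
BLOCK_Q8_0 = np.dtype([('d', '<f2'), ('qs', 'i1', (QK8_0,))])

def quantize_q8_0(arr):
    # Quantize a float32 array whose size is a multiple of QK8_0.
    blocks = arr.reshape(-1, QK8_0)
    out = np.empty(len(blocks), dtype=BLOCK_Q8_0)
    for i, blk in enumerate(blocks):
        d = np.abs(blk).max() / np.float32(127)
        out[i]['d'] = d                      # scale, stored as float16
        out[i]['qs'] = 0 if d == 0 else (blk / d).round()
    return out

def dequantize_q8_0(blocks):
    # Reconstruct: x ~= d * qs, block by block.
    return (blocks['d'].astype(np.float32)[:, None]
            * blocks['qs'].astype(np.float32)).reshape(-1)
```

With a scale of `max/127`, the rounding error per element is at most about `d/2`, i.e. under half a percent of the block's largest magnitude (plus a tiny extra from storing `d` as float16).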
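Since the question mentions Q4_0 specifically: the idea is the same as Q8_0, just with 16 levels instead of 255 and an implicit offset of 8. The sketch below is based on my reading of ggml's `block_q4_0` and should be treated as an approximation of the real C code; it also skips the step where ggml packs two 4-bit values into each byte:

```python
import numpy as np

QK4_0 = 32  # block size, same as Q8_0

def quantize_block_q4_0(blk):
    # Use the element with the largest magnitude, sign included, so that
    # it lands exactly on quant level 0 (i.e. value -8 after the offset).
    m = blk[np.abs(blk).argmax()]
    d = m / np.float32(-8)
    if d == 0:
        return np.float16(0), np.full(QK4_0, 8, dtype=np.uint8)
    # Scale into [-8, 8], shift by 8.5 and truncate (round-half-up into
    # [0, 16]), then clamp to the 4-bit range [0, 15].
    q = np.clip((blk / d + np.float32(8.5)).astype(np.int8), 0, 15)
    return np.float16(d), q.astype(np.uint8)

def dequantize_block_q4_0(d, q):
    # Undo the offset and rescale: x ~= (q - 8) * d.
    return (q.astype(np.float32) - np.float32(8)) * np.float32(d)
```

With only 16 levels, the per-element error is on the order of `|d|/2`, roughly 1/16 of the block's largest magnitude, which is why Q4_0 is noticeably lossier than Q8_0.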