k-quants are scary! #4140
KerfuffleV2 started this conversation in Show and tell
Replies: 1 comment 3 replies
- Well... when running something like a 3.6TB model, 1.X-bit quantization might come in handy :) But getting perplexity tests is going to be interesting :)
- Story time, because I just got something working!
A while back, I added the ability for `convert.py` to convert models directly to `Q8_0` format. `Q8_0` quantization is pretty simple. (My original version had explicit loops and such; cebtenzzre numpy-ized it and made it actually perform fast enough to be usable.)
Anyway, I've been planning to port other quantizations to Python just to mess around with them. My ultimate goal is to make a 1-bit K-quant type quantization, even though I know it will be useless. I finally got around to it today.
Oh man, k-quants are so much more complicated than something like `Q8_0`. I finally got it working and producing the same output as the C version (at least for a block that just has -128 through 127). Feast your eyes on this:
Expand... if you dare!
Obviously it's not useful at all in its current hacked-together, just-barely-working state, but it actually does seem to produce the same output as the C version.
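The collapsed code above isn't reproduced here, and the real k-quant formats additionally pack bits and store per-sub-block mins. But as a hypothetical toy (not the author's code, and not any actual `Q*_K` layout), here is the structural reason k-quants are hairier than `Q8_0`: they use *two* levels of scales, with the per-sub-block scales themselves quantized to small integers against one float "super scale" per 256-weight super-block.

```python
import numpy as np

QK_K = 256      # k-quant super-block size
SUB_BLOCK = 16  # each super-block splits into 16 sub-blocks of 16 weights


def quantize_two_level(x: np.ndarray):
    """Toy two-level quantization sketch (k-quant flavored, not a real format).

    x: 1-D float32 array whose length is a multiple of 256.
    Returns (d, scales, q): a float super-scale per super-block,
    6-bit integer sub-block scales, and 4-bit-range signed quants.
    """
    sb = x.reshape(-1, QK_K // SUB_BLOCK, SUB_BLOCK)
    # Level 1: per-sub-block absmax.
    sub_amax = np.abs(sb).max(axis=2)                   # shape (n, 16)
    # Level 2: one float scale per super-block; sub-scales become 0..63 ints.
    d = sub_amax.max(axis=1) / 63.0                     # shape (n,)
    inv_d = np.divide(1.0, d, out=np.zeros_like(d), where=d > 0)
    scales = np.round(sub_amax * inv_d[:, None]).astype(np.uint8)
    # Quantize the weights against the *reconstructed* sub-block scale,
    # exactly the extra indirection that makes k-quants fiddly.
    eff = d[:, None] * scales                           # ~= sub_amax
    inv_eff = np.divide(1.0, eff, out=np.zeros_like(eff), where=eff > 0)
    q = np.round(sb * inv_eff[:, :, None] * 7.0)
    q = q.clip(-7, 7).astype(np.int8)
    return d, scales, q


def dequantize_two_level(d, scales, q):
    """Undo both levels: q -> sub-block scale -> super-block scale."""
    eff = d[:, None] * scales / 7.0
    return (q * eff[:, :, None]).reshape(-1)
```

Getting the C and Python versions to agree bit-for-bit means matching rounding behavior at both scale levels, which is where most of the pain comes from.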