convert-hf : support direct Q8_0 conversion #7234
Merged
Changes from 2 commits
@@ -0,0 +1,109 @@
from __future__ import annotations
from typing import Callable

from numpy.typing import DTypeLike

from .constants import GGML_QUANT_SIZES, GGMLQuantizationType
from .lazy import LazyNumpyTensor

import numpy as np


# same as ggml_compute_fp32_to_bf16 in ggml-impl.h
def __compute_fp32_to_bf16(n: np.ndarray) -> np.ndarray:
    n = n.astype(np.float32, copy=False).view(np.int32)
    # force nan to quiet
    n = np.where((n & 0x7fffffff) > 0x7f800000, (n & 0xffff0000) | (64 << 16), n)
    # flush subnormals to zero
    n = np.where((n & 0x7f800000) == 0, n & 0x80000000, n)
    # round to nearest even
    n = (n + (0x7fff + ((n >> 16) & 1))) >> 16
    return n.astype(np.int16)


# This is faster than np.vectorize and np.apply_along_axis because it works on more than one row at a time
def __apply_over_grouped_rows(func: Callable[[np.ndarray], np.ndarray], arr: np.ndarray, otype: DTypeLike, oshape: tuple[int, ...]) -> np.ndarray:
    rows = arr.reshape((-1, arr.shape[-1]))
    osize = 1
    for dim in oshape:
        osize *= dim
    out = np.empty(shape=osize, dtype=otype)
    # compute over groups of 16 rows (arbitrary, but seems good for performance)
    n_groups = (rows.shape[0] // 16) or 1  # at least one group, so inputs with fewer than 16 rows don't hit np.array_split(rows, 0)
    np.concatenate([func(group).ravel() for group in np.array_split(rows, n_groups)], axis=0, out=out)
    return out.reshape(oshape)


def __quantize_bf16_array(n: np.ndarray) -> np.ndarray:
    return __apply_over_grouped_rows(__compute_fp32_to_bf16, arr=n, otype=np.int16, oshape=n.shape)


__quantize_bf16_lazy = LazyNumpyTensor._wrap_fn(__quantize_bf16_array, meta_noop=np.int16)


def quantize_bf16(n: np.ndarray):
    if type(n) is LazyNumpyTensor:
        return __quantize_bf16_lazy(n)
    else:
        return __quantize_bf16_array(n)


__q8_block_size, __q8_type_size = GGML_QUANT_SIZES[GGMLQuantizationType.Q8_0]


def can_quantize_to_q8_0(n: np.ndarray) -> bool:
    return n.shape[-1] % __q8_block_size == 0


# round away from zero
# ref: https://stackoverflow.com/a/59143326/22827863
def np_roundf(n: np.ndarray) -> np.ndarray:
    a = abs(n)
    floored = np.floor(a)
    b = floored + np.floor(2 * (a - floored))
    return np.sign(n) * b


def __quantize_q8_0_shape_change(s: tuple[int, ...]) -> tuple[int, ...]:
    return (*s[:-1], s[-1] // __q8_block_size * __q8_type_size)


# Implementation of Q8_0 with bit-exact same results as reference implementation in ggml-quants.c
def __quantize_q8_0_rows(n: np.ndarray) -> np.ndarray:
    shape = n.shape
    assert shape[-1] % __q8_block_size == 0

    n_blocks = n.size // __q8_block_size

    blocks = n.reshape((n_blocks, __q8_block_size)).astype(np.float32, copy=False)

    d = abs(blocks).max(axis=1, keepdims=True) / 127
    with np.errstate(divide="ignore"):
        id = np.where(d == 0, 0, 1 / d)
    qs = np_roundf(blocks * id)

    # (n_blocks, 2)
    d = d.astype(np.float16).view(np.uint8)
    # (n_blocks, block_size)
    qs = qs.astype(np.int8).view(np.uint8)

    assert d.shape[1] + qs.shape[1] == __q8_type_size

    return np.concatenate([d, qs], axis=1).reshape(__quantize_q8_0_shape_change(shape))


def __quantize_q8_0_array(n: np.ndarray) -> np.ndarray:
    return __apply_over_grouped_rows(__quantize_q8_0_rows, arr=n, otype=np.uint8, oshape=__quantize_q8_0_shape_change(n.shape))


__quantize_q8_0_lazy = LazyNumpyTensor._wrap_fn(
    __quantize_q8_0_array,
    meta_noop=(np.uint8, __quantize_q8_0_shape_change),
)


def quantize_q8_0(data: np.ndarray):
    if type(data) is LazyNumpyTensor:
        return __quantize_q8_0_lazy(data)
    else:
        return __quantize_q8_0_array(data)
This has broken copying of tensors on i-quants (and probably several others as well). Using […] you now get […]
The issue seems to be that the `type_size` is off by 2; however, I don't see why the tensor should be reshaped in this scenario, so this should probably be re-evaluated.
Thanks for finding this! I think it also breaks copying of all other quantized tensors in `gguf-new-metadata`. Sorry about that.

I think I found a way to fix this while also simplifying what happens to the shape in the round-trip between `GGUFReader` and `GGUFWriter`. See #7483