-
oh... I forgot about superblocks and blocks... that's why. I'm going to have to normalize it all first... hopefully there are already functions for that.
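To illustrate what "normalizing" before hashing could look like, here is a minimal sketch that dequantizes a Q8_0 tensor (each 34-byte block is one fp16 scale followed by 32 int8 quants) back to flat f32 weights before hashing, so the block layout doesn't leak into the digest. This is an assumption-laden sketch, not llama.cpp code; `dequantize_q8_0` and `normalized_hash` are hypothetical names.

```python
import hashlib

import numpy as np

def dequantize_q8_0(data: bytes) -> np.ndarray:
    """Dequantize a Q8_0 tensor: each block is one fp16 scale
    followed by 32 int8 quants (34 bytes per block of 32 weights)."""
    BLOCK = 32
    blocks = np.frombuffer(data, dtype=np.uint8).reshape(-1, 2 + BLOCK)
    scales = blocks[:, :2].copy().view(np.float16).astype(np.float32)  # (n, 1)
    quants = blocks[:, 2:].view(np.int8).astype(np.float32)            # (n, 32)
    return (scales * quants).ravel()

def normalized_hash(data: bytes) -> str:
    """Hash the dequantized f32 weights instead of the raw quantized
    bytes, so superblock/block layout details don't affect the hash."""
    weights = dequantize_q8_0(data)
    return hashlib.sha256(weights.tobytes()).hexdigest()
```

Note this only makes hashes comparable across layouts of the *same* quantization; different quant types still round the weights differently, so their dequantized values (and hashes) will generally differ.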
-
@compilade I decided to take a stab at a simpler naive per-tensor hashing scheme for the UUIDv5 generation in #8048. It compiles... on my PC, but I just can't figure out what I'm doing wrong with the Windows CI checks. (FYI, my PR is extremely naive and won't handle split GGUF files... but a good start, I guess?) If this works out, it could be a good base for sanity-checking your proposed change to GGUFWriter to support a hash-only mode during conversion, to precalculate a consistent model UUID.
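A rough sketch of what "naive per-tensor hashing folded into a UUIDv5" could mean (this is not the PR's actual code; the namespace constant and `model_uuid` helper are hypothetical): hash each tensor's raw bytes, then fold the per-tensor digests in sorted name order into one digest, so tensors arriving from split files in any order would still produce the same UUID.

```python
import hashlib
import uuid

# Hypothetical namespace; a real implementation would pin one constant.
UUID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "gguf.example")

def model_uuid(tensors: dict[str, bytes]) -> uuid.UUID:
    """Naive per-tensor hashing: hash each tensor's raw bytes, then
    fold the per-tensor digests (in sorted name order, so file/tensor
    ordering doesn't matter) into a single UUIDv5."""
    overall = hashlib.sha256()
    for name in sorted(tensors):
        digest = hashlib.sha256(tensors[name]).hexdigest()
        overall.update(f"{name}:{digest}".encode())
    return uuid.uuid5(UUID_NAMESPACE, overall.hexdigest())
```

Sorting by tensor name is what would make a split-GGUF-aware version possible later: each shard can contribute its per-tensor digests independently.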
-
Just thinking: could we try quantizing the model using Qx_K, then calculate the hash based purely on the scale of each block? (We don't care about the individual weights inside each block.)
-
FYI, the llama-gguf-hash C program is now in the examples folder and is ready to use on top of main. A README is provided in that folder if you want usage examples.
-
While working with some LoRAs recently, I've been wondering whether it's possible to have a hash method that measures the "distance" between models. A use case I need at the moment is to check whether the distance between these 2 GGUFs is within an acceptable range:
This could also be useful for testing how far a Q4 quantization is from the f16 original of a particular model, for example. (Reminder: the way we measure this now is to compare perplexity, which requires running inference.) One idea I have in mind is to flatten all weights into a 1-D vector, then calculate the sum of the sigmoid of each element:
A model with 7B parameters should have a max sum of 7B, but I expect the actual value to fall around 0 (the median). Edit: this should work under one condition: most weight values must fall between -1 and 1
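The idea above can be sketched as follows. One note: a plain sigmoid sums to ~N/2 for weights near zero, so for the sum to "fall around 0" the squashing function presumably needs to be centered, e.g. 2·sigmoid(x) − 1 (which equals tanh(x/2)) and ranges over (−1, 1), giving max |sum| = 7B for a 7B model. The function names here are hypothetical.

```python
import numpy as np

def sigmoid_signature(weights: np.ndarray) -> float:
    """Flatten all weights to 1-D and sum a centered squashing of each
    element: 2*sigmoid(x) - 1 (= tanh(x/2)) lies in (-1, 1), so the
    sum has magnitude at most the parameter count and sits near 0
    when the weights are centered around 0."""
    x = weights.astype(np.float64).ravel()
    return float(np.sum(2.0 / (1.0 + np.exp(-x)) - 1.0))

def signature_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Hypothetical model 'distance': absolute difference of the two
    signatures, normalized by parameter count."""
    return abs(sigmoid_signature(a) - sigmoid_signature(b)) / a.size
```

Worth noting this is a very lossy summary (two quite different models can share a signature), so it would only be a cheap screening heuristic next to perplexity, not a replacement.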
-
In #7499 (comment) I had a thought about generating a UUID based on the weights of the model rather than the DOI or repository URL.
This is my attempt at that idea. It would be interesting to see whether it makes sense to others.
For the UUIDv5 this is how I calculate it:
The idea behind this approach is that, regardless of the quantization applied, every version of the model will have the same UUID. I'm not sure if this is a good idea, but it's the approach I'm trying out for now. Happy to hear feedback on whether it makes sense to use this to identify a model in the KV store, e.g.
general.uuid
So if I run this program, this is what I expect:
But for some reason... the hashes are still different... even though, in theory, if it's just a straight quantization, the +/- of each weight should be the same? Anyway, this is where my experimentation stands so far... hopefully someone can chime in on a better approach (or say to just not do it, and instead require model creators to generate their own UUID and be responsible for keeping it consistent).
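For reference, a minimal sketch of the general sign-based idea (not the code from this PR; the namespace and `sign_uuid` helper are hypothetical): pack one bit per weight for its sign, hash the bit pattern, and derive a UUIDv5 from it. The sketch also hints at why the hashes diverge in practice: weights very close to zero can round to exactly zero or flip sign under quantization, so the sign pattern is not actually invariant across quant types.

```python
import hashlib
import uuid

import numpy as np

# Placeholder namespace; a real implementation would pin one constant.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "gguf.example")

def sign_uuid(weights: np.ndarray) -> uuid.UUID:
    """UUIDv5 from the sign pattern of the weights: if quantization
    preserved every weight's sign, all quants of a model would map to
    the same UUID. In practice, near-zero weights can flip sign (or
    round to exactly zero) when quantized, breaking the equality."""
    bits = np.packbits(weights.ravel() < 0)  # 1 bit per weight
    return uuid.uuid5(NAMESPACE, hashlib.sha256(bits.tobytes()).hexdigest())
```

Under this scheme two tensors with the same sign pattern but different magnitudes hash identically, while a single flipped sign changes the UUID, which is exactly the brittleness described above.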