[hack] Quantization w/o intermediate f32 converted model #7371
ochafik started this conversation in Show and tell
Replies: 1 comment
-
I'm kinda surprised that no one responded to this. I think you could use it to do the conversion faster by using multiple threads. I don't really code, so I have no idea specifically how, but could you create a sparse output file and map the memory for each tensor in the output file per thread? Something like that.
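A rough sketch of what that could look like, assuming the output file is pre-sized (so it can stay sparse until written) and each worker thread memory-maps the file and fills in only its own tensor's byte range. Everything here — tensor names, offsets, sizes, and the `quantize_tensor` stub — is a hypothetical placeholder, not code from the branch:

```python
import mmap
import os
from concurrent.futures import ThreadPoolExecutor

OUT = "out-Q6_K.gguf"

# (tensor name, byte offset of its data in the output file, quantized byte size)
# -- purely illustrative; real offsets would come from the GGUF tensor infos.
TENSORS = [
    ("token_embd.weight",   4096,      1_000_000),
    ("blk.0.attn_q.weight", 1_004_096,   500_000),
]

def quantize_tensor(name: str, size: int) -> bytes:
    # Hypothetical stand-in for the real quantization (which would call into ggml).
    return bytes(size)

# Pre-size the output; on most Unix filesystems this stays sparse until written to.
with open(OUT, "wb") as f:
    f.truncate(max(off + size for _, off, size in TENSORS))

def worker(entry):
    name, off, size = entry
    data = quantize_tensor(name, size)
    with open(OUT, "r+b") as f, mmap.mmap(f.fileno(), 0) as mm:
        mm[off:off + len(data)] = data  # each thread only writes its own region

with ThreadPoolExecutor(max_workers=os.cpu_count()) as ex:
    list(ex.map(worker, TENSORS))
```

Whether this actually speeds things up depends on whether the bottleneck is the quantization math or the disk I/O, so treat it purely as the shape of the idea.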
-
Hi all!
I was trying to convert & quantize files by myself (couldn't find a memory-mappable Mixtral 8x7b since #6387) and realized I didn't have enough disk space left 😓. Even the recently added direct Q8 quantization (#7234) eats lots of disk (besides, I wanted a Q4 w/o needless loss of quality).
So I did a (Unix-only) dirty hack that has `convert.py` quantize the model on the fly (using subprocess calls to a lightly modified `./quantize`): see this branch.

Here's how it works:
1. `convert.py` first writes a `temp-empty-f32.gguf` that has all the KVs and the tensor metadata, but no tensor data (the tensor infos have bogus data offsets).
2. It then calls `./quantize --skeleton temp-empty-f32.gguf out-Q6_K.gguf Q6_K`: this writes everything to `out-Q6_K.gguf` except the actual quantized tensors (left as zeroes).
3. Finally, it quantizes each tensor in turn, writing it to a `temp-single-f32.gguf` file (which needlessly also contains all the KVs) and calling `./quantize --single-tensor <tensor-name> temp-single-f32.gguf out-Q6_K.gguf Q6_K`. That `--single-tensor` mode just memory-maps the output GGUF in writable mode and writes the quantized data of just that one tensor.
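Put together, the control flow is roughly the sketch below — a paraphrase of the three steps above, not the actual code from the branch; the two `write_*_gguf` helpers and the `tensor_names` argument are placeholders for what `convert.py` does internally:

```python
import subprocess

OUT, QTYPE = "out-Q6_K.gguf", "Q6_K"

def write_skeleton_gguf(path: str) -> None:
    """Placeholder: write all KVs + tensor infos, but no tensor data."""
    raise NotImplementedError

def write_single_tensor_gguf(path: str, name: str) -> None:
    """Placeholder: write a GGUF containing just this one f32 tensor (plus the KVs)."""
    raise NotImplementedError

def run(*args: str) -> None:
    subprocess.run(args, check=True)

def convert_and_quantize(tensor_names: list[str]) -> None:
    # 1) metadata-only skeleton (tensor data offsets are bogus on purpose)
    write_skeleton_gguf("temp-empty-f32.gguf")
    # 2) lay out the final file; the quantized tensor data is left as zeroes
    run("./quantize", "--skeleton", "temp-empty-f32.gguf", OUT, QTYPE)
    # 3) quantize one tensor at a time, straight into the writably mmap'd output
    for name in tensor_names:
        write_single_tensor_gguf("temp-single-f32.gguf", name)
        run("./quantize", "--single-tensor", name, "temp-single-f32.gguf", OUT, QTYPE)
```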
So, this is way too dirty to be mergeable in any form, but a cleaner version could do the quantization directly inside `convert.py` (using something like the ggml Python bindings). I've got half of it working but I'm not sure how useful it is in the grander scheme of things.

One more thing: if you're wary of wearing off your SSD by repeatedly writing 2GB GGUF files w/ just a single tensor, you might want to create them... in RAM. Also, it's probably faster.
On Mac, you can create a RAM-backed 4GB volume at `/Volumes/RAM Disk` (see this gist):
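The gist itself isn't reproduced here; presumably it's the usual `hdiutil` + `diskutil` recipe, which could just as well be driven from Python. The `ram://8388608` sector count is an assumption (8388608 × 512 B = 4 GiB):

```python
import subprocess

# Allocate a 4 GiB RAM device (sectors are 512 bytes) and grab its /dev node.
dev = subprocess.run(
    ["hdiutil", "attach", "-nomount", "ram://8388608"],
    check=True, capture_output=True, text=True,
).stdout.strip()

# Format and mount it; it shows up at "/Volumes/RAM Disk".
subprocess.run(["diskutil", "erasevolume", "HFS+", "RAM Disk", dev], check=True)
```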
"temp-single-f32.gguf"
with"/Volumes/RAM Disk/temp-single-f32.gguf"
inconvert.py
and you're good to go.Beta Was this translation helpful? Give feedback.