Hey #LLM experts: looking at the GGUF format, there are multiple blocks with the same size and the same format. Would it be possible to use something like nvCOMP to compress those to save VRAM? I know this adds a lot of overhead, but it's still cheaper than buying an H100!
Here is an example of the compression itself:
Sample: https://www.youtube.com/watch?v=i8ai0tWhV-0
The recommended compression would be LZ4 HC at level 9 (-9); it has a compression factor of 2.721 (slow to compress, but very fast to decompress). Only decompression has to be fast, since the compression itself would only be done once, when creating the model.
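To make that asymmetry concrete, here is a minimal CPU-side sketch using liblz4: the block contents are a synthetic stand-in for real tensor data, and the decompression step at the end is the part that the proposal would move onto the GPU via nvCOMP.

```cpp
// One-time LZ4 HC compression of a (synthetic) weight block, then the fast
// decompression path that would run at load/inference time. Needs liblz4 (-llz4).
#include <lz4.h>
#include <lz4hc.h>
#include <cstdio>
#include <vector>

int main() {
    // Stand-in for one tensor block; real data would come from the GGUF file.
    std::vector<char> block(1 << 20);
    for (size_t i = 0; i < block.size(); ++i) block[i] = (char)(i % 37);

    // Offline, one-time step: compress with LZ4 HC level 9 (the slow path).
    std::vector<char> compressed(LZ4_compressBound((int)block.size()));
    int csize = LZ4_compress_HC(block.data(), compressed.data(),
                                (int)block.size(), (int)compressed.size(), 9);
    if (csize <= 0) { fprintf(stderr, "compression failed\n"); return 1; }
    printf("compressed %zu -> %d bytes (factor %.3f)\n",
           block.size(), csize, (double)block.size() / csize);

    // Run-time step: decompress (the fast path). On the GPU this is the part
    // nvCOMP's LZ4 decompressor would handle instead of the CPU.
    std::vector<char> out(block.size());
    int dsize = LZ4_decompress_safe(compressed.data(), out.data(),
                                    csize, (int)out.size());
    return dsize == (int)block.size() ? 0 : 1;
}
```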
What I was thinking is that every block gets compressed with nvCOMP and then uploaded to the GPU.
The first block gets decompressed, the matmuls are run on it, and then the decompressed data gets discarded; then the next block, and so forth. (Compression could be applied either to whole blocks or to even smaller parts.) A sketch of this loop is below.
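Here is a rough host-side sketch of that loop, again using liblz4 as a CPU stand-in. In the real thing the compressed blocks and the single scratch buffer would live in VRAM, the decompression would be nvCOMP's LZ4 decompressor on a CUDA stream, and `run_matmuls_on` would be the existing GGML/CUDA kernels; `CompressedBlock` and `run_matmuls_on` are placeholder names.

```cpp
// Sketch of the proposed inference loop: all weights stay compressed,
// one scratch buffer is reused for the block currently being processed.
// CPU stand-in using liblz4; on a GPU this would be nvCOMP + CUDA streams.
#include <lz4.h>
#include <cstdio>
#include <vector>

struct CompressedBlock {
    std::vector<char> data;      // LZ4-compressed tensor bytes (built offline with LZ4 HC)
    int               raw_size;  // uncompressed size in bytes
};

// Placeholder for the existing matmul kernels that consume the weights.
static void run_matmuls_on(const char * /*weights*/, int /*size*/) {}

void forward_pass(const std::vector<CompressedBlock> &blocks, int max_raw_size) {
    std::vector<char> scratch(max_raw_size);   // the only full-size buffer needed
    for (const CompressedBlock &b : blocks) {
        int n = LZ4_decompress_safe(b.data.data(), scratch.data(),
                                    (int)b.data.size(), (int)scratch.size());
        if (n != b.raw_size) { fprintf(stderr, "bad block\n"); return; }
        run_matmuls_on(scratch.data(), n);      // use the block...
        // ...then simply reuse the scratch buffer for the next block.
    }
}
```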
From a test of just compressing the safetensors files, they can be compressed by 20-30%! It really depends on the layer: the first 10 layers usually only achieve about 20%, and after that the compression improves for every layer towards the end. A small measurement sketch is below.
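The numbers above came from a test along these lines: read the file in fixed-size chunks and report how well each chunk compresses. A minimal sketch (the 16 MiB chunk size and treating the raw safetensors payload as one flat byte stream are simplifications):

```cpp
// Report the LZ4 HC level-9 compression ratio of a file, chunk by chunk,
// to see how compressibility varies across the layers. Requires liblz4.
#include <lz4.h>
#include <lz4hc.h>
#include <cstdio>
#include <vector>

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <model.safetensors>\n", argv[0]); return 1; }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    const size_t chunk_size = 16u << 20;   // 16 MiB chunks (arbitrary choice)
    std::vector<char> in(chunk_size), out(LZ4_compressBound((int)chunk_size));

    size_t n, idx = 0;
    while ((n = fread(in.data(), 1, chunk_size, f)) > 0) {
        int c = LZ4_compress_HC(in.data(), out.data(), (int)n, (int)out.size(), 9);
        if (c > 0)
            printf("chunk %zu: %.1f%% saved (factor %.3f)\n",
                   idx, 100.0 * (1.0 - (double)c / n), (double)n / c);
        ++idx;
    }
    fclose(f);
    return 0;
}
```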
Keep in mind this would be lossless compression, so zero quality loss, unlike with quantization.
This would allow, for example, running Llama 3.1 70B with Q4_0 AND this compression scheme on a single 24 GB VRAM home GPU.
Most people run only a single stream and use only about 14% of the available GPU compute when running LLMs (it depends on the model, since memory speed is the limit, not compute), so this idea would only be great for people doing serial requests, not parallel requests.
Another option would be to use a "stream" compression scheme instead of independent blocks...
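For what it's worth, plain LZ4 also has a streaming decode API where each chunk can reference the previous one, which usually improves the ratio for small chunks; the trade-off is that previously decoded data has to stay resident. A rough sketch of the decode side only (the chunks are assumed to have been produced with the matching streaming HC compressor; `Chunk` and `decode_stream` are made-up names for illustration):

```cpp
// Sketch of the "stream" variant: chunks are decompressed with LZ4's
// streaming API, so each chunk may reference the previous one. Previous
// output must stay in place, so decode into one contiguous buffer.
#include <lz4.h>
#include <vector>

struct Chunk { const char *data; int csize; int rawsize; };

// Decompress a sequence of dependent chunks into dst (sized for all of them).
bool decode_stream(const std::vector<Chunk> &chunks, char *dst, int dst_capacity) {
    LZ4_streamDecode_t *sd = LZ4_createStreamDecode();
    int offset = 0;
    for (const Chunk &c : chunks) {
        int n = LZ4_decompress_safe_continue(sd, c.data, dst + offset,
                                             c.csize, dst_capacity - offset);
        if (n != c.rawsize) { LZ4_freeStreamDecode(sd); return false; }
        offset += n;
    }
    LZ4_freeStreamDecode(sd);
    return true;
}
```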
Just an idea, as GPU VRAM is extremely expensive :-(