Hey #LLM experts: looking at the GGUF format, there are multiple blocks with the same size and the same format. Would it be possible to use something like nvCOMP to compress those to save VRAM? I know this adds a lot of overhead, but it's still cheaper than buying an H100!
Here is an example of the compression itself:
Sample: https://www.youtube.com/watch?v=i8ai0tWhV-0
The recommended compression would be LZ4 HC at level 9 (-9); it has a compression factor of 2.721 (slow to compress, but very fast to decompress). Only decompression has to be fast, since the compression itself would only be done once, when creating the model.
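To make that asymmetry concrete, here is a minimal CPU-side sketch using liblz4: the block contents are a synthetic stand-in for real tensor data, and the decompression step at the end is the part that the proposal would move onto the GPU via nvCOMP.

```cpp
// One-time LZ4 HC compression of a (synthetic) weight block, then the fast
// decompression path that would run at load/inference time. Needs liblz4 (-llz4).
#include <lz4.h>
#include <lz4hc.h>
#include <cstdio>
#include <vector>

int main() {
    // Stand-in for one tensor block; real data would come from the GGUF file.
    std::vector<char> block(1 << 20);
    for (size_t i = 0; i < block.size(); ++i) block[i] = (char)(i % 37);

    // Offline, one-time step: compress with LZ4 HC level 9 (the slow path).
    std::vector<char> compressed(LZ4_compressBound((int)block.size()));
    int csize = LZ4_compress_HC(block.data(), compressed.data(),
                                (int)block.size(), (int)compressed.size(), 9);
    if (csize <= 0) { fprintf(stderr, "compression failed\n"); return 1; }
    printf("compressed %zu -> %d bytes (factor %.3f)\n",
           block.size(), csize, (double)block.size() / csize);

    // Run-time step: decompress (the fast path). On the GPU this is the part
    // nvCOMP's LZ4 decompressor would handle instead of the CPU.
    std::vector<char> out(block.size());
    int dsize = LZ4_decompress_safe(compressed.data(), out.data(),
                                    csize, (int)out.size());
    return dsize == (int)block.size() ? 0 : 1;
}
```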
What I was thinking is that every block gets compressed with nvCOMP and then uploaded to the GPU.
The first block gets decompressed, the matmuls are run on it, and then the decompressed data gets discarded; then the next block, and so forth. (Compression could be applied either to whole blocks or to even smaller parts.) A sketch of this loop is below.
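Here is a rough host-side sketch of that loop, again using liblz4 as a CPU stand-in. In the real thing the compressed blocks and the single scratch buffer would live in VRAM, the decompression would be nvCOMP's LZ4 decompressor on a CUDA stream, and `run_matmuls_on` would be the existing GGML/CUDA kernels; `CompressedBlock` and `run_matmuls_on` are placeholder names.

```cpp
// Sketch of the proposed inference loop: all weights stay compressed,
// one scratch buffer is reused for the block currently being processed.
// CPU stand-in using liblz4; on a GPU this would be nvCOMP + CUDA streams.
#include <lz4.h>
#include <cstdio>
#include <vector>

struct CompressedBlock {
    std::vector<char> data;      // LZ4-compressed tensor bytes (built offline with LZ4 HC)
    int               raw_size;  // uncompressed size in bytes
};

// Placeholder for the existing matmul kernels that consume the weights.
static void run_matmuls_on(const char * /*weights*/, int /*size*/) {}

void forward_pass(const std::vector<CompressedBlock> &blocks, int max_raw_size) {
    std::vector<char> scratch(max_raw_size);   // the only full-size buffer needed
    for (const CompressedBlock &b : blocks) {
        int n = LZ4_decompress_safe(b.data.data(), scratch.data(),
                                    (int)b.data.size(), (int)scratch.size());
        if (n != b.raw_size) { fprintf(stderr, "bad block\n"); return; }
        run_matmuls_on(scratch.data(), n);      // use the block...
        // ...then simply reuse the scratch buffer for the next block.
    }
}
```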
From a test of just compressing the safetensors files, they can be compressed by 20-30%! It really depends on the layer: the first 10 layers usually only achieve about 20%, and after that the compression improves for every layer towards the end. A small measurement sketch is below.
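The numbers above came from a test along these lines: read the file in fixed-size chunks and report how well each chunk compresses. A minimal sketch (the 16 MiB chunk size and treating the raw safetensors payload as one flat byte stream are simplifications):

```cpp
// Report the LZ4 HC level-9 compression ratio of a file, chunk by chunk,
// to see how compressibility varies across the layers. Requires liblz4.
#include <lz4.h>
#include <lz4hc.h>
#include <cstdio>
#include <vector>

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <model.safetensors>\n", argv[0]); return 1; }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    const size_t chunk_size = 16u << 20;   // 16 MiB chunks (arbitrary choice)
    std::vector<char> in(chunk_size), out(LZ4_compressBound((int)chunk_size));

    size_t n, idx = 0;
    while ((n = fread(in.data(), 1, chunk_size, f)) > 0) {
        int c = LZ4_compress_HC(in.data(), out.data(), (int)n, (int)out.size(), 9);
        if (c > 0)
            printf("chunk %zu: %.1f%% saved (factor %.3f)\n",
                   idx, 100.0 * (1.0 - (double)c / n), (double)n / c);
        ++idx;
    }
    fclose(f);
    return 0;
}
```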
Keep in mind this would be lossless compression, so zero quality loss, unlike with quantization.
This would allow, for example, running Llama 3.1 70B with Q4_0 AND this compression scheme on a single 24 GB VRAM home GPU.
Most people run only a single stream and use only about 14% of the available GPU compute when running LLMs (it depends on the model, since memory speed is the limit, not compute), so this idea would only be great for people doing serial requests, not parallel requests.
Another option would be to use a "stream" compression scheme instead of independent blocks...
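For what it's worth, plain LZ4 also has a streaming decode API where each chunk can reference the previous one, which usually improves the ratio for small chunks; the trade-off is that previously decoded data has to stay resident. A rough sketch of the decode side only (the chunks are assumed to have been produced with the matching streaming HC compressor; `Chunk` and `decode_stream` are made-up names for illustration):

```cpp
// Sketch of the "stream" variant: chunks are decompressed with LZ4's
// streaming API, so each chunk may reference the previous one. Previous
// output must stay in place, so decode into one contiguous buffer.
#include <lz4.h>
#include <vector>

struct Chunk { const char *data; int csize; int rawsize; };

// Decompress a sequence of dependent chunks into dst (sized for all of them).
bool decode_stream(const std::vector<Chunk> &chunks, char *dst, int dst_capacity) {
    LZ4_streamDecode_t *sd = LZ4_createStreamDecode();
    int offset = 0;
    for (const Chunk &c : chunks) {
        int n = LZ4_decompress_safe_continue(sd, c.data, dst + offset,
                                             c.csize, dst_capacity - offset);
        if (n != c.rawsize) { LZ4_freeStreamDecode(sd); return false; }
        offset += n;
    }
    LZ4_freeStreamDecode(sd);
    return true;
}
```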
Just an idea, as GPU VRAM is extremely expensive :-(