Could performant GPU decompression algorithms increase effective memory? #4276
-
The model data is not compressible; a factor of 20x is impossible. You are better off just running it on the CPU: with mmap, it can dynamically load whatever is needed from disk.
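A minimal sketch of what that mmap behavior looks like, assuming a raw little-endian f16 weights file (the filename and flat layout are illustrative, not llama.cpp's actual on-disk format):

```python
import mmap
import numpy as np

# Map the weights file into the address space without reading it eagerly.
# The OS page cache faults pages in on first touch and can evict them
# under memory pressure, so "loading" a huge file costs almost nothing
# up front.
with open("model-f16.bin", "rb") as f:  # hypothetical filename
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    weights = np.frombuffer(mm, dtype=np.float16)

    # Only the pages backing this slice are actually read from disk.
    layer0 = weights[:4096 * 4096]
    print(layer0[:8])
```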
-
So how do you propose the data is compressed? Standard compression algorithms won't do much; 7z at its best settings only achieves ~4% size reduction. We already have a form of compression by way of quants. The other issue is that loading over the PCIe bus is slow: PCIe 4.0 x16 is about 30 GB/s, which is slower than DDR4-3200 in dual-channel mode, and that is before factoring in any overheads. Pretty much all LLM speed limitations come from memory bandwidth, not INT/FP compute.
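A back-of-envelope version of that bandwidth argument, using nominal figures (real-world numbers are lower) and a 70B f16 model as an assumed example:

```python
# Token generation is roughly bounded by how fast the weights can be
# streamed to the compute units: each token touches ~every parameter once.
model_gb = 70e9 * 2 / 1e9           # 70B params at 2 bytes each = 140 GB

bandwidths = [
    ("PCIe 4.0 x16",        30.0),    # GB/s, nominal
    ("DDR4-3200 dual-chan", 51.2),    # GB/s, theoretical peak
    ("high-end GPU VRAM",   1000.0),  # GB/s, ballpark
]

for name, bw in bandwidths:
    print(f"{name:>20}: <= {bw / model_gb:.2f} tokens/s")
```

Streaming weights over PCIe each token would cap a 140 GB model at well under one token per second, which is why the weights need to live in the fastest memory available.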
-
When you look at compression, in my view, 7zip/Lempel-Ziv and similar algorithms are made mostly for documents. It would certainly be possible to compress the data harder, for example by grouping the tensor values by amplitude and similar tricks, but the required decompression throughput would make it worthless. After all, you need the data in a defined structure to work with it.
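An illustrative experiment along those lines, with synthetic data standing in for real tensors (the amplitude-grouping step is a toy version of the idea, not a proposed format):

```python
import lzma
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a quantized weight tensor: roughly Gaussian
# values squeezed into int8, which is high-entropy data.
w = np.clip(rng.normal(0, 40, 1 << 20), -127, 127).astype(np.int8)

plain = lzma.compress(w.tobytes(), preset=9)

# "Group by amplitude": sorting puts similar magnitudes next to each
# other, which helps LZ-style matching enormously, but it destroys the
# tensor layout, so the permutation would also have to be stored and
# undone before the data is usable.
grouped = lzma.compress(np.sort(w).tobytes(), preset=9)

print(f"raw:     {w.nbytes} bytes")
print(f"plain:   {len(plain)} bytes ({len(plain) / w.nbytes:.0%})")
print(f"grouped: {len(grouped)} bytes ({len(grouped) / w.nbytes:.0%})")
```

The grouped stream compresses far better, but only because the structure you need for inference has been thrown away, which is exactly the trade-off described above.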
-
If we loaded a 180B f16 model into RAM, pre-compressed by a factor of 20 (360 GB down to 18 GB total), then sent pieces to the GPU for decompression (1 GB -> 20 GB) and ran inference, would we be able to run this massive model at a tolerable speed?
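For scale, a rough ceiling on that scheme, assuming every generated token must stream the full compressed model over the bus and using the PCIe figure from the replies above:

```python
params = 180e9
f16_gb = params * 2 / 1e9       # 360 GB uncompressed
ratio = 20
compressed_gb = f16_gb / ratio  # 18 GB compressed, fits in RAM

pcie4_x16 = 30.0                # GB/s, nominal PCIe 4.0 x16

# Even if GPU decompression were free, shipping 18 GB of compressed
# weights per token caps throughput at:
print(f"{pcie4_x16 / compressed_gb:.2f} tokens/s")  # ~1.67

# And the decompressor would have to emit the full uncompressed
# stream for every token:
print(f"{f16_gb:.0f} GB/token of decompressed output")
```

So even granting the (implausible) 20x ratio, the scheme tops out under two tokens per second before counting any decompression or compute cost.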