Could performant GPU decompression algorithms increase effective memory? #4276
-
The model data is not compressible; a factor of 20x is impossible. You are better off just running it on the CPU: with mmap, it can dynamically load whatever is needed from disk.
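A minimal sketch of what that mmap behavior looks like, assuming a raw little-endian f16 weights file (the filename and flat layout are illustrative, not llama.cpp's actual on-disk format):

```python
import mmap
import numpy as np

# Map the weights file into the address space without reading it eagerly.
# The OS page cache faults pages in on first touch and can evict them
# under memory pressure, so "loading" a huge file costs almost nothing
# up front.
with open("model-f16.bin", "rb") as f:  # hypothetical filename
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    weights = np.frombuffer(mm, dtype=np.float16)

    # Only the pages backing this slice are actually read from disk.
    layer0 = weights[:4096 * 4096]
    print(layer0[:8])
```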
-
So how do you propose the data is compressed? Standard compression algorithms won't do much; 7z at its best settings only achieves ~4% size reduction. We already have a form of compression by way of quants. The other issue is that loading over the PCIe bus is slow: PCIe 4.0 x16 is about 30 GB/s, which is slower than DDR4-3200 in dual-channel mode, and that is before factoring in any overheads. Pretty much all LLM speed limitations come from memory bandwidth, not INT/FP compute.
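A back-of-envelope version of that bandwidth argument, using nominal figures (real-world numbers are lower) and a 70B f16 model as an assumed example:

```python
# Token generation is roughly bounded by how fast the weights can be
# streamed to the compute units: each token touches ~every parameter once.
model_gb = 70e9 * 2 / 1e9           # 70B params at 2 bytes each = 140 GB

bandwidths = [
    ("PCIe 4.0 x16",        30.0),    # GB/s, nominal
    ("DDR4-3200 dual-chan", 51.2),    # GB/s, theoretical peak
    ("high-end GPU VRAM",   1000.0),  # GB/s, ballpark
]

for name, bw in bandwidths:
    print(f"{name:>20}: <= {bw / model_gb:.2f} tokens/s")
```

Streaming weights over PCIe each token would cap a 140 GB model at well under one token per second, which is why the weights need to live in the fastest memory available.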
-
When you look at compression, in my view, 7zip/Lempel-Ziv and similar algorithms are made mostly for documents. It would certainly be possible to compress the data harder, for example by grouping the tensor values by amplitude and similar tricks, but the required decompression throughput would make it worthless. After all, you need the data in a defined structure to work with it.
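An illustrative experiment along those lines, with synthetic data standing in for real tensors (the amplitude-grouping step is a toy version of the idea, not a proposed format):

```python
import lzma
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a quantized weight tensor: roughly Gaussian
# values squeezed into int8, which is high-entropy data.
w = np.clip(rng.normal(0, 40, 1 << 20), -127, 127).astype(np.int8)

plain = lzma.compress(w.tobytes(), preset=9)

# "Group by amplitude": sorting puts similar magnitudes next to each
# other, which helps LZ-style matching enormously, but it destroys the
# tensor layout, so the permutation would also have to be stored and
# undone before the data is usable.
grouped = lzma.compress(np.sort(w).tobytes(), preset=9)

print(f"raw:     {w.nbytes} bytes")
print(f"plain:   {len(plain)} bytes ({len(plain) / w.nbytes:.0%})")
print(f"grouped: {len(grouped)} bytes ({len(grouped) / w.nbytes:.0%})")
```

The grouped stream compresses far better, but only because the structure you need for inference has been thrown away, which is exactly the trade-off described above.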
-
If we loaded a 180B f16 model into RAM, pre-compressed by a factor of 20 (360 GB down to 18 GB total), then sent pieces to the GPU for decompression (1 GB -> 20 GB) and ran inference, would we be able to run this massive model at a tolerable speed?
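For scale, a rough ceiling on that scheme, assuming every generated token must stream the full compressed model over the bus and using the PCIe figure from the replies above:

```python
params = 180e9
f16_gb = params * 2 / 1e9       # 360 GB uncompressed
ratio = 20
compressed_gb = f16_gb / ratio  # 18 GB compressed, fits in RAM

pcie4_x16 = 30.0                # GB/s, nominal PCIe 4.0 x16

# Even if GPU decompression were free, shipping 18 GB of compressed
# weights per token caps throughput at:
print(f"{pcie4_x16 / compressed_gb:.2f} tokens/s")  # ~1.67

# And the decompressor would have to emit the full uncompressed
# stream for every token:
print(f"{f16_gb:.0f} GB/token of decompressed output")
```

So even granting the (implausible) 20x ratio, the scheme tops out under two tokens per second before counting any decompression or compute cost.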