-
I'm using an Apple M2 Mac and I was astonished to find that almost zero memory appeared to be used. I went looking for where the model is stored and found that my system is caching 8.4 GB of files in system RAM. Macs use what Apple calls "unified memory", so the data never seems to move: the GPU appears to access the cached file data directly, because the Mac GPU has access to all of the system RAM. But if you are running Linux on an Intel CPU with an Nvidia card on a PCIe bus, the data has to move through the llama.cpp process and then across the PCIe bus into the card's VRAM.

The first question is how many layers you are telling llama.cpp to load into VRAM. This is controlled by a runtime parameter (`-ngl` / `--n-gpu-layers`), so you can specify how much VRAM is used, from zero up to offloading everything that will fit. So the answer is that you get to choose how much VRAM is used, and the rest of the model runs on the CPU out of system RAM.

I am now wondering whether quantized models have to be unpacked. Does a 5-bit parameter get unpacked into an 8-bit byte? Does the answer depend on the exact GPU or CPU being used?
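On the unpacking question: as far as I understand, in llama.cpp (ggml) the quantized weights stay packed in RAM/VRAM, and the CPU or GPU kernels dequantize each small block on the fly while computing, so a 5-bit weight only becomes a wider value transiently in registers. The exact block layouts and kernels differ per quantization type and backend, but the general idea looks roughly like this sketch (a toy 5-bit block format for illustration, not the actual ggml Q5 layout):

```python
import numpy as np

# Toy packed 5-bit block: 32 weights stored as
#   - 16 bytes holding the low 4 bits (two weights per byte)
#   - 4 bytes holding the 32 high (5th) bits
#   - one float scale for the whole block
# This illustrates block-wise quantization in general; it is NOT the
# exact ggml Q5 layout.

def unpack_5bit_block(low_nibbles: bytes, high_bits: bytes, scale: float) -> np.ndarray:
    """Expand 32 packed 5-bit values into float32 weights."""
    lows = np.frombuffer(low_nibbles, dtype=np.uint8)   # 16 bytes
    lo = np.empty(32, dtype=np.uint8)
    lo[0::2] = lows & 0x0F                              # low nibble of each byte
    lo[1::2] = lows >> 4                                # high nibble of each byte

    hi_word = int.from_bytes(high_bits, "little")       # 32 packed bits
    hi = np.array([(hi_word >> i) & 1 for i in range(32)], dtype=np.uint8)

    q = (hi << 4) | lo                                  # 5-bit integers in [0, 31]
    return scale * (q.astype(np.float32) - 16.0)        # re-centre and rescale

# One block of 32 weights (all-zero packed data just to show the call)
weights = unpack_5bit_block(bytes(16), bytes(4), scale=0.01)
print(weights)   # 32 float32 values
```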
-
Is it possible to deterministically predict the memory requirements (specifically interested in VRAM on Nvidia) that a model will consume?
I'm assuming it's something like: (n_parameters × bytes per parameter) + batch-size overhead.
How do I also take quantization into account?
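Roughly, yes. The weights take about n_parameters × bits_per_weight / 8 bytes, so quantization just changes the bits-per-weight term (e.g. 16 for fp16, around 4-5 for the Q4/Q5 formats). On top of that comes the KV cache, which grows with context length, plus some backend and scratch-buffer overhead. A back-of-the-envelope calculator, with illustrative (assumed, not measured) numbers:

```python
def estimate_memory_gb(
    n_params_b: float,        # parameters in billions, e.g. 7 for a 7B model
    bits_per_weight: float,   # e.g. 16 for fp16, roughly 4-5 for Q4/Q5 quants
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_len: int,
    kv_bytes: int = 2,        # fp16 KV cache entries
    overhead_gb: float = 0.5, # rough allowance for activations / scratch buffers
) -> float:
    """Very rough estimate; real usage varies with backend, batch size, etc."""
    weights_gb = n_params_b * 1e9 * bits_per_weight / 8 / 1e9
    # KV cache: K and V tensors per layer, per token
    kv_gb = 2 * n_layers * n_kv_heads * head_dim * context_len * kv_bytes / 1e9
    return weights_gb + kv_gb + overhead_gb

# Example: a hypothetical 7B model (32 layers, 8 KV heads, head_dim 128)
# at ~4.5 bits per weight with a 4096-token context:
print(round(estimate_memory_gb(7, 4.5, 32, 8, 128, 4096), 2), "GB")  # ~4.97 GB
```

If you only offload part of the model with `-ngl`, only those layers' share of the weights (and, depending on settings, of the KV cache) ends up in VRAM; the rest stays in system RAM.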