Why does llama.cpp use so much VRAM (and RAM)? #9784
-
Hello everyone, I recently started using llama.cpp and I have a question: why does llama.cpp use so much VRAM (GPU RAM) and RAM? I have an 8 GB mobile GPU and I'm trying to run Gemma 2 9B quantized in Q4_K_M (this model). According to my calculations, the model should take up roughly (9 000 000 000 parameters × 4 bits), or about 4.5 GB, so I would expect the GPU RAM usage to be somewhere in that ballpark. However, when I run llama-server with the aforementioned GGUF file, I see that I'm using 7.8 GB of VRAM and around 7 GB of RAM. Why is that? Can someone please explain? I run llama-server on Windows like this:
I have also tried another 3B Q4_K_M quantized model, and while it still uses all of the GPU memory, it works much, much faster. So I guess GPU utilization works, but I still wonder why llama.cpp uses so much VRAM (and RAM). Thank you in advance!
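P.S. Even with a more careful estimate (approximate figures: Gemma 2 9B has roughly 9.24B parameters, and Q4_K_M averages roughly 4.8 bits per weight rather than a flat 4 bits, since it mixes 4-bit and 6-bit blocks plus scales), the weights alone only come to about 5.5 GB, still well below the 7.8 GB I'm seeing:

```python
# Rough weight-memory estimate for the quantized model (approximate values).
# Gemma 2 9B: ~9.24e9 parameters; Q4_K_M averages ~4.8 bits per weight.
params = 9.24e9
bits_per_weight = 4.8
print(f"{params * bits_per_weight / 8 / 1e9:.2f} GB")  # ~5.5 GB, close to the GGUF file size
print(f"{params * 4 / 8 / 1e9:.2f} GB")                # ~4.6 GB with the naive flat-4-bit assumption
```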
-
KV cache size. By default llama.cpp uses the model's maximum context size, so you need to reduce it if you are running out of memory. Gemma 2 9B defaults to a context of 8192, which takes about 2.8 GB of memory; together with the VRAM buffer used for the batch size, that adds up to just less than 8 GB. Try reducing the context size, for example as sketched below.
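A minimal sketch, assuming Gemma 2 9B's published layout (42 layers, 8 KV heads, head dimension 256) and llama.cpp's default f16 KV cache: the ~2.8 GB figure falls out of the standard KV-cache formula, and a smaller value passed via -c / --ctx-size shrinks it proportionally.

```python
# KV cache bytes = 2 (K and V) * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_element
# Assumed Gemma 2 9B values: 42 layers, 8 KV heads, head_dim 256; f16 cache -> 2 bytes/element.
n_layers, n_kv_heads, head_dim = 42, 8, 256
bytes_per_element = 2  # f16 (llama.cpp's default KV cache type)

def kv_cache_gb(n_ctx: int) -> float:
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_element / 1e9

print(f"{kv_cache_gb(8192):.2f} GB")  # ~2.82 GB at Gemma 2's default 8192 context
print(f"{kv_cache_gb(2048):.2f} GB")  # ~0.70 GB with a reduced context, e.g. -c 2048
```

With a reduced context like this, the ~5.6 GB of weights plus the cache and compute buffers should fit an 8 GB card much more comfortably.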
-
I'm trying to load this model https://huggingface.co/mariboo/Llama-3.2-70B-Instruct on 2x 80 GB A100s and getting an out-of-memory error (it wants an extra ~26 GB on device 0). The command is:
Can't quite get my head around the memory requirement here though:
So, what am I missing here?
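One possible factor, assuming this is a Llama-3-70B-style architecture (80 layers, 8 KV heads, head dimension 128, 131072-token training context; these are assumed values): without an explicit -c, llama.cpp sizes the KV cache for the full training context, and at f16 that is already enormous on its own.

```python
# Same KV-cache formula, applied to an assumed Llama-3-70B-style model at its
# full 131072-token training context, with llama.cpp's default f16 cache (2 bytes/element).
n_layers, n_kv_heads, head_dim, n_ctx = 80, 8, 128, 131072
kv_bytes = 2 * n_layers * n_ctx * n_kv_heads * head_dim * 2
print(f"{kv_bytes / 2**30:.1f} GiB")  # ~40 GiB for the KV cache alone, on top of the weights
```

If that matches your setup, the per-device KV and compute buffers on top of the weights could plausibly account for the extra allocation; passing a smaller -c should reduce it considerably.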
-
Anyway, if I run with
-
I still have a question regarding this: Today I measured my GPU usage via Task Manager. When I run llama-server with the model I mentioned in my original post (Gemma 2 9B quantized in Q4_K_M) like this:
I can see that I'm using 7.1 GB of my VRAM. And this is not the entire model; it is 41 out of 43 layers (according to the output of the above command). The model itself (its GGUF file) is around 5.63 GB, and it is not fully loaded onto my GPU, as I only loaded 41 layers. My question is: where does the ~1.47 GB of VRAM go? Tagging @wooooyeahhhh, @ggerganov and @slaren as I saw you were active on this discussion.
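If I try a rough accounting myself (approximate, assuming the weights are spread evenly across the 43 layers), the unexplained part is actually a bit larger than 1.47 GB, since only 41 of the 43 layers' weights are in VRAM:

```python
# Approximate breakdown of the 7.1 GB of VRAM reported by Task Manager.
# Assumes the 5.63 GB of weights are spread evenly over the 43 layers.
gguf_gb = 5.63
weights_on_gpu = gguf_gb * 41 / 43
print(f"weights on GPU : ~{weights_on_gpu:.2f} GB")        # ~5.37 GB
print(f"everything else: ~{7.1 - weights_on_gpu:.2f} GB")  # ~1.73 GB: KV cache plus other buffers/overhead
```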
Look at the messages printed while loading the model; llama.cpp will tell you the size of (almost) every backend buffer it allocates. The CUDA runtime also needs some memory that may not be accounted for elsewhere.