Why does llama.cpp use so much VRAM (and RAM)? #9784
-
Hello everyone, I recently started using llama.cpp and I have a question: why does llama.cpp use so much VRAM (GPU RAM) and RAM? I have an 8 GB mobile GPU and I'm trying to run Gemma 2 9B quantized in Q4_K_M (this model). According to my calculations, the model should take up roughly (9 000 000 000 parameters × 4 bits), or about 4.5 GB, so I would expect the GPU RAM usage to be somewhere in that ballpark. However, when I run llama-server with the aforementioned GGUF file, I see that I'm using 7.8 GB of VRAM and around 7 GB of RAM. Why is that? Can someone please explain? I run llama-server on Windows like this:
I have also tried another 3B Q4_K_M quantized model, and while it still uses all of the GPU memory, it works much, much faster. So I guess GPU utilization works, but I still wonder why llama.cpp uses so much VRAM (and RAM). Thank you in advance!
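P.S. Even with a more careful estimate (approximate figures: Gemma 2 9B has roughly 9.24B parameters, and Q4_K_M averages roughly 4.8 bits per weight rather than a flat 4 bits, since it mixes 4-bit and 6-bit blocks plus scales), the weights alone only come to about 5.5 GB, still well below the 7.8 GB I'm seeing:

```python
# Rough weight-memory estimate for the quantized model (approximate values).
# Gemma 2 9B: ~9.24e9 parameters; Q4_K_M averages ~4.8 bits per weight.
params = 9.24e9
bits_per_weight = 4.8
print(f"{params * bits_per_weight / 8 / 1e9:.2f} GB")  # ~5.5 GB, close to the GGUF file size
print(f"{params * 4 / 8 / 1e9:.2f} GB")                # ~4.6 GB with the naive flat-4-bit assumption
```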
-
KV cache size. By default llama.cpp uses the model's maximum context size, so you need to reduce it if you are running out of memory. Gemma 2 9B defaults to a context of 8192, which takes about 2.8 GB of memory; together with the VRAM buffer used for the batch size, that adds up to just less than 8 GB. Try reducing the context size, for example as sketched below.
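A minimal sketch, assuming Gemma 2 9B's published layout (42 layers, 8 KV heads, head dimension 256) and llama.cpp's default f16 KV cache: the ~2.8 GB figure falls out of the standard KV-cache formula, and a smaller value passed via -c / --ctx-size shrinks it proportionally.

```python
# KV cache bytes = 2 (K and V) * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_element
# Assumed Gemma 2 9B values: 42 layers, 8 KV heads, head_dim 256; f16 cache -> 2 bytes/element.
n_layers, n_kv_heads, head_dim = 42, 8, 256
bytes_per_element = 2  # f16 (llama.cpp's default KV cache type)

def kv_cache_gb(n_ctx: int) -> float:
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_element / 1e9

print(f"{kv_cache_gb(8192):.2f} GB")  # ~2.82 GB at Gemma 2's default 8192 context
print(f"{kv_cache_gb(2048):.2f} GB")  # ~0.70 GB with a reduced context, e.g. -c 2048
```

With a reduced context like this, the ~5.6 GB of weights plus the cache and compute buffers should fit an 8 GB card much more comfortably.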
-
I'm trying to load this model https://huggingface.co/mariboo/Llama-3.2-70B-Instruct on 2x 80 GB A100s and getting an out-of-memory error (it wants an extra ~26 GB on device 0). The command is:
Can't quite get my head around the memory requirement here though:
So, what am I missing here?
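One possible factor, assuming this is a Llama-3-70B-style architecture (80 layers, 8 KV heads, head dimension 128, 131072-token training context; these are assumed values): without an explicit -c, llama.cpp sizes the KV cache for the full training context, and at f16 that is already enormous on its own.

```python
# Same KV-cache formula, applied to an assumed Llama-3-70B-style model at its
# full 131072-token training context, with llama.cpp's default f16 cache (2 bytes/element).
n_layers, n_kv_heads, head_dim, n_ctx = 80, 8, 128, 131072
kv_bytes = 2 * n_layers * n_ctx * n_kv_heads * head_dim * 2
print(f"{kv_bytes / 2**30:.1f} GiB")  # ~40 GiB for the KV cache alone, on top of the weights
```

If that matches your setup, the per-device KV and compute buffers on top of the weights could plausibly account for the extra allocation; passing a smaller -c should reduce it considerably.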
-
Anyway, if I run with
-
I still have a question regarding this: Today I measured my GPU usage via Task Manager. When I run llama-server with the model I mentioned in my original post (Gemma 2 9B quantized in Q4_K_M) like this:
I can see that I'm using 7.1 GB of my VRAM. And this is not the entire model; it is 41 out of 43 layers (according to the output of the above command). The model itself (its GGUF file) is around 5.63 GB, and it is not fully loaded onto my GPU, as I only loaded 41 layers. My question is: where does the ~1.47 GB of VRAM go? Tagging @wooooyeahhhh, @ggerganov and @slaren as I saw you were active on this discussion.
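If I try a rough accounting myself (approximate, assuming the weights are spread evenly across the 43 layers), the unexplained part is actually a bit larger than 1.47 GB, since only 41 of the 43 layers' weights are in VRAM:

```python
# Approximate breakdown of the 7.1 GB of VRAM reported by Task Manager.
# Assumes the 5.63 GB of weights are spread evenly over the 43 layers.
gguf_gb = 5.63
weights_on_gpu = gguf_gb * 41 / 43
print(f"weights on GPU : ~{weights_on_gpu:.2f} GB")        # ~5.37 GB
print(f"everything else: ~{7.1 - weights_on_gpu:.2f} GB")  # ~1.73 GB: KV cache plus other buffers/overhead
```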
Look at the messages printed while loading the model; llama.cpp will tell you the size of (almost) every backend buffer it allocates. The CUDA runtime also needs some memory that may not be accounted for elsewhere.