I see that irrespective of the `ngl` value, the KV buffer is always allocated on the GPU. I want to use an `n_ctx` that is larger than the GPU VRAM and offload layers selectively, but the KV buffer always being allocated on the GPU does not allow this.

Command 1 (all GPU layers):
```
./build/bin/llama-batched-bench -m models/llama-2-7b-Q4_0_4_8_aarch64.gguf --no-display-prompt --ignore-eos -fa -b 24576 -ub 24576 -c 24576 -npp 256 -ntg 256 -npl 48 -ngl 33 -t 72
```
Output 1:
Command 2 (no GPU layers):
```
./build/bin/llama-batched-bench -m models/llama-2-7b-Q4_0_4_8_aarch64.gguf --no-display-prompt --ignore-eos -fa -b 24576 -ub 24576 -c 24576 -npp 256 -ntg 256 -npl 48 -ngl 0 -t 72
```
Output 2:
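For what it's worth, the context-level setting that appears to control this is `offload_kqv` in `llama_context_params` (exposed on the CLI as `-nkvo` / `--no-kv-offload` in builds whose common argument parser includes it). Below is a minimal sketch, assuming the C API from `llama.h` around this version; exact function names may differ between releases.

```cpp
#include "llama.h"

int main() {
    llama_backend_init();

    // Offload all layer weights to the GPU, as in Command 1.
    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 33;

    llama_model * model = llama_load_model_from_file(
        "models/llama-2-7b-Q4_0_4_8_aarch64.gguf", mparams);

    // Large context, but keep the KV cache in host memory:
    // offload_kqv = false places the KV buffer (and the attention ops
    // that read it) on the CPU backend instead of the GPU.
    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx       = 24576;
    cparams.offload_kqv = false;

    llama_context * ctx = llama_new_context_with_model(model, cparams);

    // ... run inference ...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```

On the command line the equivalent would be appending `-nkvo` to the benchmark invocations above, again assuming the flag is available in this build. Note the trade-off: the whole KV cache then stays in host memory and the attention over cached tokens runs on the CPU, rather than the cache being split per offloaded layer.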