Replies: 3 comments 1 reply
-
There should be no allocations after the first few evaluations. Please include a log that shows the OOM error.
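One way to produce such a log is to sample `nvidia-smi` periodically (e.g. `nvidia-smi --query-gpu=timestamp,memory.used --format=csv -l 10 >> gpu_mem.csv`) while the run is in progress. A small helper like the sketch below (hypothetical, not part of llama.cpp; it assumes samples are already parsed into `(seconds, MiB)` pairs) can then turn those samples into a growth rate, to quantify the "creeping up" before filing the report:

```python
# Hypothetical helper: estimate GPU-memory growth from periodic
# nvidia-smi samples. Each sample is (seconds_since_start, mebibytes_used).

def leak_rate_mib_per_hour(samples):
    """Least-squares slope of memory use over time, in MiB/hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return (num / den) * 3600.0  # MiB/s -> MiB/h

# Synthetic example: a 100 MiB/h creep sampled every 10 minutes.
demo = [(i * 600, 20000 + i * 600 * 100 / 3600) for i in range(12)]
print(round(leak_rate_mib_per_hour(demo), 1))  # prints 100.0
```

A steadily positive slope over hours would support the "creeping up" observation; a flat slope would point to a one-time spike instead.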
-
Thanks for the answer. But if the static buffers are larger than free memory, why doesn't it fail outright? It's such a waste of time. And I actually plan to use the slot KV cache elsewhere with CPU-only inference, so that's fine.
-
I am trying to process a large prompt. I tuned KV cache quantization and offloaded as many layers as possible to the GPU; it starts processing and all looks fine... and after a few hours it fails with OOM.
nvidia-smi shows llama.cpp's GPU memory usage steadily creeping up. Is this expected behavior, and if so, how much reserve should I keep? It seems like about 10% is needed.
I searched for similar discussions, but those were about allocation failures before submitting any prompt. This one happens in the middle of processing.
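For context, the setup described is along these lines (a sketch only — the model path, layer count, and context size are placeholders, not taken from the report; `-ngl`, `-c`, `-ctk`, and `-ctv` are the standard llama.cpp flags for GPU offload, context size, and KV cache quantization):

```shell
# Sketch of the kind of invocation described (model path, layer count,
# and context size are placeholders). -ngl offloads layers to the GPU;
# -ctk/-ctv select a quantized KV cache type (q8_0 here). Leaving ~10%
# of VRAM free gives headroom for temporary compute buffers.
./llama-cli -m model.gguf -f prompt.txt \
  -c 32768 -ngl 40 \
  -ctk q8_0 -ctv q8_0
```

If memory still creeps up with settings like these, capturing periodic nvidia-smi readings alongside the run makes the failure much easier to diagnose.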