Misc. bug: out of memory error after PR #13746 (#14740)

@socram8888

Description

Name and Version

Tested on the latest master and on multiple previous versions. Currently at:

version: 1313 (086cf81)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Command line

CUDA_VISIBLE_DEVICES=2 llama-server -m models/Phi-4-mini-instruct.BF16.gguf -ngl 80 --host :: --port 31420 --jinja --ctx-size 0 --no-kv-offload -t 2

Problem description & steps to reproduce

Hello. I updated today from f125b8d, and llama-server is now throwing out-of-memory errors. I bisected the regression to PR #13746. I am not familiar with the codebase, so I am not sure what exactly it changed.

The server has 256 GB of RAM and four older Nvidia M60 cards, and runs Ubuntu 22.04 with CUDA 12.9. I'm running Phi-4-mini-instruct, which until this commit fit narrowly by offloading the KV cache to the CPU.
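For context, here is a quick back-of-the-envelope using the hyperparameters printed in the logs below (a Python sketch; the variable names are just for illustration, not llama.cpp identifiers). The full-context f16 KV cache comes out to exactly the 16384 MiB reported on the CPU, and the BF16 weights alone leave only roughly 700 MiB of VRAM headroom on a single M60:

n_layer      = 32        # phi3.block_count
n_ctx        = 131072    # phi3.context_length
n_embd_k_gqa = 1024      # per-layer K width (from print_info)
n_embd_v_gqa = 1024      # per-layer V width (from print_info)
bytes_f16    = 2

kv_mib = n_layer * n_ctx * (n_embd_k_gqa + n_embd_v_gqa) * bytes_f16 / 2**20
print(kv_mib)                       # 16384.0 -> matches "CPU KV buffer size = 16384.00 MiB"

vram_free_mib = 8040                # "using device CUDA0 (Tesla M60) - 8040 MiB free"
weights_mib   = 7317                # "CUDA0 model buffer size = 7317.01 MiB"
print(vram_free_mib - weights_mib)  # ~723 MiB left for compute buffers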

Commit eb39499 was the last one that worked:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Tesla M60, compute capability 5.2, VMM: yes
build: 941 (eb39499) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
system info: n_threads = 2, n_threads_batch = 2, total_threads = 56

system_info: n_threads = 2 (n_threads_batch = 2) / 56 | CUDA : ARCHS = 500,610,700,750,800,860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

main: binding port with default address family
main: HTTP server is listening, hostname: ::, port: 31420, http threads: 55
main: loading model
srv    load_model: loading model '../models/Phi-4-mini-instruct.BF16.gguf'
llama_model_load_from_file_impl: using device CUDA0 (Tesla M60) - 8040 MiB free
llama_model_loader: loaded meta data with 35 key-value pairs and 196 tensors from ../models/Phi-4-mini-instruct.BF16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi3
llama_model_loader: - kv   1:              phi3.rope.scaling.attn_factor f32              = 1.190238
llama_model_loader: - kv   2:                               general.type str              = model
llama_model_loader: - kv   3:                               general.name str              = Phi 4 Mini Instruct
llama_model_loader: - kv   4:                       general.organization str              = Microsoft
llama_model_loader: - kv   5:                           general.finetune str              = instruct
llama_model_loader: - kv   6:                           general.basename str              = Phi-4
llama_model_loader: - kv   7:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   8:                         general.size_label str              = mini
llama_model_loader: - kv   9:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv  10:                        phi3.context_length u32              = 131072
llama_model_loader: - kv  11:  phi3.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  12:                      phi3.embedding_length u32              = 3072
llama_model_loader: - kv  13:                   phi3.feed_forward_length u32              = 8192
llama_model_loader: - kv  14:                           phi3.block_count u32              = 32
llama_model_loader: - kv  15:                  phi3.attention.head_count u32              = 24
llama_model_loader: - kv  16:               phi3.attention.head_count_kv u32              = 8
llama_model_loader: - kv  17:      phi3.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  18:                  phi3.rope.dimension_count u32              = 96
llama_model_loader: - kv  19:                        phi3.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  20:                          general.file_type u32              = 32
llama_model_loader: - kv  21:              phi3.attention.sliding_window u32              = 262144
llama_model_loader: - kv  22:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  23:                         tokenizer.ggml.pre str              = gpt-4o
llama_model_loader: - kv  24:                      tokenizer.ggml.tokens arr[str,200064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,200064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  26:                      tokenizer.ggml.merges arr[str,199742]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "e r", ...
llama_model_loader: - kv  27:                tokenizer.ggml.bos_token_id u32              = 199999
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 200020
llama_model_loader: - kv  29:            tokenizer.ggml.unknown_token_id u32              = 3251
llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 200029
llama_model_loader: - kv  31:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  32:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {% for message in messages %}{% if me...
llama_model_loader: - kv  34:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   67 tensors
llama_model_loader: - type bf16:  129 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = BF16
print_info: file size   = 7.15 GiB (16.00 BPW)
load_hparams: Phi SWA is currently disabled - results might be suboptimal for some models (see https://github.com/ggml-org/llama.cpp/pull/13676)
load: special tokens cache size = 14
load: token to piece cache size = 1.3333 MB
print_info: arch             = phi3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 3072
print_info: n_layer          = 32
print_info: n_head           = 24
print_info: n_head_kv        = 8
print_info: n_rot            = 96
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 3
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 8192
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 4096
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 3B
print_info: model params     = 3.84 B
print_info: general.name     = Phi 4 Mini Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 200064
print_info: n_merges         = 199742
print_info: BOS token        = 199999 '<|endoftext|>'
print_info: EOS token        = 200020 '<|end|>'
print_info: EOT token        = 199999 '<|endoftext|>'
print_info: UNK token        = 3251 '�'
print_info: PAD token        = 200029 '<|PAD▁TOKEN|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 199999 '<|endoftext|>'
print_info: EOG token        = 200020 '<|end|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors:        CUDA0 model buffer size =  7317.01 MiB
load_tensors:   CPU_Mapped model buffer size =  1172.25 MiB
......................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 131072
llama_context: n_ctx_per_seq = 131072
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 10000.0
llama_context: freq_scale    = 1
llama_context:  CUDA_Host  output buffer size =     0.76 MiB
llama_kv_cache_unified:        CPU KV buffer size = 16384.00 MiB
llama_kv_cache_unified: size = 16384.00 MiB (131072 cells,  32 layers,  1 seqs), K (f16): 8192.00 MiB, V (f16): 8192.00 MiB
llama_context:      CUDA0 compute buffer size =   396.75 MiB
llama_context:  CUDA_Host compute buffer size =  6410.01 MiB
llama_context: graph nodes  = 1414
llama_context: graph splits = 66
common_init_from_params: setting dry_penalty_last_n to ctx_size = 131072
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 131072
main: model loaded
main: chat template, chat_template: {% for message in messages %}{% if message['role'] == 'system' and 'tools' in message and message['tools'] is not none %}{{ '<|' + message['role'] + '|>' + message['content'] + '<|tool|>' + message['tools'] + '<|/tool|>' + '<|end|>' }}{% else %}{{ '<|' + message['role'] + '|>' + message['content'] + '<|end|>' }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|assistant|>' }}{% endif %}, example_format: '<|system|>You are a helpful assistant<|end|><|user|>Hello<|end|><|assistant|>Hi there<|end|><|user|>How are you?<|end|><|assistant|>'
main: server is listening on http://:::31420 - starting the main loop
srv  update_slots: all slots are idle
+-----------------------------------------+------------------------+----------------------+
|   2  Tesla M60                      On  |   00000000:0A:00.0 Off |                  Off |
| N/A   37C    P0             37W /  150W |    8009MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

At 12d0188, though, it fails to start because it runs out of GPU memory:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Tesla M60, compute capability 5.2, VMM: yes
build: 942 (12d0188) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
system info: n_threads = 2, n_threads_batch = 2, total_threads = 56

system_info: n_threads = 2 (n_threads_batch = 2) / 56 | CUDA : ARCHS = 500,610,700,750,800,860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

main: binding port with default address family
main: HTTP server is listening, hostname: ::, port: 31420, http threads: 55
main: loading model
srv    load_model: loading model '../models/Phi-4-mini-instruct.BF16.gguf'
llama_model_load_from_file_impl: using device CUDA0 (Tesla M60) - 8040 MiB free
llama_model_loader: loaded meta data with 35 key-value pairs and 196 tensors from ../models/Phi-4-mini-instruct.BF16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi3
llama_model_loader: - kv   1:              phi3.rope.scaling.attn_factor f32              = 1.190238
llama_model_loader: - kv   2:                               general.type str              = model
llama_model_loader: - kv   3:                               general.name str              = Phi 4 Mini Instruct
llama_model_loader: - kv   4:                       general.organization str              = Microsoft
llama_model_loader: - kv   5:                           general.finetune str              = instruct
llama_model_loader: - kv   6:                           general.basename str              = Phi-4
llama_model_loader: - kv   7:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   8:                         general.size_label str              = mini
llama_model_loader: - kv   9:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv  10:                        phi3.context_length u32              = 131072
llama_model_loader: - kv  11:  phi3.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  12:                      phi3.embedding_length u32              = 3072
llama_model_loader: - kv  13:                   phi3.feed_forward_length u32              = 8192
llama_model_loader: - kv  14:                           phi3.block_count u32              = 32
llama_model_loader: - kv  15:                  phi3.attention.head_count u32              = 24
llama_model_loader: - kv  16:               phi3.attention.head_count_kv u32              = 8
llama_model_loader: - kv  17:      phi3.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  18:                  phi3.rope.dimension_count u32              = 96
llama_model_loader: - kv  19:                        phi3.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  20:                          general.file_type u32              = 32
llama_model_loader: - kv  21:              phi3.attention.sliding_window u32              = 262144
llama_model_loader: - kv  22:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  23:                         tokenizer.ggml.pre str              = gpt-4o
llama_model_loader: - kv  24:                      tokenizer.ggml.tokens arr[str,200064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,200064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  26:                      tokenizer.ggml.merges arr[str,199742]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "e r", ...
llama_model_loader: - kv  27:                tokenizer.ggml.bos_token_id u32              = 199999
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 200020
llama_model_loader: - kv  29:            tokenizer.ggml.unknown_token_id u32              = 3251
llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 200029
llama_model_loader: - kv  31:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  32:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {% for message in messages %}{% if me...
llama_model_loader: - kv  34:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   67 tensors
llama_model_loader: - type bf16:  129 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = BF16
print_info: file size   = 7.15 GiB (16.00 BPW)
load_hparams: Phi SWA is currently disabled - results might be suboptimal for some models (see https://github.com/ggml-org/llama.cpp/pull/13676)
load: special tokens cache size = 14
load: token to piece cache size = 1.3333 MB
print_info: arch             = phi3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 3072
print_info: n_layer          = 32
print_info: n_head           = 24
print_info: n_head_kv        = 8
print_info: n_rot            = 96
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 3
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 8192
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 4096
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 3B
print_info: model params     = 3.84 B
print_info: general.name     = Phi 4 Mini Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 200064
print_info: n_merges         = 199742
print_info: BOS token        = 199999 '<|endoftext|>'
print_info: EOS token        = 200020 '<|end|>'
print_info: EOT token        = 199999 '<|endoftext|>'
print_info: UNK token        = 3251 '�'
print_info: PAD token        = 200029 '<|PAD▁TOKEN|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 199999 '<|endoftext|>'
print_info: EOG token        = 200020 '<|end|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors:        CUDA0 model buffer size =  7317.01 MiB
load_tensors:   CPU_Mapped model buffer size =  1172.25 MiB
......................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 131072
llama_context: n_ctx_per_seq = 131072
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 10000.0
llama_context: freq_scale    = 1
llama_context:  CUDA_Host  output buffer size =     0.76 MiB
llama_kv_cache_unified:        CPU KV buffer size = 16384.00 MiB
llama_kv_cache_unified: size = 16384.00 MiB (131072 cells,  32 layers,  1 seqs), K (f16): 8192.00 MiB, V (f16): 8192.00 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 6940.00 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 7277119488
graph_reserve: failed to allocate compute buffers
llama_init_from_model: failed to initialize the context: failed to allocate compute pp buffers
common_init_from_params: failed to create context with model '../models/Phi-4-mini-instruct.BF16.gguf'
srv    load_model: failed to load model, '../models/Phi-4-mini-instruct.BF16.gguf'
srv    operator(): operator(): cleaning up before exit...
main: exiting due to model loading error

I can work around it by lowering the context size to 32K, but I am not sure why this has only now become a problem.
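Comparing the two logs, it looks like the large context-dependent compute buffer that the working build kept in host memory ("CUDA_Host compute buffer size = 6410.01 MiB", with only 396.75 MiB on CUDA0) is now being reserved on CUDA0 (~6940 MiB), which cannot fit next to the ~7.3 GiB of weights. Below is a rough Python sketch of how that buffer scales with context, assuming (I may be wrong) that the f32 attention-score matrix per micro-batch is the dominant term; the names are illustrative, not llama.cpp identifiers:

n_head    = 24        # phi3.attention.head_count
n_ubatch  = 512       # llama_context: n_ubatch
bytes_f32 = 4

def kq_scratch_mib(n_ctx):
    # one f32 score matrix of shape n_ctx x n_ubatch per attention head
    return n_head * n_ubatch * n_ctx * bytes_f32 / 2**20

print(kq_scratch_mib(131072))   # 6144.0 -> same order as the 6410/6940 MiB buffers above
print(kq_scratch_mib(32768))    # 1536.0 -> the same term at the 32K context I use as a workaround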

First Bad Commit

12d0188
