Replies: 1 comment
Interesting, playing around with the other flags seemed to get the model
I think the thing that worked was setting
Unfortunately it didn't help with the
Further trial and error and guessing got this one to run:

How would I find out these parameters?
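One thing that has helped me inspect the model side of those parameters is dumping the GGUF metadata. A minimal sketch, assuming the `gguf` Python package is installed (it ships the `gguf-dump` tool; the exact key prefix varies by architecture):

```sh
# Reader tooling from llama.cpp's gguf-py package
pip install gguf

# Dump all metadata; *.block_count is the layer count that
# --n-gpu-layers / LLAMA_ARG_N_GPU_LAYERS is measured against
gguf-dump llama-3_3-nemotron-super-49b-v1-q6_k.gguf | grep -i block_count
```

llama.cpp also prints the same values (n_layer, buffer sizes, and so on) in its startup log, so a verbose load attempt is another way to read them off.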
---
Sometimes I run into an error like the following when trying to load models, for example `llama-3_3-nemotron-super-49b-v1-q6_k.gguf`, which should (maybe?) comfortably fit in VRAM.
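(My rough math: Q6_K stores roughly 6.56 bits per weight, so a 49B-parameter model works out to about 49e9 × 6.56 / 8 ≈ 40 GB for the weights alone, before the KV cache and compute buffers are added on top, hence the "maybe".)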
My environment:

If I increase `LLAMA_ARG_N_GPU_LAYERS` to 80, it complains about running out of memory:

Even if the model is too large to fit entirely in VRAM, I thought it was possible to also utilize the CPU and system RAM? Or am I mistaken?
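For reference, what I understood partial offload to mean: any layers not covered by `--n-gpu-layers` stay in system RAM and run on the CPU. This is the kind of invocation I've been trying (the value 40 is just an illustrative guess, not a tested number for this model):

```sh
# Offload 40 layers to the GPU; the remaining layers stay in system RAM
# and run on the CPU. Lower the value until the GPU buffer allocation
# stops failing.
LLAMA_ARG_N_GPU_LAYERS=40 llama-server -m llama-3_3-nemotron-super-49b-v1-q6_k.gguf

# Equivalent flag form with llama-cli:
llama-cli -m llama-3_3-nemotron-super-49b-v1-q6_k.gguf -ngl 40
```

If I understand it correctly, setting the value at or above the model's total layer count requests a full offload, so failing at 80 would just mean the whole model plus its buffers doesn't fit in VRAM.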
Here's the output of `llama-cli --version`: