Can't figure out why 50 GiB qwen2 model won't load into 3 x 3090 instance on runpod #8326

carljones3000 · 2024-07-05T10:25:35Z

carljones3000
Jul 5, 2024

Hello, I figure a 50.70 GiB model should fit on 3 3090's. (3 x 24 = 72) However for some reason it's getting a memory issue when trying to allocate 17200.03 MiB on device 0 (cudaMalloc). I have no idea what this issue is here. Thinking maybe I am missing something pretty simple. FYI: this is on runpod, however I have had similar issues when running on a local machine. Any help is appreciated.

Details below:

root@0e7605def17f:/home/llama.cpp# ./llama-server -m ../dolphin-2.9.2-qwen2-72b-Q5_K_M-00001-of-00002.gguf -n 400 -ngl 33 --host 0.0.0.0 -n -1 --split-mode layer --tensor-split 1,1,1
INFO [ main] build info | tid="138702059261952" timestamp=1720170580 build=3317 commit="8e558309"
INFO [ main] system info | tid="138702059261952" timestamp=1720170580 n_threads=128 n_threads_batch=-1 total_threads=256 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 27 key-value pairs and 963 tensors from ../dolphin-2.9.2-qwen2-72b-Q5_K_M-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.name str = dolphin-2.9.2-qwen2-72b
llama_model_loader: - kv 2: qwen2.block_count u32 = 80
llama_model_loader: - kv 3: qwen2.context_length u32 = 131072
llama_model_loader: - kv 4: qwen2.embedding_length u32 = 8192
llama_model_loader: - kv 5: qwen2.feed_forward_length u32 = 29568
llama_model_loader: - kv 6: qwen2.attention.head_count u32 = 64
llama_model_loader: - kv 7: qwen2.attention.head_count_kv u32 = 8
llama_model_loader: - kv 8: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 9: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 17
llama_model_loader: - kv 11: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 12: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,152064] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 18: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 19: general.quantization_version u32 = 2
llama_model_loader: - kv 20: quantize.imatrix.file str = /models/dolphin-2.9.2-qwen2-72b-GGUF/...
llama_model_loader: - kv 21: quantize.imatrix.dataset str = /training_data/calibration_datav3.txt
llama_model_loader: - kv 22: quantize.imatrix.entries_count i32 = 560
llama_model_loader: - kv 23: quantize.imatrix.chunks_count i32 = 128
llama_model_loader: - kv 24: split.no u16 = 0
llama_model_loader: - kv 25: split.count u16 = 2
llama_model_loader: - kv 26: split.tensors.count i32 = 963
llama_model_loader: - type f32: 401 tensors
llama_model_loader: - type q5_1: 40 tensors
llama_model_loader: - type q8_0: 40 tensors
llama_model_loader: - type q5_K: 441 tensors
llama_model_loader: - type q6_K: 41 tensors
llm_load_vocab: special tokens cache size = 421
llm_load_vocab: token to piece cache size = 0.9352 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 152064
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 29568
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 72.71 B
llm_load_print_meta: model size = 50.70 GiB (5.99 BPW)
llm_load_print_meta: general.name = dolphin-2.9.2-qwen2-72b
llm_load_print_meta: BOS token = 11 ','
llm_load_print_meta: EOS token = 151645 '<|im_end|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: EOT token = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size = 1.69 MiB
llm_load_tensors: offloading 33 repeating layers to GPU
llm_load_tensors: offloaded 33/81 layers to GPU
llm_load_tensors: CPU buffer size = 30157.15 MiB
llm_load_tensors: CPU buffer size = 974.56 MiB
llm_load_tensors: CUDA0 buffer size = 6782.74 MiB
llm_load_tensors: CUDA1 buffer size = 6709.49 MiB
llm_load_tensors: CUDA2 buffer size = 7295.49 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 131072
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 24064.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 5632.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 5632.00 MiB
llama_kv_cache_init: CUDA2 KV buffer size = 5632.00 MiB
llama_new_context_with_model: KV self size = 40960.00 MiB, K (f16): 20480.00 MiB, V (f16): 20480.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 1.16 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 17200.03 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 18035542016
llama_new_context_with_model: failed to allocate compute buffers
llama_init_from_gpt_params: error: failed to create context with model '../dolphin-2.9.2-qwen2-72b-Q5_K_M-00001-of-00002.gguf'
ERR [ load_model] unable to load model | tid="138702059261952" timestamp=1720170626 model="../dolphin-2.9.2-qwen2-72b-Q5_K_M-00001-of-00002.gguf"
munmap_chunk(): invalid pointer
Aborted (core dumped)
root@0e7605def17f:/home/llama.cpp# nvidia-smi
Fri Jul 5 09:15:40 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.78 Driver Version: 550.78 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 On | 00000000:25:00.0 Off | N/A |
| 30% 30C P8 31W / 350W | 1MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 3090 On | 00000000:61:00.0 Off | N/A |
| 30% 25C P8 27W / 350W | 1MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA GeForce RTX 3090 On | 00000000:A1:00.0 Off | N/A |
| 30% 29C P8 40W / 350W | 1MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
root@0e7605def17f:/home/llama.cpp# free -h
total used free shared buff/cache available
Mem: 1.0Ti 34Gi 223Gi 95Mi 749Gi 966Gi
Swap: 0B 0B 0B
root@0e7605def17f:/home/llama.cpp#

chrs1997 · 2024-07-25T08:49:03Z

chrs1997
Jul 25, 2024

Exactly the same issue with Llama 3.1 70B, on a 3 x 4060 Ti build on Linux (even "allocating 17220.00 MiB on device 0" is identical. This leads to an out of memory error, since 1 x 4060 Ti has 16 GB VRAM)

@JohannesGaessler I read all your threads about Mutli GPU support, but I can't figure this out:

Input:

./llama-cli -m models/llama-3.1-70b.gguf -p "You are a helpful assistant" -cnv -ngl 3

Output:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: yes ggml_cuda_init: found 3 CUDA devices: Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes Device 1: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes Device 2: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes llm_load_tensors: ggml ctx size = 1.35 MiB llm_load_tensors: offloading 3 repeating layers to GPU llm_load_tensors: offloaded 3/81 layers to GPU llm_load_tensors: CPU buffer size = 38110.61 MiB llm_load_tensors: CUDA0 buffer size = 459.06 MiB llm_load_tensors: CUDA1 buffer size = 459.06 MiB llm_load_tensors: CUDA2 buffer size = 459.06 MiB ................................................................................................... llama_new_context_with_model: n_ctx = 131072 llama_new_context_with_model: n_batch = 2048 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 500000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CUDA_Host KV buffer size = 39424.00 MiB llama_kv_cache_init: CUDA0 KV buffer size = 512.00 MiB llama_kv_cache_init: CUDA1 KV buffer size = 512.00 MiB llama_kv_cache_init: CUDA2 KV buffer size = 512.00 MiB llama_new_context_with_model: KV self size = 40960.00 MiB, K (f16): 20480.00 MiB, V (f16): 20480.00 MiB llama_new_context_with_model: CUDA_Host output buffer size = 0.49 MiB ggml_backend_cuda_buffer_type_alloc_buffer: allocating 17220.00 MiB on device 0: cudaMalloc failed: out of memory ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 18056480768 llama_new_context_with_model: failed to allocate compute buffers

Removing the ngl option or setting it to lower/higher values than 3, doesn't resolve the problem. But it does change some of the numbers

Compiled llama.cpp with:

make -j32 GGML_CUDA=1 GGML_CUDA_FORCE_CUBLAS=true

The same error happens when I compile without GGML_CUDA_FORCE_CUBLAS=true

(3x 4060 Ti equals 48 GB VRAM, enough to fully load the 39 GB 4-bit llama-3.1-70b.gguf model. In Ollama, loading the exact same model works and uses multiple GPUs without error, splitting evenly by loading 13 GB onto each GPU)

2 replies

chrs1997 Jul 25, 2024

Is this related to 128k context?

JohannesGaessler Jul 25, 2024
Collaborator

Very likely. Try reducing the context and/or add -fa -ctk q8_0 -ctv q8_0.

8XXD8 · 2024-07-25T09:10:51Z

8XXD8
Jul 25, 2024

llama_new_context_with_model: KV self size = 40960.00 MiB, K (f16): 20480.00 MiB, V (f16): 20480.00 MiB

You don't have enough vram for the context

0 replies

JohannesGaessler · 2024-07-25T09:29:48Z

JohannesGaessler
Jul 25, 2024
Collaborator

llama_kv_cache_init: CUDA_Host KV buffer size = 24064.00 MiB

50 + 24 = 74 > 72. By default the full context of the model is used but in this case this is quite a significant amount. Try adding -fa -ctk q8_0 -ctv q8_0.

3 replies

chrs1997 Jul 25, 2024

Thanks. Adding -fa and increasing -ngl makes it load and run, both with and without the -ctk and -ctv options

-ctk q8_0 -ctv q8_0 allows me to increase -ngl to 56 without out of memory error, while with only -fa I can increase -ngl to 40 with results in slightly slower token generation

-fa only results in ~2.5 tokens/second, while -ctk -ctv and -ngl 56 results in ~3-4 tokens/second on 3 x 4060 Ti 16 GB, which seems slow

In combination with -c 32768 (25% context size): ./llama-cli -m models/llama-3.1-70b.gguf -p "You are a helpful assistant" -cnv -ngl 81 -fa -ctk q8_0 -ctv q8_0 -c 32768

Results in 6-7 tokens/second, which is okay. However I can see in nvtop that they only use about 50% of my multi GPU's compute each, so it's not really pushing my GPUs to their limit, only their memory is close to 100%

chrs1997 Jul 25, 2024

I tried this for the XL 405B model: ./llama-cli -m models/llama-3.1-405b.gguf -p "You are a helpful assistant" -cnv -ngl 20 -fa -ctk q8_0 -ctv q8_0 -c 32768

While it almost loads 100% into the 3 GPUs VRAM, it almost uses no GPU compute (0%-25%), resulting in very slow generation (0.34 tokens per second) - while using most of our 32 CPU threads

JohannesGaessler Jul 25, 2024
Collaborator

If you can only fit 20 GPU layers you will be inevitably bottlenecked by the CPU/RAM speed and your GPUs will spend most of their time waiting. Consider this plot:

Can't figure out why 50 GiB qwen2 model won't load into 3 x 3090 instance on runpod #8326

Uh oh!

carljones3000 Jul 5, 2024

Replies: 3 comments · 5 replies

Uh oh!

Uh oh!

chrs1997 Jul 25, 2024

Uh oh!

chrs1997 Jul 25, 2024

Uh oh!

JohannesGaessler Jul 25, 2024 Collaborator

Uh oh!

Uh oh!

8XXD8 Jul 25, 2024

Uh oh!

JohannesGaessler Jul 25, 2024 Collaborator

Uh oh!

chrs1997 Jul 25, 2024

Uh oh!

Uh oh!

chrs1997 Jul 25, 2024

Uh oh!

JohannesGaessler Jul 25, 2024 Collaborator

carljones3000
Jul 5, 2024

Replies: 3 comments 5 replies

chrs1997
Jul 25, 2024

JohannesGaessler Jul 25, 2024
Collaborator

8XXD8
Jul 25, 2024

JohannesGaessler
Jul 25, 2024
Collaborator

JohannesGaessler Jul 25, 2024
Collaborator