Can't figure out why 50 GiB qwen2 model won't load into 3 x 3090 instance on runpod #8326
Replies: 3 comments 5 replies
-
Exactly the same issue here with Llama 3.1 70B, on a 3 x 4060 Ti build on Linux (even "allocating 17220.00 MiB on device 0" is identical, which leads to an out-of-memory error since a single 4060 Ti has 16 GB VRAM). @JohannesGaessler I read all your threads about multi-GPU support, but I can't figure this out. Input:
Output:
Removing the -ngl option or setting it to values lower or higher than 3 doesn't resolve the problem, but it does change some of the numbers. Compiled llama.cpp with:
The same error happens when I compile without it. (3 x 4060 Ti equals 48 GB VRAM, enough to fully load the 39 GB 4-bit llama-3.1-70b.gguf model. In Ollama, loading the exact same model works and uses multiple GPUs without error, splitting evenly by loading 13 GB onto each GPU.)
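One likely reason Ollama copes with the same file is that it uses a much smaller default context window (2048 tokens at the time, if memory serves), whereas llama-server defaults to the model's full training context, so its KV cache and compute buffers are far larger. A hypothetical llama-server invocation approximating the Ollama-style setup (file name and values below are illustrative, not taken from this reply's actual command):
# illustrative only: fully offload across the three 4060 Ti cards with a small context
./llama-server -m llama-3.1-70b.gguf -ngl 99 -c 4096 --split-mode layer --tensor-split 1,1,1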
-
You don't have enough VRAM for the context.
-
50 + 24 = 74 > 72. By default the full context of the model is used, but in this case that is quite a significant amount. Try adding
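(The suggested flag is cut off above; it almost certainly amounts to capping the context size so the KV cache and compute buffers shrink. A hedged sketch using the standard -c/--ctx-size option, with an illustrative value rather than whatever the original reply suggested:)
# illustrative: cap the context at 8192 tokens instead of the full 131072-token training context
# (-ngl 99 asks for full offload; the original command used 33. Adding -fa enables flash attention and shrinks compute buffers further.)
./llama-server -m ../dolphin-2.9.2-qwen2-72b-Q5_K_M-00001-of-00002.gguf -ngl 99 -c 8192 --host 0.0.0.0 --split-mode layer --tensor-split 1,1,1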
-
Hello, I figure a 50.70 GiB model should fit on 3 x 3090s (3 x 24 = 72 GB). However, for some reason it hits an out-of-memory error when trying to allocate 17200.03 MiB on device 0 (cudaMalloc). I have no idea what the issue is; maybe I am missing something pretty simple. FYI: this is on RunPod, but I have had similar issues when running on a local machine. Any help is appreciated.
Details below:
root@0e7605def17f:/home/llama.cpp# ./llama-server -m ../dolphin-2.9.2-qwen2-72b-Q5_K_M-00001-of-00002.gguf -n 400 -ngl 33 --host 0.0.0.0 -n -1 --split-mode layer --tensor-split 1,1,1
INFO [ main] build info | tid="138702059261952" timestamp=1720170580 build=3317 commit="8e558309"
INFO [ main] system info | tid="138702059261952" timestamp=1720170580 n_threads=128 n_threads_batch=-1 total_threads=256 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 27 key-value pairs and 963 tensors from ../dolphin-2.9.2-qwen2-72b-Q5_K_M-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.name str = dolphin-2.9.2-qwen2-72b
llama_model_loader: - kv 2: qwen2.block_count u32 = 80
llama_model_loader: - kv 3: qwen2.context_length u32 = 131072
llama_model_loader: - kv 4: qwen2.embedding_length u32 = 8192
llama_model_loader: - kv 5: qwen2.feed_forward_length u32 = 29568
llama_model_loader: - kv 6: qwen2.attention.head_count u32 = 64
llama_model_loader: - kv 7: qwen2.attention.head_count_kv u32 = 8
llama_model_loader: - kv 8: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 9: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 17
llama_model_loader: - kv 11: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 12: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,152064] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 18: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 19: general.quantization_version u32 = 2
llama_model_loader: - kv 20: quantize.imatrix.file str = /models/dolphin-2.9.2-qwen2-72b-GGUF/...
llama_model_loader: - kv 21: quantize.imatrix.dataset str = /training_data/calibration_datav3.txt
llama_model_loader: - kv 22: quantize.imatrix.entries_count i32 = 560
llama_model_loader: - kv 23: quantize.imatrix.chunks_count i32 = 128
llama_model_loader: - kv 24: split.no u16 = 0
llama_model_loader: - kv 25: split.count u16 = 2
llama_model_loader: - kv 26: split.tensors.count i32 = 963
llama_model_loader: - type f32: 401 tensors
llama_model_loader: - type q5_1: 40 tensors
llama_model_loader: - type q8_0: 40 tensors
llama_model_loader: - type q5_K: 441 tensors
llama_model_loader: - type q6_K: 41 tensors
llm_load_vocab: special tokens cache size = 421
llm_load_vocab: token to piece cache size = 0.9352 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 152064
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 29568
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 72.71 B
llm_load_print_meta: model size = 50.70 GiB (5.99 BPW)
llm_load_print_meta: general.name = dolphin-2.9.2-qwen2-72b
llm_load_print_meta: BOS token = 11 ','
llm_load_print_meta: EOS token = 151645 '<|im_end|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: EOT token = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size = 1.69 MiB
llm_load_tensors: offloading 33 repeating layers to GPU
llm_load_tensors: offloaded 33/81 layers to GPU
llm_load_tensors: CPU buffer size = 30157.15 MiB
llm_load_tensors: CPU buffer size = 974.56 MiB
llm_load_tensors: CUDA0 buffer size = 6782.74 MiB
llm_load_tensors: CUDA1 buffer size = 6709.49 MiB
llm_load_tensors: CUDA2 buffer size = 7295.49 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 131072
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 24064.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 5632.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 5632.00 MiB
llama_kv_cache_init: CUDA2 KV buffer size = 5632.00 MiB
llama_new_context_with_model: KV self size = 40960.00 MiB, K (f16): 20480.00 MiB, V (f16): 20480.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 1.16 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 17200.03 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 18035542016
llama_new_context_with_model: failed to allocate compute buffers
llama_init_from_gpt_params: error: failed to create context with model '../dolphin-2.9.2-qwen2-72b-Q5_K_M-00001-of-00002.gguf'
ERR [ load_model] unable to load model | tid="138702059261952" timestamp=1720170626 model="../dolphin-2.9.2-qwen2-72b-Q5_K_M-00001-of-00002.gguf"
munmap_chunk(): invalid pointer
Aborted (core dumped)
root@0e7605def17f:/home/llama.cpp# nvidia-smi
Fri Jul 5 09:15:40 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.78 Driver Version: 550.78 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 On | 00000000:25:00.0 Off | N/A |
| 30% 30C P8 31W / 350W | 1MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 3090 On | 00000000:61:00.0 Off | N/A |
| 30% 25C P8 27W / 350W | 1MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA GeForce RTX 3090 On | 00000000:A1:00.0 Off | N/A |
| 30% 29C P8 40W / 350W | 1MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
root@0e7605def17f:/home/llama.cpp# free -h
total used free shared buff/cache available
Mem: 1.0Ti 34Gi 223Gi 95Mi 749Gi 966Gi
Swap: 0B 0B 0B
root@0e7605def17f:/home/llama.cpp#
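For what it's worth, the numbers in the log above already add up to the failure: with n_ctx left at the 131072-token training context, the f16 KV cache alone is 40960 MiB, and device 0 is asked to hold its weight shard, its KV shard, and a 17200 MiB compute buffer at once. A quick sanity check with values copied from the log (shell arithmetic, just to illustrate where the memory goes):
# KV cache: n_layer * n_ctx * (n_embd_k_gqa + n_embd_v_gqa) * 2 bytes (f16)
echo $(( 80 * 131072 * 2048 * 2 / 1024 / 1024 )) MiB    # -> 40960 MiB, matches "KV self size"
# device 0: weight shard + KV shard + compute buffer vs. a 24576 MiB card
echo $(( 6782 + 5632 + 17200 )) MiB                     # -> 29614 MiB, more than one 3090 holds
Reducing the context with -c, as suggested above, shrinks both the per-device KV shards and the compute buffer.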