Skip to content

Gibberish output when using DeepSeek-V3-0324-IQ2_K_R4 on mixed CPU + 4 GPUs with -mla (1 or 2) #305

@Panchovix

Description

@Panchovix

HI there, thanks for your work!

I have found, from this reddit post https://www.reddit.com/r/LocalLLaMA/comments/1joyl9t/new_gguf_quants_of_v30324/, about some new quants of ik_llamacpp

My system consits of a AMD Ryzen 7 7800X3D, 192GB RAM, RTX 5090, RTX 4090x2 and an RTX A6000. OS is Fedora 41.

The model used is https://huggingface.co/ubergarm/DeepSeek-V3-0324-GGUF/tree/main/DeepSeek-V3-0324-IQ2_K_R4

I'm running it with

/llama-server -m '/DeepSeek-V3-0324-IQ2_K_R4-00001-of-00005.gguf' -c 8192 -ngl 27 -ts 17,20,21,45 --no-warmup -mla 2 (or -mla 1)

I did build ik_llama.cpp with

cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DGGML_CUDA_F16=ON -DGGML_IQK_FA_ALL_QUANTS=1

The issue seems to be that, when trying to generate with any prompt, the output is gibberish (just DDDDDD)

Image

Log is this one

/build/bin$ ./llama-server -m '/GGUFs/DeepSeek-V3-0324-IQ2_K_R4-00001-of-00005.gguf' -c 8192 -ngl 27 -ts 17,20,21,45 --no-warmup -mla 2
INFO [                    main] build info | tid="140255828869120" timestamp=1743549988 build=3618 commit="6d405d1f"
INFO [                    main] system info | tid="140255828869120" timestamp=1743549988 n_threads=8 n_threads_batch=-1 total_threads=16 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
llama_model_loader: additional 4 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 53 key-value pairs and 1147 tensors from /DeepSeek-V3-0324-IQ2_K_R4-00001-of-00005.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek V3 0324
llama_model_loader: - kv   3:                            general.version str              = V3-0324
llama_model_loader: - kv   4:                           general.basename str              = DeepSeek
llama_model_loader: - kv   5:                         general.size_label str              = 256x21B
llama_model_loader: - kv   6:                            general.license str              = mit
llama_model_loader: - kv   7:                      deepseek2.block_count u32              = 61
llama_model_loader: - kv   8:                   deepseek2.context_length u32              = 163840
llama_model_loader: - kv   9:                 deepseek2.embedding_length u32              = 7168
llama_model_loader: - kv  10:              deepseek2.feed_forward_length u32              = 18432
llama_model_loader: - kv  11:             deepseek2.attention.head_count u32              = 128
llama_model_loader: - kv  12:          deepseek2.attention.head_count_kv u32              = 128
llama_model_loader: - kv  13:                   deepseek2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  14: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  15:                deepseek2.expert_used_count u32              = 8
llama_model_loader: - kv  16:                          general.file_type u32              = 338
llama_model_loader: - kv  17:        deepseek2.leading_dense_block_count u32              = 3
llama_model_loader: - kv  18:                       deepseek2.vocab_size u32              = 129280
llama_model_loader: - kv  19:            deepseek2.attention.q_lora_rank u32              = 1536
llama_model_loader: - kv  20:           deepseek2.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  21:             deepseek2.attention.key_length u32              = 192
llama_model_loader: - kv  22:           deepseek2.attention.value_length u32              = 128
llama_model_loader: - kv  23:       deepseek2.expert_feed_forward_length u32              = 2048
llama_model_loader: - kv  24:                     deepseek2.expert_count u32              = 256
llama_model_loader: - kv  25:              deepseek2.expert_shared_count u32              = 1
llama_model_loader: - kv  26:             deepseek2.expert_weights_scale f32              = 2.500000
llama_model_loader: - kv  27:              deepseek2.expert_weights_norm bool             = true
llama_model_loader: - kv  28:               deepseek2.expert_gating_func u32              = 2
llama_model_loader: - kv  29:             deepseek2.rope.dimension_count u32              = 64
llama_model_loader: - kv  30:                deepseek2.rope.scaling.type str              = yarn
llama_model_loader: - kv  31:              deepseek2.rope.scaling.factor f32              = 40.000000
llama_model_loader: - kv  32: deepseek2.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  33: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.100000
llama_model_loader: - kv  34:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  35:                         tokenizer.ggml.pre str              = deepseek-v3
llama_model_loader: - kv  36:                      tokenizer.ggml.tokens arr[str,129280]  = ["<|begin▁of▁sentence|>", "<�...
llama_model_loader: - kv  37:                  tokenizer.ggml.token_type arr[i32,129280]  = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  38:                      tokenizer.ggml.merges arr[str,127741]  = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
llama_model_loader: - kv  39:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  40:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  41:            tokenizer.ggml.padding_token_id u32              = 1
llama_model_loader: - kv  42:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  43:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  44:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  45:               general.quantization_version u32              = 2
llama_model_loader: - kv  46:                      quantize.imatrix.file str              = /mnt/raid/models/ubergarm/DeepSeek-V3...
llama_model_loader: - kv  47:                   quantize.imatrix.dataset str              = calibration_data_v5_rc.txt
llama_model_loader: - kv  48:             quantize.imatrix.entries_count i32              = 720
llama_model_loader: - kv  49:              quantize.imatrix.chunks_count i32              = 213
llama_model_loader: - kv  50:                                   split.no u16              = 0
llama_model_loader: - kv  51:                                split.count u16              = 5
llama_model_loader: - kv  52:                        split.tensors.count i32              = 1147
llama_model_loader: - type  f32:  361 tensors
llama_model_loader: - type q8_0:  612 tensors
llama_model_loader: - type iq2_k_r4:  116 tensors
llama_model_loader: - type iq3_k_r4:   58 tensors
llm_load_vocab: special tokens cache size = 818
llm_load_vocab: token to piece cache size = 0.8223 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = deepseek2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 129280
llm_load_print_meta: n_merges         = 127741
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 163840
llm_load_print_meta: n_embd           = 7168
llm_load_print_meta: n_layer          = 61
llm_load_print_meta: n_head           = 128
llm_load_print_meta: n_head_kv        = 128
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 192
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 24576
llm_load_print_meta: n_embd_v_gqa     = 16384
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 18432
llm_load_print_meta: n_expert         = 256
llm_load_print_meta: n_expert_used    = 8
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = yarn
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 0.025
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 671B
llm_load_print_meta: model ftype      = IQ2_K_R4 - 2.375 bpw
llm_load_print_meta: model params     = 672.050 B
llm_load_print_meta: model size       = 226.003 GiB (2.889 BPW) 
llm_load_print_meta: repeating layers = 224.169 GiB (2.873 BPW, 670.196 B parameters)
llm_load_print_meta: general.name     = DeepSeek V3 0324
llm_load_print_meta: BOS token        = 0 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token        = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token        = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token         = 131 'Ä'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_layer_dense_lead   = 3
llm_load_print_meta: n_lora_q             = 1536
llm_load_print_meta: n_lora_kv            = 512
llm_load_print_meta: n_ff_exp             = 2048
llm_load_print_meta: n_expert_shared      = 1
llm_load_print_meta: expert_weights_scale = 2.5
llm_load_print_meta: expert_weights_norm  = 1
llm_load_print_meta: expert_gating_func   = sigmoid
llm_load_print_meta: rope_yarn_log_mul    = 0.1000
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 2: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
  Device 3: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    2.34 MiB
llm_load_tensors: offloading 27 repeating layers to GPU
llm_load_tensors: offloaded 27/62 layers to GPU
llm_load_tensors:        CPU buffer size = 46211.13 MiB
llm_load_tensors:        CPU buffer size = 47115.34 MiB
llm_load_tensors:        CPU buffer size = 31151.98 MiB
llm_load_tensors:        CPU buffer size =  4607.07 MiB
llm_load_tensors:      CUDA0 buffer size = 19631.39 MiB
llm_load_tensors:      CUDA1 buffer size = 19631.39 MiB
llm_load_tensors:      CUDA2 buffer size = 23557.67 MiB
llm_load_tensors:      CUDA3 buffer size = 43189.07 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: mla_attn   = 2
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe  = 0
llama_new_context_with_model: ser        = -1, 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init: layer 0: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 1: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 2: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 3: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 4: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 5: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 6: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 7: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 8: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 9: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 10: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 11: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 12: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 13: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 14: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 15: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 16: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 17: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 18: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 19: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 20: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 21: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 22: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 23: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 24: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 25: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 26: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 27: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 28: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 29: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 30: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 31: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 32: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 33: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 34: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 35: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 36: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 37: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 38: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 39: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 40: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 41: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 42: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 43: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 44: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 45: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 46: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 47: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 48: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 49: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 50: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 51: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 52: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 53: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 54: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 55: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 56: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 57: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 58: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 59: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 60: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init:  CUDA_Host KV buffer size =   306.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =    45.00 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =    45.00 MiB
llama_kv_cache_init:      CUDA2 KV buffer size =    54.00 MiB
llama_kv_cache_init:      CUDA3 KV buffer size =    99.00 MiB
llama_new_context_with_model: KV self size  =  549.00 MiB, c^KV (f16):  549.00 MiB, kv^T: not used
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.99 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  2484.78 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =  2491.50 MiB
llama_new_context_with_model:      CUDA2 compute buffer size =  2491.50 MiB
llama_new_context_with_model:      CUDA3 compute buffer size =  2491.50 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =  2634.50 MiB
llama_new_context_with_model: graph nodes  = 3724
llama_new_context_with_model: graph splits = 707
INFO [                    init] initializing slots | tid="140255828869120" timestamp=1743550245 n_slots=1
INFO [                    init] new slot | tid="140255828869120" timestamp=1743550245 id_slot=0 n_ctx_slot=8192
INFO [                    main] model loaded | tid="140255828869120" timestamp=1743550245
INFO [                    main] chat template | tid="140255828869120" timestamp=1743550245 chat_example="You are a helpful assistant\n\n<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>" built_in=true
INFO [                    main] HTTP server listening | tid="140255828869120" timestamp=1743550245 n_threads_http="15" port="8080" hostname="127.0.0.1"
INFO [            update_slots] all slots are idle | tid="140255828869120" timestamp=1743550245
INFO [      log_server_request] request | tid="140133399519232" timestamp=1743550253 remote_addr="127.0.0.1" remote_port=51170 status=200 method="GET" path="/" params={}
INFO [      log_server_request] request | tid="140133399519232" timestamp=1743550253 remote_addr="127.0.0.1" remote_port=51170 status=200 method="GET" path="/index.js" params={}
INFO [      log_server_request] request | tid="140133391126528" timestamp=1743550253 remote_addr="127.0.0.1" remote_port=51186 status=200 method="GET" path="/completion.js" params={}
INFO [      log_server_request] request | tid="140133399519232" timestamp=1743550253 remote_addr="127.0.0.1" remote_port=51170 status=200 method="GET" path="/json-schema-to-grammar.mjs" params={}
INFO [      log_server_request] request | tid="140133399519232" timestamp=1743550254 remote_addr="127.0.0.1" remote_port=51170 status=404 method="GET" path="/favicon.ico" params={}
INFO [      log_server_request] request | tid="140133307248640" timestamp=1743550263 remote_addr="127.0.0.1" remote_port=33660 status=200 method="GET" path="/index-new.html" params={}
INFO [      log_server_request] request | tid="140133307248640" timestamp=1743550263 remote_addr="127.0.0.1" remote_port=33660 status=200 method="GET" path="/style.css" params={}
INFO [      log_server_request] request | tid="140133298855936" timestamp=1743550263 remote_addr="127.0.0.1" remote_port=33670 status=200 method="GET" path="/index.js" params={}
INFO [      log_server_request] request | tid="140133290463232" timestamp=1743550263 remote_addr="127.0.0.1" remote_port=33686 status=200 method="GET" path="/completion.js" params={}
INFO [      log_server_request] request | tid="140133282070528" timestamp=1743550263 remote_addr="127.0.0.1" remote_port=33696 status=200 method="GET" path="/json-schema-to-grammar.mjs" params={}
INFO [      log_server_request] request | tid="140133273677824" timestamp=1743550263 remote_addr="127.0.0.1" remote_port=33704 status=200 method="GET" path="/prompt-formats.js" params={}
INFO [      log_server_request] request | tid="140133265285120" timestamp=1743550263 remote_addr="127.0.0.1" remote_port=33718 status=200 method="GET" path="/system-prompts.js" params={}
INFO [      log_server_request] request | tid="140133307248640" timestamp=1743550263 remote_addr="127.0.0.1" remote_port=33660 status=200 method="GET" path="/colorthemes.css" params={}
INFO [      log_server_request] request | tid="140133307248640" timestamp=1743550263 remote_addr="127.0.0.1" remote_port=33660 status=200 method="GET" path="/theme-snowstorm.css" params={}
INFO [      log_server_request] request | tid="140133273677824" timestamp=1743550263 remote_addr="127.0.0.1" remote_port=33704 status=200 method="GET" path="/theme-polarnight.css" params={}
INFO [      log_server_request] request | tid="140133290463232" timestamp=1743550263 remote_addr="127.0.0.1" remote_port=33686 status=200 method="GET" path="/theme-ketivah.css" params={}
INFO [      log_server_request] request | tid="140133298855936" timestamp=1743550263 remote_addr="127.0.0.1" remote_port=33670 status=200 method="GET" path="/theme-mangotango.css" params={}
INFO [      log_server_request] request | tid="140133265285120" timestamp=1743550263 remote_addr="127.0.0.1" remote_port=33718 status=200 method="GET" path="/theme-playground.css" params={}
INFO [      log_server_request] request | tid="140133282070528" timestamp=1743550263 remote_addr="127.0.0.1" remote_port=33696 status=200 method="GET" path="/theme-beeninorder.css" params={}
INFO [      log_server_request] request | tid="140133282070528" timestamp=1743550267 remote_addr="127.0.0.1" remote_port=33696 status=200 method="GET" path="/" params={}
INFO [      log_server_request] request | tid="140133282070528" timestamp=1743550267 remote_addr="127.0.0.1" remote_port=33696 status=200 method="GET" path="/index.js" params={}
INFO [      log_server_request] request | tid="140133307248640" timestamp=1743550267 remote_addr="127.0.0.1" remote_port=33660 status=200 method="GET" path="/completion.js" params={}
INFO [      log_server_request] request | tid="140133273677824" timestamp=1743550267 remote_addr="127.0.0.1" remote_port=33704 status=200 method="GET" path="/json-schema-to-grammar.mjs" params={}
INFO [   launch_slot_with_task] slot is processing task | tid="140255828869120" timestamp=1743550272 id_slot=0 id_task=0
INFO [            update_slots] kv cache rm [p0, end) | tid="140255828869120" timestamp=1743550273 id_slot=0 id_task=0 p0=0

Maybe I'm using the flag incorrectly, or I didn't build ik_llama.cpp correctly?

When not using -mla, model seems to work normally, abeit slower than UD_Q2_K_XL (https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF/tree/main/UD-Q2_K_XL)

EDIT: To note that other models have the same issue (like the mentioned above), but those probably aren't expected to work since they aren't quanted with ik_llama.cpp

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions