Description
Hi there, thanks for your work!
I found out about some new quants for ik_llama.cpp from this Reddit post: https://www.reddit.com/r/LocalLLaMA/comments/1joyl9t/new_gguf_quants_of_v30324/
My system consists of an AMD Ryzen 7 7800X3D, 192GB RAM, an RTX 5090, two RTX 4090s, and an RTX A6000. The OS is Fedora 41.
The model used is https://huggingface.co/ubergarm/DeepSeek-V3-0324-GGUF/tree/main/DeepSeek-V3-0324-IQ2_K_R4
I'm running it with
/llama-server -m '/DeepSeek-V3-0324-IQ2_K_R4-00001-of-00005.gguf' -c 8192 -ngl 27 -ts 17,20,21,45 --no-warmup -mla 2
(or -mla 1)
I built ik_llama.cpp with
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DGGML_CUDA_F16=ON -DGGML_IQK_FA_ALL_QUANTS=1
The issue is that, for any prompt, the generated output is gibberish (just DDDDDD).
Here is the log:
/build/bin$ ./llama-server -m '/GGUFs/DeepSeek-V3-0324-IQ2_K_R4-00001-of-00005.gguf' -c 8192 -ngl 27 -ts 17,20,21,45 --no-warmup -mla 2
INFO [ main] build info | tid="140255828869120" timestamp=1743549988 build=3618 commit="6d405d1f"
INFO [ main] system info | tid="140255828869120" timestamp=1743549988 n_threads=8 n_threads_batch=-1 total_threads=16 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
llama_model_loader: additional 4 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 53 key-value pairs and 1147 tensors from /DeepSeek-V3-0324-IQ2_K_R4-00001-of-00005.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = deepseek2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = DeepSeek V3 0324
llama_model_loader: - kv 3: general.version str = V3-0324
llama_model_loader: - kv 4: general.basename str = DeepSeek
llama_model_loader: - kv 5: general.size_label str = 256x21B
llama_model_loader: - kv 6: general.license str = mit
llama_model_loader: - kv 7: deepseek2.block_count u32 = 61
llama_model_loader: - kv 8: deepseek2.context_length u32 = 163840
llama_model_loader: - kv 9: deepseek2.embedding_length u32 = 7168
llama_model_loader: - kv 10: deepseek2.feed_forward_length u32 = 18432
llama_model_loader: - kv 11: deepseek2.attention.head_count u32 = 128
llama_model_loader: - kv 12: deepseek2.attention.head_count_kv u32 = 128
llama_model_loader: - kv 13: deepseek2.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 14: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 15: deepseek2.expert_used_count u32 = 8
llama_model_loader: - kv 16: general.file_type u32 = 338
llama_model_loader: - kv 17: deepseek2.leading_dense_block_count u32 = 3
llama_model_loader: - kv 18: deepseek2.vocab_size u32 = 129280
llama_model_loader: - kv 19: deepseek2.attention.q_lora_rank u32 = 1536
llama_model_loader: - kv 20: deepseek2.attention.kv_lora_rank u32 = 512
llama_model_loader: - kv 21: deepseek2.attention.key_length u32 = 192
llama_model_loader: - kv 22: deepseek2.attention.value_length u32 = 128
llama_model_loader: - kv 23: deepseek2.expert_feed_forward_length u32 = 2048
llama_model_loader: - kv 24: deepseek2.expert_count u32 = 256
llama_model_loader: - kv 25: deepseek2.expert_shared_count u32 = 1
llama_model_loader: - kv 26: deepseek2.expert_weights_scale f32 = 2.500000
llama_model_loader: - kv 27: deepseek2.expert_weights_norm bool = true
llama_model_loader: - kv 28: deepseek2.expert_gating_func u32 = 2
llama_model_loader: - kv 29: deepseek2.rope.dimension_count u32 = 64
llama_model_loader: - kv 30: deepseek2.rope.scaling.type str = yarn
llama_model_loader: - kv 31: deepseek2.rope.scaling.factor f32 = 40.000000
llama_model_loader: - kv 32: deepseek2.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 33: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
llama_model_loader: - kv 34: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 35: tokenizer.ggml.pre str = deepseek-v3
llama_model_loader: - kv 36: tokenizer.ggml.tokens arr[str,129280] = ["<|begin▁of▁sentence|>", "<�...
llama_model_loader: - kv 37: tokenizer.ggml.token_type arr[i32,129280] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 38: tokenizer.ggml.merges arr[str,127741] = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
llama_model_loader: - kv 39: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 40: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv 41: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 42: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 43: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 44: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 45: general.quantization_version u32 = 2
llama_model_loader: - kv 46: quantize.imatrix.file str = /mnt/raid/models/ubergarm/DeepSeek-V3...
llama_model_loader: - kv 47: quantize.imatrix.dataset str = calibration_data_v5_rc.txt
llama_model_loader: - kv 48: quantize.imatrix.entries_count i32 = 720
llama_model_loader: - kv 49: quantize.imatrix.chunks_count i32 = 213
llama_model_loader: - kv 50: split.no u16 = 0
llama_model_loader: - kv 51: split.count u16 = 5
llama_model_loader: - kv 52: split.tensors.count i32 = 1147
llama_model_loader: - type f32: 361 tensors
llama_model_loader: - type q8_0: 612 tensors
llama_model_loader: - type iq2_k_r4: 116 tensors
llama_model_loader: - type iq3_k_r4: 58 tensors
llm_load_vocab: special tokens cache size = 818
llm_load_vocab: token to piece cache size = 0.8223 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = deepseek2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 129280
llm_load_print_meta: n_merges = 127741
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 163840
llm_load_print_meta: n_embd = 7168
llm_load_print_meta: n_layer = 61
llm_load_print_meta: n_head = 128
llm_load_print_meta: n_head_kv = 128
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 192
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 24576
llm_load_print_meta: n_embd_v_gqa = 16384
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 18432
llm_load_print_meta: n_expert = 256
llm_load_print_meta: n_expert_used = 8
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = yarn
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 0.025
llm_load_print_meta: n_ctx_orig_yarn = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 671B
llm_load_print_meta: model ftype = IQ2_K_R4 - 2.375 bpw
llm_load_print_meta: model params = 672.050 B
llm_load_print_meta: model size = 226.003 GiB (2.889 BPW)
llm_load_print_meta: repeating layers = 224.169 GiB (2.873 BPW, 670.196 B parameters)
llm_load_print_meta: general.name = DeepSeek V3 0324
llm_load_print_meta: BOS token = 0 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token = 131 'Ä'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_layer_dense_lead = 3
llm_load_print_meta: n_lora_q = 1536
llm_load_print_meta: n_lora_kv = 512
llm_load_print_meta: n_ff_exp = 2048
llm_load_print_meta: n_expert_shared = 1
llm_load_print_meta: expert_weights_scale = 2.5
llm_load_print_meta: expert_weights_norm = 1
llm_load_print_meta: expert_gating_func = sigmoid
llm_load_print_meta: rope_yarn_log_mul = 0.1000
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 2: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
Device 3: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size = 2.34 MiB
llm_load_tensors: offloading 27 repeating layers to GPU
llm_load_tensors: offloaded 27/62 layers to GPU
llm_load_tensors: CPU buffer size = 46211.13 MiB
llm_load_tensors: CPU buffer size = 47115.34 MiB
llm_load_tensors: CPU buffer size = 31151.98 MiB
llm_load_tensors: CPU buffer size = 4607.07 MiB
llm_load_tensors: CUDA0 buffer size = 19631.39 MiB
llm_load_tensors: CUDA1 buffer size = 19631.39 MiB
llm_load_tensors: CUDA2 buffer size = 23557.67 MiB
llm_load_tensors: CUDA3 buffer size = 43189.07 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: mla_attn = 2
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe = 0
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init: layer 0: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 1: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 2: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 3: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 4: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 5: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 6: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 7: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 8: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 9: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 10: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 11: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 12: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 13: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 14: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 15: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 16: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 17: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 18: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 19: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 20: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 21: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 22: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 23: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 24: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 25: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 26: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 27: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 28: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 29: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 30: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 31: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 32: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 33: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 34: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 35: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 36: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 37: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 38: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 39: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 40: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 41: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 42: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 43: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 44: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 45: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 46: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 47: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 48: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 49: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 50: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 51: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 52: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 53: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 54: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 55: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 56: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 57: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 58: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 59: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 60: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: CUDA_Host KV buffer size = 306.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 45.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 45.00 MiB
llama_kv_cache_init: CUDA2 KV buffer size = 54.00 MiB
llama_kv_cache_init: CUDA3 KV buffer size = 99.00 MiB
llama_new_context_with_model: KV self size = 549.00 MiB, c^KV (f16): 549.00 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 0.99 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 2484.78 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 2491.50 MiB
llama_new_context_with_model: CUDA2 compute buffer size = 2491.50 MiB
llama_new_context_with_model: CUDA3 compute buffer size = 2491.50 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 2634.50 MiB
llama_new_context_with_model: graph nodes = 3724
llama_new_context_with_model: graph splits = 707
INFO [ init] initializing slots | tid="140255828869120" timestamp=1743550245 n_slots=1
INFO [ init] new slot | tid="140255828869120" timestamp=1743550245 id_slot=0 n_ctx_slot=8192
INFO [ main] model loaded | tid="140255828869120" timestamp=1743550245
INFO [ main] chat template | tid="140255828869120" timestamp=1743550245 chat_example="You are a helpful assistant\n\n<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>" built_in=true
INFO [ main] HTTP server listening | tid="140255828869120" timestamp=1743550245 n_threads_http="15" port="8080" hostname="127.0.0.1"
INFO [ update_slots] all slots are idle | tid="140255828869120" timestamp=1743550245
INFO [ log_server_request] request | tid="140133399519232" timestamp=1743550253 remote_addr="127.0.0.1" remote_port=51170 status=200 method="GET" path="/" params={}
INFO [ log_server_request] request | tid="140133399519232" timestamp=1743550253 remote_addr="127.0.0.1" remote_port=51170 status=200 method="GET" path="/index.js" params={}
INFO [ log_server_request] request | tid="140133391126528" timestamp=1743550253 remote_addr="127.0.0.1" remote_port=51186 status=200 method="GET" path="/completion.js" params={}
INFO [ log_server_request] request | tid="140133399519232" timestamp=1743550253 remote_addr="127.0.0.1" remote_port=51170 status=200 method="GET" path="/json-schema-to-grammar.mjs" params={}
INFO [ log_server_request] request | tid="140133399519232" timestamp=1743550254 remote_addr="127.0.0.1" remote_port=51170 status=404 method="GET" path="/favicon.ico" params={}
INFO [ log_server_request] request | tid="140133307248640" timestamp=1743550263 remote_addr="127.0.0.1" remote_port=33660 status=200 method="GET" path="/index-new.html" params={}
INFO [ log_server_request] request | tid="140133307248640" timestamp=1743550263 remote_addr="127.0.0.1" remote_port=33660 status=200 method="GET" path="/style.css" params={}
INFO [ log_server_request] request | tid="140133298855936" timestamp=1743550263 remote_addr="127.0.0.1" remote_port=33670 status=200 method="GET" path="/index.js" params={}
INFO [ log_server_request] request | tid="140133290463232" timestamp=1743550263 remote_addr="127.0.0.1" remote_port=33686 status=200 method="GET" path="/completion.js" params={}
INFO [ log_server_request] request | tid="140133282070528" timestamp=1743550263 remote_addr="127.0.0.1" remote_port=33696 status=200 method="GET" path="/json-schema-to-grammar.mjs" params={}
INFO [ log_server_request] request | tid="140133273677824" timestamp=1743550263 remote_addr="127.0.0.1" remote_port=33704 status=200 method="GET" path="/prompt-formats.js" params={}
INFO [ log_server_request] request | tid="140133265285120" timestamp=1743550263 remote_addr="127.0.0.1" remote_port=33718 status=200 method="GET" path="/system-prompts.js" params={}
INFO [ log_server_request] request | tid="140133307248640" timestamp=1743550263 remote_addr="127.0.0.1" remote_port=33660 status=200 method="GET" path="/colorthemes.css" params={}
INFO [ log_server_request] request | tid="140133307248640" timestamp=1743550263 remote_addr="127.0.0.1" remote_port=33660 status=200 method="GET" path="/theme-snowstorm.css" params={}
INFO [ log_server_request] request | tid="140133273677824" timestamp=1743550263 remote_addr="127.0.0.1" remote_port=33704 status=200 method="GET" path="/theme-polarnight.css" params={}
INFO [ log_server_request] request | tid="140133290463232" timestamp=1743550263 remote_addr="127.0.0.1" remote_port=33686 status=200 method="GET" path="/theme-ketivah.css" params={}
INFO [ log_server_request] request | tid="140133298855936" timestamp=1743550263 remote_addr="127.0.0.1" remote_port=33670 status=200 method="GET" path="/theme-mangotango.css" params={}
INFO [ log_server_request] request | tid="140133265285120" timestamp=1743550263 remote_addr="127.0.0.1" remote_port=33718 status=200 method="GET" path="/theme-playground.css" params={}
INFO [ log_server_request] request | tid="140133282070528" timestamp=1743550263 remote_addr="127.0.0.1" remote_port=33696 status=200 method="GET" path="/theme-beeninorder.css" params={}
INFO [ log_server_request] request | tid="140133282070528" timestamp=1743550267 remote_addr="127.0.0.1" remote_port=33696 status=200 method="GET" path="/" params={}
INFO [ log_server_request] request | tid="140133282070528" timestamp=1743550267 remote_addr="127.0.0.1" remote_port=33696 status=200 method="GET" path="/index.js" params={}
INFO [ log_server_request] request | tid="140133307248640" timestamp=1743550267 remote_addr="127.0.0.1" remote_port=33660 status=200 method="GET" path="/completion.js" params={}
INFO [ log_server_request] request | tid="140133273677824" timestamp=1743550267 remote_addr="127.0.0.1" remote_port=33704 status=200 method="GET" path="/json-schema-to-grammar.mjs" params={}
INFO [ launch_slot_with_task] slot is processing task | tid="140255828869120" timestamp=1743550272 id_slot=0 id_task=0
INFO [ update_slots] kv cache rm [p0, end) | tid="140255828869120" timestamp=1743550273 id_slot=0 id_task=0 p0=0
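As a sanity check on the -ts 17,20,21,45 split: the per-GPU weight buffer sizes in the log are consistent with 5, 5, 6 and 11 layers on CUDA0..CUDA3 (27 in total). Below is a minimal Python sketch of how such a split could come about, assuming each offloaded layer is assigned by comparing its relative position against the cumulative, normalised -ts fractions. Upstream llama.cpp does something along these lines, but I have not verified the exact code path in ik_llama.cpp, and the function name `layers_per_device` is made up for illustration.

```python
from bisect import bisect_right
from itertools import accumulate


def layers_per_device(tensor_split, n_gpu_layers):
    """Count offloaded layers per device for a -ts ratio list (assumed scheme)."""
    total = sum(tensor_split)
    # Cumulative normalised boundaries, roughly [0.165, 0.359, 0.563, 1.0] here
    bounds = [c / total for c in accumulate(tensor_split)]
    counts = [0] * len(tensor_split)
    for k in range(n_gpu_layers):
        # Pick the first device whose upper boundary lies strictly above
        # this layer's relative position k / n_gpu_layers
        dev = bisect_right(bounds, k / n_gpu_layers)
        counts[dev] += 1
    return counts


counts = layers_per_device([17, 20, 21, 45], 27)
print(counts)  # [5, 5, 6, 11]

# Weight buffer sizes (MiB) reported by llm_load_tensors for CUDA0..CUDA3
buffers_mib = [19631.39, 19631.39, 23557.67, 43189.07]
# Size per layer on each device; all four come out at about 3926.3 MiB
print([round(b / c, 1) for b, c in zip(buffers_mib, counts)])
```

Every device ends up with the same size per layer, so the layer distribution itself looks fine to me.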
Maybe I'm using the flag incorrectly, or I didn't build ik_llama.cpp correctly?
Without -mla, the model seems to work normally, albeit slower than UD-Q2_K_XL (https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF/tree/main/UD-Q2_K_XL)
EDIT: Note that other models (like the one mentioned above) have the same issue, but those probably aren't expected to work since they weren't quantized with ik_llama.cpp.
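For what it's worth, the log does suggest that -mla 2 was picked up: it prints mla_attn = 2, and the reported KV cache size matches a cache that stores only kv_lora_rank + n_embd_head_qk_rope values per token per layer. Here is a small Python check of that arithmetic, assuming 2-byte f16 cache entries (as the "c^KV (f16)" label in the log indicates). The function name `mla_kv_cache_mib` is made up for illustration.

```python
def mla_kv_cache_mib(n_layers, n_ctx, kv_lora_rank=512, n_rope=64, bytes_per_elem=2):
    """MiB needed to store (kv_lora_rank + n_rope) values per token per layer."""
    n_bytes = n_layers * n_ctx * (kv_lora_rank + n_rope) * bytes_per_elem
    return n_bytes / (1024 * 1024)


# Whole model: 61 layers at -c 8192
print(mla_kv_cache_mib(61, 8192))  # 549.0, same as "KV self size = 549.00 MiB"

# Per buffer, assuming 34 layers stay on the host and 5/5/6/11 go to CUDA0..CUDA3
for name, layers in [("CUDA_Host", 34), ("CUDA0", 5), ("CUDA1", 5),
                     ("CUDA2", 6), ("CUDA3", 11)]:
    print(name, mla_kv_cache_mib(layers, 8192))
```

These come out at 306, 45, 45, 54 and 99 MiB, the same as the llama_kv_cache_init lines, so the cache layout at least looks like what I'd expect with MLA enabled.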