Possible numerical stability issue with experimental quant of DeepSeek-V3-0324?

## tl;dr;
*UPDATE*: skip to the end, I probably shouldn't use `q8_0_r8` for `token_embd.weight` and just leave that `q8_0`.

I cooked up a `DeepSeek-V3-0324` quant specificly for CPU only inferencing on the xeon 6980P rig and am getting very large perplexity values and broken llama-server responses.

Not sure if user error, an invalid recipe, or if there is some issue with computing one of the quant types etc.

## Details

This was my intended recipe mix:

* `q8_0_r8` for all the embeddings, attention, norms, bias, and shared experts tensors
* `q5_k_r4` for all routed MoE down projection tensors
* `q4_k_r4` for all routed MoE gate/up tensors

This is what is reported when starting up with it:
```
llama_model_loader: - type  f32:  361 tensors
llama_model_loader: - type q8_0_r8:  612 tensors
llama_model_loader: - type iq4_k_r4:  116 tensors
llama_model_loader: - type iq5_k_r4:   58 tensors
```

I'm not 100% sure if the issue could be with the `q5_k_r4` or `q4_k_r4` inferencing CPU computation possibly? Or maybe I messed up somewhere in my scripts.

Potentially relevent topics:
1. Recent PR292 seems to have fixed the previous issue with `q8_0` numerical stability.
2. I asked @saood06 as he has been experimenting with these quants in [our discussion here](https://github.com/ikawrakow/ik_llama.cpp/discussions/286#discussioncomment-12668598).

## Logs

I've provided logs of quantization, perplexity, and llama-server below for reference.

Everything rebuilt and run on updated `ik_llama.cpp/main@4819257c`.

<details>

<summary>Quantization Procedure</summary>

#### Quantization Recipe Script
```bash
#!/usr/bin/env bash

custom="
# Token embedding and output tensors
token_embd\.weight=q8_0_r8
output\.weight=q8_0_r8
output_norm\.weight=q8_0_r8

# First 3 dense layers (0-3)
blk\.[0-2]\..*=q8_0_r8

# All attention, norm weights, and bias tensors for MoE layers (3-60)
blk\.[3-9]\.attn_.*=q8_0_r8
blk\.[1-5][0-9]\.attn_.*=q8_0_r8
blk\.60\.attn_.*=q8_0_r8

blk\.[3-9]\.ffn_norm\.weight=q8_0_r8
blk\.[1-5][0-9]\.ffn_norm\.weight=q8_0_r8
blk\.60\.ffn_norm\.weight=q8_0_r8

blk\.[3-9]\.exp_probs_b\.bias=q8_0_r8
blk\.[1-5][0-9]\.exp_probs_b\.bias=q8_0_r8
blk\.60\.exp_probs_b\.bias=q8_0_r8

# Shared Experts (3-60)
blk\.[3-9]\.ffn_down_shexp\.weight=q8_0_r8
blk\.[1-5][0-9]\.ffn_down_shexp\.weight=q8_0_r8
blk\.60\.ffn_down_shexp\.weight=q8_0_r8

blk\.[3-9]\.ffn_(gate|up)_shexp\.weight=q8_0_r8
blk\.[1-5][0-9]\.ffn_(gate|up)_shexp\.weight=q8_0_r8
blk\.60\.ffn_(gate|up)_shexp\.weight=q8_0_r8

# MoE Experts (3-60)
blk\.[3-9]\.ffn_down_exps\.weight=iq5_k_r4
blk\.[1-5][0-9]\.ffn_down_exps\.weight=iq5_k_r4
blk\.60\.ffn_down_exps\.weight=iq5_k_r4

blk\.[3-9]\.ffn_(gate|up)_exps\.weight=iq4_k_r4
blk\.[1-5][0-9]\.ffn_(gate|up)_exps\.weight=iq4_k_r4
blk\.60\.ffn_(gate|up)_exps\.weight=iq4_k_r4
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --imatrix /mnt/raid/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324.imatrix \
    --token-embedding-type q8_0_r8 \
    --output-tensor-type q8_0_r8 \
    --custom-q "$custom" \
    /mnt/raid/models/deepseek-ai/DeepSeek-V3-0324-bf16-GGUF/DeepSeek-256x21B-V3-0324-BF16-00001-of-00030.gguf \
    /mnt/raid/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-CPU-IQ4_K_R4.gguf \
    IQ4_K_R4 \
    24
```

#### Output Logs
```bash
main: build = 3613 (4819257c)
main: built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
main: quantizing '/mnt/raid/models/deepseek-ai/DeepSeek-V3-0324-bf16-GGUF/DeepSeek-256x21B-V3-0324-BF16-00001-of-00030.gguf' to '/mnt/raid/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-CPU-IQ4_K_R4.gguf' as IQ4_K_R4 using 24 threads
llama_model_loader: additional 29 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 49 key-value pairs and 1147 tensors from /mnt/raid/models/deepseek-ai/DeepSeek-V3-0324-bf16-GGUF/DeepSeek-256x21B-V3-0324-BF16-00001-of-00030.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek V3 0324
llama_model_loader: - kv   3:                            general.version str              = V3-0324
llama_model_loader: - kv   4:                           general.basename str              = DeepSeek
llama_model_loader: - kv   5:                         general.size_label str              = 256x21B
llama_model_loader: - kv   6:                            general.license str              = mit
llama_model_loader: - kv   7:                      deepseek2.block_count u32              = 61
llama_model_loader: - kv   8:                   deepseek2.context_length u32              = 163840
llama_model_loader: - kv   9:                 deepseek2.embedding_length u32              = 7168
llama_model_loader: - kv  10:              deepseek2.feed_forward_length u32              = 18432
llama_model_loader: - kv  11:             deepseek2.attention.head_count u32              = 128
llama_model_loader: - kv  12:          deepseek2.attention.head_count_kv u32              = 128
llama_model_loader: - kv  13:                   deepseek2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  14: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  15:                deepseek2.expert_used_count u32              = 8
llama_model_loader: - kv  16:                          general.file_type u32              = 32
llama_model_loader: - kv  17:        deepseek2.leading_dense_block_count u32              = 3
llama_model_loader: - kv  18:                       deepseek2.vocab_size u32              = 129280
llama_model_loader: - kv  19:            deepseek2.attention.q_lora_rank u32              = 1536
llama_model_loader: - kv  20:           deepseek2.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  21:             deepseek2.attention.key_length u32              = 192
llama_model_loader: - kv  22:           deepseek2.attention.value_length u32              = 128
llama_model_loader: - kv  23:       deepseek2.expert_feed_forward_length u32              = 2048
llama_model_loader: - kv  24:                     deepseek2.expert_count u32              = 256
llama_model_loader: - kv  25:              deepseek2.expert_shared_count u32              = 1
llama_model_loader: - kv  26:             deepseek2.expert_weights_scale f32              = 2.500000
llama_model_loader: - kv  27:              deepseek2.expert_weights_norm bool             = true
llama_model_loader: - kv  28:               deepseek2.expert_gating_func u32              = 2
llama_model_loader: - kv  29:             deepseek2.rope.dimension_count u32              = 64
llama_model_loader: - kv  30:                deepseek2.rope.scaling.type str              = yarn
llama_model_loader: - kv  31:              deepseek2.rope.scaling.factor f32              = 40.000000
llama_model_loader: - kv  32: deepseek2.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  33: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.100000
llama_model_loader: - kv  34:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  35:                         tokenizer.ggml.pre str              = deepseek-v3
llama_model_loader: - kv  36:                      tokenizer.ggml.tokens arr[str,129280]  = ["
llama_model_loader: - kv  37:                  tokenizer.ggml.token_type arr[i32,129280]  = [3
llama_model_loader: - kv  38:                      tokenizer.ggml.merges arr[str,127741]  = ["
llama_model_loader: - kv  39:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  40:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  41:            tokenizer.ggml.padding_token_id u32              = 1
llama_model_loader: - kv  42:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  43:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  44:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  45:               general.quantization_version u32              = 2
llama_model_loader: - kv  46:                                   split.no u16              = 0
llama_model_loader: - kv  47:                                split.count u16              = 30
llama_model_loader: - kv  48:                        split.tensors.count i32              = 1147
llama_model_loader: - type  f32:  361 tensors
llama_model_loader: - type bf16:  786 tensors
================================ Have weights data with 720 entries
[   1/1147]                    token_embd.weight - [ 7168, 129280,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor token_embd.weight

====== llama_model_quantize_internal: did not find weights for token_embd.weight
converting to q8_0_r8 .. Adding custom rule token_embd\.weight -> q8_0_r8
Adding custom rule output\.weight -> q8_0_r8
Adding custom rule output_norm\.weight -> q8_0_r8
Adding custom rule blk\.[0-2]\..* -> q8_0_r8
Adding custom rule blk\.[3-9]\.attn_.* -> q8_0_r8
Adding custom rule blk\.[1-5][0-9]\.attn_.* -> q8_0_r8
Adding custom rule blk\.60\.attn_.* -> q8_0_r8
Adding custom rule blk\.[3-9]\.ffn_norm\.weight -> q8_0_r8
Adding custom rule blk\.[1-5][0-9]\.ffn_norm\.weight -> q8_0_r8
Adding custom rule blk\.60\.ffn_norm\.weight -> q8_0_r8
Adding custom rule blk\.[3-9]\.exp_probs_b\.bias -> q8_0_r8
Adding custom rule blk\.[1-5][0-9]\.exp_probs_b\.bias -> q8_0_r8
Adding custom rule blk\.60\.exp_probs_b\.bias -> q8_0_r8
Adding custom rule blk\.[3-9]\.ffn_down_shexp\.weight -> q8_0_r8
Adding custom rule blk\.[1-5][0-9]\.ffn_down_shexp\.weight -> q8_0_r8
Adding custom rule blk\.60\.ffn_down_shexp\.weight -> q8_0_r8
Adding custom rule blk\.[3-9]\.ffn_(gate|up)_shexp\.weight -> q8_0_r8
Adding custom rule blk\.[1-5][0-9]\.ffn_(gate|up)_shexp\.weight -> q8_0_r8
Adding custom rule blk\.60\.ffn_(gate|up)_shexp\.weight -> q8_0_r8
Adding custom rule blk\.[3-9]\.ffn_down_exps\.weight -> iq5_k_r4
Adding custom rule blk\.[1-5][0-9]\.ffn_down_exps\.weight -> iq5_k_r4
Adding custom rule blk\.60\.ffn_down_exps\.weight -> iq5_k_r4
Adding custom rule blk\.[3-9]\.ffn_(gate|up)_exps\.weight -> iq4_k_r4
Adding custom rule blk\.[1-5][0-9]\.ffn_(gate|up)_exps\.weight -> iq4_k_r4
Adding custom rule blk\.60\.ffn_(gate|up)_exps\.weight -> iq4_k_r4
load_imatrix: imatrix dataset='calibration_data_v5_rc.txt'
load_imatrix: loaded 720 importance matrix entries from /mnt/raid/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324.imatrix computed on 213 chunks
prepare_imatrix: have 720 importance matrix entries
size =  1767.50 MiB ->   938.98 MiB
[   2/1147]               blk.0.attn_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
[   3/1147]                blk.0.ffn_down.weight - [18432,  7168,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.0.ffn_down.weight
converting to q8_0_r8 .. size =   252.00 MiB ->   133.88 MiB
[   4/1147]                blk.0.ffn_gate.weight - [ 7168, 18432,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.0.ffn_gate.weight
converting to q8_0_r8 .. size =   252.00 MiB ->   133.88 MiB
[   5/1147]                  blk.0.ffn_up.weight - [ 7168, 18432,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.0.ffn_up.weight
converting to q8_0_r8 .. size =   252.00 MiB ->   133.88 MiB
[   6/1147]                blk.0.ffn_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
[   7/1147]          blk.0.attn_kv_a_norm.weight - [  512,     1,     1,     1], type =    f32, size =    0.002 MB
[   8/1147]           blk.0.attn_kv_a_mqa.weight - [ 7168,   576,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.0.attn_kv_a_mqa.weight
converting to q8_0_r8 .. size =     7.88 MiB ->     4.18 MiB
[   9/1147]               blk.0.attn_kv_b.weight - [  512, 32768,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.0.attn_kv_b.weight
converting to q8_0_r8 .. size =    32.00 MiB ->    17.00 MiB
[  10/1147]                blk.0.attn_k_b.weight - [  128, 65536,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.0.attn_k_b.weight

====== llama_model_quantize_internal: did not find weights for blk.0.attn_k_b.weight
converting to q8_0_r8 .. size =    16.00 MiB ->     8.50 MiB
[  11/1147]                blk.0.attn_v_b.weight - [  512, 16384,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.0.attn_v_b.weight
converting to q8_0_r8 .. size =    16.00 MiB ->     8.50 MiB
[  12/1147]             blk.0.attn_output.weight - [16384,  7168,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.0.attn_output.weight
converting to q8_0_r8 .. size =   224.00 MiB ->   119.00 MiB
[  13/1147]           blk.0.attn_q_a_norm.weight - [ 1536,     1,     1,     1], type =    f32, size =    0.006 MB
[  14/1147]                blk.0.attn_q_a.weight - [ 7168,  1536,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.0.attn_q_a.weight
converting to q8_0_r8 .. size =    21.00 MiB ->    11.16 MiB
[  15/1147]                blk.0.attn_q_b.weight - [ 1536, 24576,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.0.attn_q_b.weight
converting to q8_0_r8 .. size =    72.00 MiB ->    38.25 MiB
[  16/1147]               blk.1.attn_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
[  17/1147]                blk.1.ffn_down.weight - [18432,  7168,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.1.ffn_down.weight
converting to q8_0_r8 .. size =   252.00 MiB ->   133.88 MiB
[  18/1147]                blk.1.ffn_gate.weight - [ 7168, 18432,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.1.ffn_gate.weight
converting to q8_0_r8 .. size =   252.00 MiB ->   133.88 MiB
[  19/1147]                  blk.1.ffn_up.weight - [ 7168, 18432,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.1.ffn_up.weight
converting to q8_0_r8 .. size =   252.00 MiB ->   133.88 MiB
[  20/1147]                blk.1.ffn_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
[  21/1147]          blk.1.attn_kv_a_norm.weight - [  512,     1,     1,     1], type =    f32, size =    0.002 MB
[  22/1147]           blk.1.attn_kv_a_mqa.weight - [ 7168,   576,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.1.attn_kv_a_mqa.weight
converting to q8_0_r8 .. size =     7.88 MiB ->     4.18 MiB
[  23/1147]               blk.1.attn_kv_b.weight - [  512, 32768,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.1.attn_kv_b.weight
converting to q8_0_r8 .. size =    32.00 MiB ->    17.00 MiB
[  24/1147]                blk.1.attn_k_b.weight - [  128, 65536,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.1.attn_k_b.weight

====== llama_model_quantize_internal: did not find weights for blk.1.attn_k_b.weight
converting to q8_0_r8 .. size =    16.00 MiB ->     8.50 MiB
[  25/1147]                blk.1.attn_v_b.weight - [  512, 16384,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.1.attn_v_b.weight
converting to q8_0_r8 .. size =    16.00 MiB ->     8.50 MiB
[  26/1147]             blk.1.attn_output.weight - [16384,  7168,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.1.attn_output.weight
converting to q8_0_r8 .. size =   224.00 MiB ->   119.00 MiB
[  27/1147]           blk.1.attn_q_a_norm.weight - [ 1536,     1,     1,     1], type =    f32, size =    0.006 MB
[  28/1147]                blk.1.attn_q_a.weight - [ 7168,  1536,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.1.attn_q_a.weight
converting to q8_0_r8 .. size =    21.00 MiB ->    11.16 MiB
[  29/1147]                blk.1.attn_q_b.weight - [ 1536, 24576,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.1.attn_q_b.weight
converting to q8_0_r8 .. size =    72.00 MiB ->    38.25 MiB
[  30/1147]               blk.2.attn_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
[  31/1147]                blk.2.ffn_down.weight - [18432,  7168,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.2.ffn_down.weight
converting to q8_0_r8 .. size =   252.00 MiB ->   133.88 MiB
[  32/1147]                blk.2.ffn_gate.weight - [ 7168, 18432,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.2.ffn_gate.weight
converting to q8_0_r8 .. size =   252.00 MiB ->   133.88 MiB
[  33/1147]                  blk.2.ffn_up.weight - [ 7168, 18432,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.2.ffn_up.weight
converting to q8_0_r8 .. size =   252.00 MiB ->   133.88 MiB
[  34/1147]                blk.2.ffn_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
[  35/1147]          blk.2.attn_kv_a_norm.weight - [  512,     1,     1,     1], type =    f32, size =    0.002 MB
[  36/1147]           blk.2.attn_kv_a_mqa.weight - [ 7168,   576,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.2.attn_kv_a_mqa.weight
converting to q8_0_r8 .. size =     7.88 MiB ->     4.18 MiB
[  37/1147]               blk.2.attn_kv_b.weight - [  512, 32768,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.2.attn_kv_b.weight
converting to q8_0_r8 .. size =    32.00 MiB ->    17.00 MiB
[  38/1147]                blk.2.attn_k_b.weight - [  128, 65536,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.2.attn_k_b.weight

====== llama_model_quantize_internal: did not find weights for blk.2.attn_k_b.weight
converting to q8_0_r8 .. size =    16.00 MiB ->     8.50 MiB
[  39/1147]                blk.2.attn_v_b.weight - [  512, 16384,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.2.attn_v_b.weight
converting to q8_0_r8 .. size =    16.00 MiB ->     8.50 MiB
[  40/1147]             blk.2.attn_output.weight - [16384,  7168,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.2.attn_output.weight
converting to q8_0_r8 .. size =   224.00 MiB ->   119.00 MiB
[  41/1147]           blk.2.attn_q_a_norm.weight - [ 1536,     1,     1,     1], type =    f32, size =    0.006 MB
[  42/1147]                blk.2.attn_q_a.weight - [ 7168,  1536,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.2.attn_q_a.weight
converting to q8_0_r8 .. size =    21.00 MiB ->    11.16 MiB
[  43/1147]                blk.2.attn_q_b.weight - [ 1536, 24576,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.2.attn_q_b.weight
converting to q8_0_r8 .. size =    72.00 MiB ->    38.25 MiB
[  44/1147]               blk.3.exp_probs_b.bias - [  256,     1,     1,     1], type =    f32, size =    0.001 MB
[  45/1147]            blk.3.ffn_gate_inp.weight - [ 7168,   256,     1,     1], type =    f32, size =    7.000 MB
[  46/1147]          blk.3.ffn_down_shexp.weight - [ 2048,  7168,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.3.ffn_down_shexp.weight
converting to q8_0_r8 .. size =    28.00 MiB ->    14.88 MiB
[  47/1147]          blk.3.ffn_gate_shexp.weight - [ 7168,  2048,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.3.ffn_gate_shexp.weight
converting to q8_0_r8 .. size =    28.00 MiB ->    14.88 MiB
[  48/1147]            blk.3.ffn_up_shexp.weight - [ 7168,  2048,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.3.ffn_up_shexp.weight
converting to q8_0_r8 .. size =    28.00 MiB ->    14.88 MiB
[  49/1147]          blk.3.attn_kv_a_norm.weight - [  512,     1,     1,     1], type =    f32, size =    0.002 MB
[  50/1147]           blk.3.attn_kv_a_mqa.weight - [ 7168,   576,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.3.attn_kv_a_mqa.weight
converting to q8_0_r8 .. size =     7.88 MiB ->     4.18 MiB
[  51/1147]               blk.3.attn_kv_b.weight - [  512, 32768,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.3.attn_kv_b.weight
converting to q8_0_r8 .. size =    32.00 MiB ->    17.00 MiB
[  52/1147]                blk.3.attn_k_b.weight - [  128, 65536,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.3.attn_k_b.weight

====== llama_model_quantize_internal: did not find weights for blk.3.attn_k_b.weight
converting to q8_0_r8 .. size =    16.00 MiB ->     8.50 MiB
[  53/1147]                blk.3.attn_v_b.weight - [  512, 16384,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.3.attn_v_b.weight
converting to q8_0_r8 .. size =    16.00 MiB ->     8.50 MiB
[  54/1147]             blk.3.attn_output.weight - [16384,  7168,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.3.attn_output.weight
converting to q8_0_r8 .. size =   224.00 MiB ->   119.00 MiB
[  55/1147]           blk.3.attn_q_a_norm.weight - [ 1536,     1,     1,     1], type =    f32, size =    0.006 MB
[  56/1147]                blk.3.attn_q_a.weight - [ 7168,  1536,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.3.attn_q_a.weight
converting to q8_0_r8 .. size =    21.00 MiB ->    11.16 MiB
[  57/1147]                blk.3.attn_q_b.weight - [ 1536, 24576,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.3.attn_q_b.weight
converting to q8_0_r8 .. size =    72.00 MiB ->    38.25 MiB
[  58/1147]               blk.3.attn_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
[  59/1147]           blk.3.ffn_down_exps.weight - [ 2048,  7168,   256,     1], type =   bf16, Using custom type iq5_k_r4 for tensor blk.3.ffn_down_exps.weight
converting to iq5_k_r4 .. size =  7168.00 MiB ->  2464.00 MiB
[  60/1147]           blk.3.ffn_gate_exps.weight - [ 7168,  2048,   256,     1], type =   bf16, Using custom type iq4_k_r4 for tensor blk.3.ffn_gate_exps.weight
converting to iq4_k_r4 .. size =  7168.00 MiB ->  2016.00 MiB
[  61/1147]             blk.3.ffn_up_exps.weight - [ 7168,  2048,   256,     1], type =   bf16, Using custom type iq4_k_r4 for tensor blk.3.ffn_up_exps.weight
converting to iq4_k_r4 .. size =  7168.00 MiB ->  2016.00 MiB
[  62/1147]                blk.3.ffn_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
[  63/1147]               blk.4.exp_probs_b.bias - [  256,     1,     1,     1], type =    f32, size =    0.001 MB
[  64/1147]            blk.4.ffn_gate_inp.weight - [ 7168,   256,     1,     1], type =    f32, size =    7.000 MB
[  65/1147]          blk.4.ffn_down_shexp.weight - [ 2048,  7168,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.4.ffn_down_shexp.weight
converting to q8_0_r8 .. size =    28.00 MiB ->    14.88 MiB
[  66/1147]          blk.4.ffn_gate_shexp.weight - [ 7168,  2048,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.4.ffn_gate_shexp.weight
converting to q8_0_r8 .. size =    28.00 MiB ->    14.88 MiB
[  67/1147]            blk.4.ffn_up_shexp.weight - [ 7168,  2048,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.4.ffn_up_shexp.weight
converting to q8_0_r8 .. size =    28.00 MiB ->    14.88 MiB
[  68/1147]          blk.4.attn_kv_a_norm.weight - [  512,     1,     1,     1], type =    f32, size =    0.002 MB
[  69/1147]           blk.4.attn_kv_a_mqa.weight - [ 7168,   576,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.4.attn_kv_a_mqa.weight
converting to q8_0_r8 .. size =     7.88 MiB ->     4.18 MiB
[  70/1147]               blk.4.attn_kv_b.weight - [  512, 32768,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.4.attn_kv_b.weight
converting to q8_0_r8 .. size =    32.00 MiB ->    17.00 MiB
[  71/1147]                blk.4.attn_k_b.weight - [  128, 65536,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.4.attn_k_b.weight

====== llama_model_quantize_internal: did not find weights for blk.4.attn_k_b.weight
converting to q8_0_r8 .. size =    16.00 MiB ->     8.50 MiB
[  72/1147]                blk.4.attn_v_b.weight - [  512, 16384,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.4.attn_v_b.weight
converting to q8_0_r8 .. size =    16.00 MiB ->     8.50 MiB
[  73/1147]             blk.4.attn_output.weight - [16384,  7168,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.4.attn_output.weight
converting to q8_0_r8 .. size =   224.00 MiB ->   119.00 MiB
[  74/1147]           blk.4.attn_q_a_norm.weight - [ 1536,     1,     1,     1], type =    f32, size =    0.006 MB
[  75/1147]                blk.4.attn_q_a.weight - [ 7168,  1536,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.4.attn_q_a.weight
converting to q8_0_r8 .. size =    21.00 MiB ->    11.16 MiB
[  76/1147]                blk.4.attn_q_b.weight - [ 1536, 24576,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.4.attn_q_b.weight
converting to q8_0_r8 .. size =    72.00 MiB ->    38.25 MiB
[  77/1147]               blk.4.attn_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
[  78/1147]           blk.4.ffn_down_exps.weight - [ 2048,  7168,   256,     1], type =   bf16, Using custom type iq5_k_r4 for tensor blk.4.ffn_down_exps.weight
converting to iq5_k_r4 .. size =  7168.00 MiB ->  2464.00 MiB
[  79/1147]           blk.4.ffn_gate_exps.weight - [ 7168,  2048,   256,     1], type =   bf16, Using custom type iq4_k_r4 for tensor blk.4.ffn_gate_exps.weight
converting to iq4_k_r4 .. size =  7168.00 MiB ->  2016.00 MiB
[  80/1147]             blk.4.ffn_up_exps.weight - [ 7168,  2048,   256,     1], type =   bf16, Using custom type iq4_k_r4 for tensor blk.4.ffn_up_exps.weight
converting to iq4_k_r4 .. size =  7168.00 MiB ->  2016.00 MiB
[  81/1147]                blk.4.ffn_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
[  82/1147]          blk.5.attn_kv_a_norm.weight - [  512,     1,     1,     1], type =    f32, size =    0.002 MB
[  83/1147]           blk.5.attn_kv_a_mqa.weight - [ 7168,   576,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.5.attn_kv_a_mqa.weight
converting to q8_0_r8 .. size =     7.88 MiB ->     4.18 MiB
[  84/1147]               blk.5.attn_kv_b.weight - [  512, 32768,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.5.attn_kv_b.weight
converting to q8_0_r8 .. size =    32.00 MiB ->    17.00 MiB
[  85/1147]                blk.5.attn_k_b.weight - [  128, 65536,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.5.attn_k_b.weight

====== llama_model_quantize_internal: did not find weights for blk.5.attn_k_b.weight

# SNIP text was too long for github issues

====== llama_model_quantize_internal: did not find weights for blk.59.attn_k_b.weight
converting to q8_0_r8 .. size =    16.00 MiB ->     8.50 MiB
[1117/1147]               blk.59.attn_v_b.weight - [  512, 16384,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.59.attn_v_b.weight
converting to q8_0_r8 .. size =    16.00 MiB ->     8.50 MiB
[1118/1147]            blk.59.attn_output.weight - [16384,  7168,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.59.attn_output.weight
converting to q8_0_r8 .. size =   224.00 MiB ->   119.00 MiB
[1119/1147]          blk.59.attn_q_a_norm.weight - [ 1536,     1,     1,     1], type =    f32, size =    0.006 MB
[1120/1147]               blk.59.attn_q_a.weight - [ 7168,  1536,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.59.attn_q_a.weight
converting to q8_0_r8 .. size =    21.00 MiB ->    11.16 MiB
[1121/1147]               blk.59.attn_q_b.weight - [ 1536, 24576,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.59.attn_q_b.weight
converting to q8_0_r8 .. size =    72.00 MiB ->    38.25 MiB
[1122/1147]              blk.59.attn_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
[1123/1147]          blk.59.ffn_down_exps.weight - [ 2048,  7168,   256,     1], type =   bf16, Using custom type iq5_k_r4 for tensor blk.59.ffn_down_exps.weight
converting to iq5_k_r4 .. size =  7168.00 MiB ->  2464.00 MiB
[1124/1147]          blk.59.ffn_gate_exps.weight - [ 7168,  2048,   256,     1], type =   bf16, Using custom type iq4_k_r4 for tensor blk.59.ffn_gate_exps.weight
converting to iq4_k_r4 .. size =  7168.00 MiB ->  2016.00 MiB
[1125/1147]            blk.59.ffn_up_exps.weight - [ 7168,  2048,   256,     1], type =   bf16, Using custom type iq4_k_r4 for tensor blk.59.ffn_up_exps.weight
converting to iq4_k_r4 .. size =  7168.00 MiB ->  2016.00 MiB
[1126/1147]               blk.59.ffn_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
[1127/1147]              blk.60.exp_probs_b.bias - [  256,     1,     1,     1], type =    f32, size =    0.001 MB
[1128/1147]           blk.60.ffn_gate_inp.weight - [ 7168,   256,     1,     1], type =    f32, size =    7.000 MB
[1129/1147]         blk.60.ffn_down_shexp.weight - [ 2048,  7168,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.60.ffn_down_shexp.weight
converting to q8_0_r8 .. size =    28.00 MiB ->    14.88 MiB
[1130/1147]         blk.60.ffn_gate_shexp.weight - [ 7168,  2048,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.60.ffn_gate_shexp.weight
converting to q8_0_r8 .. size =    28.00 MiB ->    14.88 MiB
[1131/1147]           blk.60.ffn_up_shexp.weight - [ 7168,  2048,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.60.ffn_up_shexp.weight
converting to q8_0_r8 .. size =    28.00 MiB ->    14.88 MiB
[1132/1147]         blk.60.attn_kv_a_norm.weight - [  512,     1,     1,     1], type =    f32, size =    0.002 MB
[1133/1147]          blk.60.attn_kv_a_mqa.weight - [ 7168,   576,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.60.attn_kv_a_mqa.weight
converting to q8_0_r8 .. size =     7.88 MiB ->     4.18 MiB
[1134/1147]              blk.60.attn_kv_b.weight - [  512, 32768,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.60.attn_kv_b.weight
converting to q8_0_r8 .. size =    32.00 MiB ->    17.00 MiB
[1135/1147]               blk.60.attn_k_b.weight - [  128, 65536,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.60.attn_k_b.weight

====== llama_model_quantize_internal: did not find weights for blk.60.attn_k_b.weight
converting to q8_0_r8 .. size =    16.00 MiB ->     8.50 MiB
[1136/1147]               blk.60.attn_v_b.weight - [  512, 16384,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.60.attn_v_b.weight
converting to q8_0_r8 .. size =    16.00 MiB ->     8.50 MiB
[1137/1147]            blk.60.attn_output.weight - [16384,  7168,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.60.attn_output.weight
converting to q8_0_r8 .. size =   224.00 MiB ->   119.00 MiB
[1138/1147]          blk.60.attn_q_a_norm.weight - [ 1536,     1,     1,     1], type =    f32, size =    0.006 MB
[1139/1147]               blk.60.attn_q_a.weight - [ 7168,  1536,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.60.attn_q_a.weight
converting to q8_0_r8 .. size =    21.00 MiB ->    11.16 MiB
[1140/1147]               blk.60.attn_q_b.weight - [ 1536, 24576,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.60.attn_q_b.weight
converting to q8_0_r8 .. size =    72.00 MiB ->    38.25 MiB
[1141/1147]                        output.weight - [ 7168, 129280,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor output.weight

====== llama_model_quantize_internal: did not find weights for output.weight
converting to q8_0_r8 .. size =  1767.50 MiB ->   938.98 MiB
[1142/1147]              blk.60.attn_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
[1143/1147]          blk.60.ffn_down_exps.weight - [ 2048,  7168,   256,     1], type =   bf16, Using custom type iq5_k_r4 for tensor blk.60.ffn_down_exps.weight
converting to iq5_k_r4 .. size =  7168.00 MiB ->  2464.00 MiB
[1144/1147]          blk.60.ffn_gate_exps.weight - [ 7168,  2048,   256,     1], type =   bf16, Using custom type iq4_k_r4 for tensor blk.60.ffn_gate_exps.weight
converting to iq4_k_r4 .. size =  7168.00 MiB ->  2016.00 MiB
[1145/1147]            blk.60.ffn_up_exps.weight - [ 7168,  2048,   256,     1], type =   bf16, Using custom type iq4_k_r4 for tensor blk.60.ffn_up_exps.weight
converting to iq4_k_r4 .. size =  7168.00 MiB ->  2016.00 MiB
[1146/1147]               blk.60.ffn_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
[1147/1147]                   output_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
llama_model_quantize_internal: model size  = 1282038.27 MB
llama_model_quantize_internal: quant size  = 395450.97 MB

main: quantize time = 5308904.06 ms
main:    total time = 5308904.06 ms
```

</details>


<details>

<summary>Perplexity Procedure</summary>

#### Output Logs
```bash
$ numactl -N 1 -m 1 \
./build/bin/llama-perplexity \
    --model /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-CPU-IQ4_K_R4.gguf \
    -ctk q8_0 \
    -mla 3 -fa \
    -amb 512 \
    -fmoe \
    --ctx-size 512 \
    --ubatch-size 512 \
    -f wiki.test.raw \
    --seed 1337 \
    --numa numactl \
    --threads 128

main: build = 3613 (4819257c)
main: built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
main: seed  = 1337
llama_model_loader: loaded meta data with 50 key-value pairs and 1147 tensors from /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-CPU-
IQ4_K_R4.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek V3 0324
llama_model_loader: - kv   3:                            general.version str              = V3-0324
llama_model_loader: - kv   4:                           general.basename str              = DeepSeek
llama_model_loader: - kv   5:                         general.size_label str              = 256x21B
llama_model_loader: - kv   6:                            general.license str              = mit
llama_model_loader: - kv   7:                      deepseek2.block_count u32              = 61
llama_model_loader: - kv   8:                   deepseek2.context_length u32              = 163840
llama_model_loader: - kv   9:                 deepseek2.embedding_length u32              = 7168
llama_model_loader: - kv  10:              deepseek2.feed_forward_length u32              = 18432
llama_model_loader: - kv  11:             deepseek2.attention.head_count u32              = 128
llama_model_loader: - kv  12:          deepseek2.attention.head_count_kv u32              = 128
llama_model_loader: - kv  13:                   deepseek2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  14: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  15:                deepseek2.expert_used_count u32              = 8
llama_model_loader: - kv  16:                          general.file_type u32              = 340
llama_model_loader: - kv  17:        deepseek2.leading_dense_block_count u32              = 3
llama_model_loader: - kv  18:                       deepseek2.vocab_size u32              = 129280
llama_model_loader: - kv  19:            deepseek2.attention.q_lora_rank u32              = 1536
llama_model_loader: - kv  20:           deepseek2.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  21:             deepseek2.attention.key_length u32              = 192
llama_model_loader: - kv  22:           deepseek2.attention.value_length u32              = 128
llama_model_loader: - kv  23:       deepseek2.expert_feed_forward_length u32              = 2048
llama_model_loader: - kv  24:                     deepseek2.expert_count u32              = 256
llama_model_loader: - kv  25:              deepseek2.expert_shared_count u32              = 1
llama_model_loader: - kv  26:             deepseek2.expert_weights_scale f32              = 2.500000
llama_model_loader: - kv  27:              deepseek2.expert_weights_norm bool             = true
llama_model_loader: - kv  28:               deepseek2.expert_gating_func u32              = 2
llama_model_loader: - kv  29:             deepseek2.rope.dimension_count u32              = 64
llama_model_loader: - kv  30:                deepseek2.rope.scaling.type str              = yarn
llama_model_loader: - kv  31:              deepseek2.rope.scaling.factor f32              = 40.000000
llama_model_loader: - kv  32: deepseek2.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  33: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.100000
llama_model_loader: - kv  34:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  35:                         tokenizer.ggml.pre str              = deepseek-v3
llama_model_loader: - kv  36:                      tokenizer.ggml.tokens arr[str,129280]  = ["
llama_model_loader: - kv  37:                  tokenizer.ggml.token_type arr[i32,129280]  = [3
llama_model_loader: - kv  38:                      tokenizer.ggml.merges arr[str,127741]  = ["
llama_model_loader: - kv  39:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  40:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  41:            tokenizer.ggml.padding_token_id u32              = 1
llama_model_loader: - kv  42:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  43:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  44:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  45:               general.quantization_version u32              = 2
llama_model_loader: - kv  46:                      quantize.imatrix.file str              = /mnt/raid/models/ubergarm/DeepSeek-V3...
llama_model_loader: - kv  47:                   quantize.imatrix.dataset str              = calibration_data_v5_rc.txt
llama_model_loader: - kv  48:             quantize.imatrix.entries_count i32              = 720
llama_model_loader: - kv  49:              quantize.imatrix.chunks_count i32              = 213
llama_model_loader: - type  f32:  361 tensors
llama_model_loader: - type q8_0_r8:  612 tensors
llama_model_loader: - type iq4_k_r4:  116 tensors
llama_model_loader: - type iq5_k_r4:   58 tensors
llm_load_vocab: special tokens cache size = 818
llm_load_vocab: token to piece cache size = 0.8223 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = deepseek2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 129280
llm_load_print_meta: n_merges         = 127741
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 163840
llm_load_print_meta: n_embd           = 7168
llm_load_print_meta: n_layer          = 61
llm_load_print_meta: n_head           = 128
llm_load_print_meta: n_head_kv        = 128
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 192
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 24576
llm_load_print_meta: n_embd_v_gqa     = 16384
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 18432
llm_load_print_meta: n_expert         = 256
llm_load_print_meta: n_expert_used    = 8
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = yarn
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 0.025
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 671B
llm_load_print_meta: model ftype      = IQ4_K_R4 - 4.5 bpw
llm_load_print_meta: model params     = 672.050 B
llm_load_print_meta: model size       = 386.183 GiB (4.936 BPW)
llm_load_print_meta: repeating layers = 384.349 GiB (4.926 BPW, 670.196 B parameters)
llm_load_print_meta: general.name     = DeepSeek V3 0324
llm_load_print_meta: BOS token        = 0 '<｜begin▁of▁sentence｜>'
llm_load_print_meta: EOS token        = 1 '<｜end▁of▁sentence｜>'
llm_load_print_meta: PAD token        = 1 '<｜end▁of▁sentence｜>'
llm_load_print_meta: LF token         = 131 'Ä'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_layer_dense_lead   = 3
llm_load_print_meta: n_lora_q             = 1536
llm_load_print_meta: n_lora_kv            = 512
llm_load_print_meta: n_ff_exp             = 2048
llm_load_print_meta: n_expert_shared      = 1
llm_load_print_meta: expert_weights_scale = 2.5
llm_load_print_meta: expert_weights_norm  = 1
llm_load_print_meta: expert_gating_func   = sigmoid
llm_load_print_meta: rope_yarn_log_mul    = 0.1000
llm_load_tensors: ggml ctx size =    0.47 MiB
llm_load_tensors:        CPU buffer size = 395450.97 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn   = 3
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe  = 1
llama_new_context_with_model: ser        = -1, 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init: layer 0: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 1: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 2: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 3: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 4: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 5: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 6: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 7: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 8: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 9: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 10: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 11: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 12: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 13: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 14: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 15: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 16: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 17: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 18: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 19: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 20: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 21: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 22: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 23: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 24: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 25: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 26: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 27: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 28: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 29: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 30: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 31: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 32: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 33: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 34: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 35: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 36: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 37: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 38: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 39: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 40: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 41: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 42: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 43: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 44: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 45: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 46: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 47: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 48: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 49: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 50: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 51: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 52: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 53: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 54: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 55: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 56: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 57: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 58: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 59: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 60: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init:        CPU KV buffer size =    72.91 MiB
llama_new_context_with_model: KV self size  =   72.91 MiB, c^KV (q8_0):   72.91 MiB, kv^T: not used
llama_new_context_with_model:        CPU  output buffer size =     1.97 MiB
llama_new_context_with_model:        CPU compute buffer size =   450.01 MiB
llama_new_context_with_model: graph nodes  = 3487
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 128 / 512 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 |
NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE =
1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 928.692 ms
perplexity: calculating perplexity over 561 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 15.08 seconds per pass - ETA 35.23 minutes
[1]621042.4845,[2]480288.4154,[3]384849.5504,[4]411291.6749,[5]342382.0527,[6]347496.7446,[7]338598.0612,[8]338938.1630,[9]343341.0863,[10]329407.7871
,[11]328794.0950,[12]349036.2429,[13]339812.6162,[14]327127.2843,[15]318294.3349,[16]320629.0762,[17]318911.2283,[18]306946.2653,[19]320742.9747,[20]3
20520.4166,[21]323369.9752,[22]321108.7583,[23]320950.8245,[24]323537.1597,[25]313530.9380,[26]307858.8254,[27]305584.6174,[28]304930.6946,[29]319325.
7633,[30]316463.6020,[31]318028.8556,[32]323730.7568,[33]336376.3859,[34]338644.6368,[35]341295.5596,[36]346582.1772,[37]343638.6921,[38]346920.0126,[
39]346553.2755,[40]339975.1907,[41]338080.6482,[42]341607.0511,[43]342165.4351,[44]343495.4481,[45]341683.0497,[46]341841.5203,[47]341968.2578,[48]341
018.8794,[49]337906.3680,[50]340880.4017,[51]343264.7780,[52]341172.5260,[53]341895.8030,[54]342362.6716,[55]339077.7577,[56]337472.3629,[57]338597.41
79,[58]338840.4233,[59]340391.7068,[60]341329.9617,[61]338907.2644,[62]338654.8390,[63]340597.6581,[64]341464.0272,[65]339761.5866,[66]337473.1508,[67
]334628.2254,[68]335027.1919,[69]336085.7135,[70]334748.0318,[71]334310.4754,[72]332610.8172,[73]331121.5117,[74]331604.3876,[75]331320.1529,[76]33491
0.8814,[77]336051.4006,[78]335753.6115,[79]337362.5269,[80]335564.3466,[81]332456.8750,[82]331609.4385,[83]333316.4520,[84]335084.6156,[85]334711.4110
,[86]334160.7888,[87]332126.7278,[88]331597.7024,[89]331461.8908,[90]330703.9912,[91]331143.7667,[92]328566.8218,[93]327220.3991,[94]327306.2202,[95]3
28760.6069,[96]331831.1512,[97]331100.4377,[98]331676.2039,[99]331115.3237,[100]332922.5225,[101]330521.2050,[102]330638.9063,[103]330508.2943,[104]33
3336.3249,[105]332252.4134,[106]331511.8882,[107]331478.9005,[108]330800.7499,[109]331643.0452,[110]332295.2747,[111]331716.4016,[112]333145.4543,[113
]332446.6042,[114]332605.4088,[115]334144.7878,[116]334062.6775,[117]334795.9300,[118]335185.6388,[119]336442.8975,[120]336288.3524,[121]337854.3067,[
122]342121.8593,[123]342443.4687,[124]343659.0524,[125]344785.3775,[126]345809.3526,[127]347207.6305,[128]348210.4479,[129]349672.3288,[130]350221.461
2,[131]350215.0059,[132]352167.2450,[133]351660.6672,[134]353361.5754,[135]354848.8108,[136]353175.7897,[137]353870.5511,[138]355061.4101,[139]355874.
4197,[140]356669.3123,[141]355293.1474,[142]354584.2063,[143]353505.6443,[144]354011.7258,[145]352950.0290,[146]352775.3758,[147]350332.0398,[148]3489
19.1460,[149]348589.1782,[150]348457.2881,[151]347884.5859,[152]347551.9711,[153]346394.1977,[154]345076.4034,[155]342799.4862,[156]342481.4941,[157]3
42472.8007,[158]341437.5809,[159]341069.4855,[160]340176.4801,[161]340547.0153,[162]341245.8648,[163]340449.0528,[164]339162.6069,[165]339049.6867,[16
6]340108.0202,[167]338993.8220,[168]338633.1774,[169]337653.7408,[170]337330.2507,[171]337964.2748,[172]336817.5461,[173]335656.4557,[174]335356.9395,
[175]335636.9791,[176]336962.6238,[177]336571.5140,[178]336611.6326,[179]336169.1428,[180]337152.8681,[181]336928.3568,[182]337374.7017,[183]336574.88
30,[184]336549.1612,[185]336890.1861,[186]336270.8240,[187]336033.7314,[188]336260.7362,[189]336337.6063,[190]335905.2686,[191]335671.5326,[192]336063
.9825,[193]336254.3945,[194]336390.3271,[195]336058.7223,[196]336123.5871,[197]336272.6905,[198]336581.7609,[199]336125.9311,[200]336175.1478,[201]335
261.2004,[202]335722.4991,[203]335732.0036,[204]336010.6380,[205]336554.9746,[206]336870.3485,[207]337512.5650,[208]337800.7907,[209]337957.8198,[210]
339006.8855,[211]339536.3558,[212]339771.6654,[213]339820.9878,[214]340649.4873,[215]340871.1208,[216]341088.6222,[217]340871.9526,[218]340944.1487,[2
19]341612.6012,[220]342518.8541,[221]342988.1971,[222]342574.7840,[223]343481.4894,[224]343029.3821,[225]343295.2932,[226]343032.9993,[227]343704.6932
,[228]345175.9576,[229]345567.2666,[230]346984.2971,[231]347891.9790,[232]348421.3554,[233]347906.3728,[234]348105.3882,[235]347709.6448,[236]347865.7
097,[237]347051.5113,[238]347476.0560,[239]348607.8464,[240]347950.9243,[241]348175.2049,[242]348260.1216,[243]348118.1121,[244]349105.7627,[245]35034
3.6532,[246]351018.4541,[247]349972.1138,[248]349626.9985,[249]349815.8200,[250]349784.0491,[251]349044.6743,[252]348851.4149,[253]347922.8042,[254]34
7737.7496,[255]347553.6986,[256]347998.6214,[257]348681.4274,[258]348605.3748,[259]347746.3318,[260]347249.1009,[261]347208.6900,[262]346804.7642,[263
]346325.7216,[264]345906.9311,[265]345908.3860,[266]345701.0113,[267]345709.4001,[268]345912.5002,[269]346098.0048,[270]345980.1661,[271]345810.4070,[
272]345554.0991,[273]345337.1543,[274]344923.7055,[275]344460.3920,[276]343342.6230,[277]343576.3771,[278]342718.8707,[279]342988.6333,[280]343045.420
5,[281]342954.1471,[282]343121.6664,[283]343447.0750,[284]343345.1687,[285]343518.5285,[286]343098.9947,[287]342822.1719,[288]342853.3967,[289]343641.
2162,[290]343374.6100,[291]343746.9794,[292]343718.3872,[293]343928.4375,[294]344298.2272,[295]344357.2789,[296]344897.7471,[297]343889.5777,[298]3443
89.0557,[299]345317.8505,[300]344843.8735,[301]345089.1796,[302]345391.7513,[303]344981.9309,[304]345274.1943,[305]345361.9946,[306]344615.1515,[307]3
44191.7641,[308]344244.3699,[309]343919.6349,[310]344199.1177,[311]344405.9163,[312]344450.0979,[313]344439.8224,[314]344141.4730,[315]342825.3627,[31
6]341433.4296,[317]340663.0907,[318]339582.1865,[319]338423.3959,[320]338431.9492,[321]338115.6464,[322]337707.7252,[323]337509.5115,[324]337143.1945,
[325]336863.2449,[326]336823.7532,[327]336944.8010,[328]336631.8671,[329]335992.6150,[330]335818.9447,[331]335230.9186,[332]335293.0504,[333]334905.10
22,[334]335016.8497,[335]334882.2233,[336]335010.3878,[337]334898.4524,[338]334669.4391,[339]334527.0858,[340]334121.5989,[341]333836.9861,[342]334106
.1635,[343]334063.7962,[344]334203.4633,[345]334543.9787,[346]334077.9966,[347]334284.0650,[348]334445.7269,[349]334827.9118,[350]334821.3506,[351]334
479.8770,[352]334176.5657,[353]334025.4542,[354]333939.9035,[355]333898.6704,[356]333624.9149,[357]333237.7507,[358]333661.4850,[359]334098.6600,[360]
334318.0128,[361]334045.3073,[362]333919.0924,[363]333648.6163,[364]334117.8579,[365]334137.6652,[366]334344.9832,[367]334292.8768,[368]334416.0816,[3
69]334236.0430,[370]334155.9937,[371]333734.8777,[372]334073.4287,[373]333972.2325,[374]333610.6319,[375]333627.4234,[376]333967.3869,[377]334455.1315
,[378]334648.7305,[379]334723.9790,[380]334915.8106,[381]334783.0520,[382]334792.9807,[383]334292.3066,[384]334761.0592,[385]334650.0049,[386]334250.9
363,[387]334130.7030,[388]334962.6261,[389]335103.6648,[390]334964.4796,[391]335155.0150,[392]335258.2591,[393]335715.2107,[394]336216.3549,[395]33678
4.9280,[396]336825.6375,[397]336514.6311,[398]336291.0403,[399]335938.5148,[400]335934.1942,[401]336392.6242,[402]335974.0197,[403]336289.9238,[404]33
6379.4946,[405]336555.6353,[406]336369.9217,[407]336264.4100,[408]336306.2972,[409]336062.0189,[410]336218.9131,[411]335872.2278,[412]335754.9736,[413
]335586.0973,[414]335124.5066,[415]335378.1566,[416]335487.5042,[417]335712.7851,[418]335428.0417,[419]335734.1041,[420]336284.5707,[421]336296.1309,[
422]335716.1559,[423]335819.8443,[424]335746.8833,[425]335446.8556,[426]335455.4698,[427]335421.7328,[428]335308.4573,[429]335308.3605,[430]335634.427
1,[431]335941.7238,[432]335805.4835,[433]335864.1890,[434]335795.2289,[435]335790.3390,[436]336183.7092,[437]336053.6280,[438]336412.7182,[439]336779.
1893,[440]336638.0088,[441]336696.3587,[442]336693.5864,[443]336947.3901,[444]337364.4074,[445]337188.6797,[446]336960.3097,[447]336982.3581,[448]3367
40.4896,[449]336800.7335,[450]337456.5018,[451]337628.6795,[452]338075.0179,[453]338217.9506,[454]338563.8328,[455]338449.4376,[456]338244.9696,[457]3
38254.9905,[458]337899.0490,[459]338065.0851,[460]338084.4375,[461]338013.8557,[462]337774.4167,[463]338030.2594,[464]337997.7621,[465]338313.0132,[46
6]338480.3486,[467]338553.1094,[468]338698.8431,[469]338961.8873,[470]339099.5448,[471]339529.5247,[472]339518.9106,[473]339533.8010,[474]339280.8227,
[475]339337.3000,[476]339614.2696,[477]339436.1779,[478]339499.3813,[479]339569.9636,[480]339304.3727,[481]339458.5688,[482]339531.7829,[483]339698.45
70,[484]339156.1393,[485]339477.7685,[486]340238.3424,[487]340379.7815,[488]340655.9210,[489]340516.3203,[490]340570.0327,[491]340506.7411,[492]340278
.8962,[493]340258.7227,[494]340450.1686,[495]339995.1085,[496]340057.2055,[497]340209.0422,[498]339943.5230,[499]339784.5338,[500]339990.5147,[501]339
970.8131,[502]340371.5679,[503]340059.3617,[504]339792.6366,[505]339453.2254,[506]339424.0224,[507]339627.8620,[508]339683.1626,[509]339688.5786,[510]
339971.3743,[511]340134.1403,[512]340558.5657,[513]340734.9633,[514]341007.3962,[515]341043.8739,[516]341339.0372,[517]341604.4826,[518]341228.6644,[5
19]340909.3084,[520]340917.5889,[521]340871.2405,[522]340629.4603,[523]340600.1478,[524]340494.6514,[525]339985.5894,[526]339798.1336,[527]339423.1168
,[528]339574.7999,[529]338999.3788,[530]338866.6454,[531]339064.2290,[532]338175.7611,[533]338193.8181,[534]338591.1751,[535]338794.1938,[536]338815.3
925,[537]338854.7276,[538]338997.8122,[539]339560.6960,[540]339563.1839,[541]339606.7486,[542]339558.3348,[543]339493.1708,[544]339729.4373,[545]34020
8.8763,[546]340231.7345,[547]340359.0196,[548]340906.6126,[549]341063.1162,[550]341158.9496,[551]341645.1513,[552]341690.2990,[553]341566.8309,[554]34
1969.4067,[555]341819.3313,[556]341737.7033,[557]341893.9760,[558]341486.6305,[559]341186.3327,[560]340936.6909,[561]340925.0560,
llama_print_timings:        load time =    2238.45 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time = 2214337.48 ms / 287232 tokens (    7.71 ms per token,   129.71 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time = 2264778.41 ms / 287233 tokens

Final estimate: PPL = 340925.0560 +/- 2519.12041
```

</details>


<details>

<summary>llama-server response to chat client looks wrong</summary>

I tried various combinations of server configs and all yielded same wrong looking responses in client.

#### Start Server
```bash
#### First attempt
numactl -N 0 -m 0 \
./build/bin/llama-server \
    --model /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-CPU-IQ4_K_R4.gguf \
    --alias ubergarm/DeepSeek-V3-0324-CPU-IQ4_K_R4 \
    --ctx-size 8192 \
    -ctk q8_0 \
    -mla 3 -fa \ # also tried -mla 2
    -amb 2048 \
    -fmoe \
    --temp 0.3 \
    --parallel 1 \
    --threads 128 \
    --numa numactl \
    --host 127.0.0.1 \
    --port 8080

#### Second attempt
numactl -N 0 -m 0 \
./build/bin/llama-server \
    --model /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-CPU-IQ4_K_R4.gguf \
    --alias ubergarm/DeepSeek-V3-0324-CPU-IQ4_K_R4 \
    --ctx-size 8192 \
    --parallel 1 \
    --threads 128 \
    --numa numactl \
    --host 127.0.0.1 \
    --port 8080
```

#### Start Client
```bash
$ python dchat.py
Input prompt then press Ctrl+D twice (or once on empty line) to send.
Ctrl+C to cancel response or twice to exit.

>>> User:

Count from 1 to 10 in French.

>>> Assistant:

AlrightAlrightAlrightAlright
>>> User:

^C^C

Exiting...
```

</details>


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Possible numerical stability issue with experimental quant of DeepSeek-V3-0324? #296

tl;dr;

Details

Logs

Quantization Recipe Script

Output Logs

Output Logs

Start Server

Start Client

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Possible numerical stability issue with experimental quant of DeepSeek-V3-0324? #296

Description

tl;dr;

Details

Logs

Quantization Recipe Script

Output Logs

Output Logs

Start Server

Start Client

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions