Possible numerical stability issue with experimental quant of DeepSeek-V3-0324? #296

@ubergarm

Description

tl;dr

UPDATE: skip to the end. I probably shouldn't use q8_0_r8 for token_embd.weight and should just leave it as q8_0.

I cooked up a DeepSeek-V3-0324 quant specifically for CPU-only inference on the Xeon 6980P rig, and I am getting very large perplexity values and broken llama-server responses.

Not sure if this is user error, an invalid recipe, or an issue with computing one of the quant types.

Details

This was my intended recipe mix:

  • q8_0_r8 for all the embeddings, attention, norms, bias, and shared experts tensors
  • iq5_k_r4 for all routed MoE down projection tensors
  • iq4_k_r4 for all routed MoE gate/up tensors

This is what is reported when starting up with it:

llama_model_loader: - type  f32:  361 tensors
llama_model_loader: - type q8_0_r8:  612 tensors
llama_model_loader: - type iq4_k_r4:  116 tensors
llama_model_loader: - type iq5_k_r4:   58 tensors
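
As a quick sanity check on those counts, here is a minimal sketch that compares them against the layer layout. It assumes the values from the GGUF metadata in the quantization log further down (61 blocks, 3 leading dense blocks, 1147 tensors total) and that each MoE layer carries one routed down tensor and two routed gate/up tensors, which is what the per-tensor log lines show. The variable names are just for illustration.

```shell
# Values copied from the metadata and loader output quoted in this issue.
block_count=61      # deepseek2.block_count
dense_blocks=3      # deepseek2.leading_dense_block_count
moe_blocks=$((block_count - dense_blocks))   # layers 3..60

# Per MoE layer: 1 routed down tensor, 2 routed gate/up tensors.
expected_iq5=$((moe_blocks * 1))
expected_iq4=$((moe_blocks * 2))

# Per-type counts as reported by llama_model_loader at startup.
f32=361
q8_0_r8=612
iq4_k_r4=116
iq5_k_r4=58
total=$((f32 + q8_0_r8 + iq4_k_r4 + iq5_k_r4))

echo "MoE layers:    $moe_blocks"
echo "iq5_k_r4:      expected $expected_iq5, reported $iq5_k_r4"
echo "iq4_k_r4:      expected $expected_iq4, reported $iq4_k_r4"
echo "total tensors: $total"
```

If the expected and reported numbers agree and the total comes out to 1147, the routed-expert rules at least landed on the number of tensors I intended, so a miscount there is probably not the problem.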

I'm not 100% sure whether the issue is in the CPU inference computation for iq5_k_r4 or iq4_k_r4, or whether I messed up somewhere in my scripts.

Potentially relevant topics:
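
One way to rule out a scripting mistake in the layer ranges is to test the three range prefixes used in the recipe script below against every layer index. This is only a sketch: it uses grep -E anchored at the start of the name as a stand-in, and I am assuming that behaves like whatever regex matching llama-quantize applies to custom rules. count_hits is a helper made up for this check.

```shell
# Print how many of the recipe's layer-range prefixes match the tensor name in $1.
count_hits() {
  n=0
  for p in 'blk\.[3-9]\.' 'blk\.[1-5][0-9]\.' 'blk\.60\.'; do
    if printf '%s\n' "$1" | grep -Eq "^$p"; then
      n=$((n + 1))
    fi
  done
  echo "$n"
}

covered=0   # layers matched by exactly one prefix
bad=""      # layers whose hit count differs from what is expected
for i in $(seq 0 60); do
  hits=$(count_hits "blk.$i.ffn_down_exps.weight")
  # Dense layers 0-2 should match nothing; MoE layers 3-60 exactly one prefix.
  if [ "$i" -ge 3 ]; then expected=1; else expected=0; fi
  if [ "$hits" -eq 1 ]; then covered=$((covered + 1)); fi
  if [ "$hits" -ne "$expected" ]; then bad="$bad $i"; fi
done

echo "layers matched exactly once: $covered"
echo "unexpected layers:${bad:- none}"
```

If this reports 58 layers and no unexpected ones, the range prefixes themselves are fine and the problem is more likely elsewhere.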

  1. Recent PR292 seems to have fixed the previous issue with q8_0 numerical stability.
  2. I asked @saood06 as he has been experimenting with these quants in our discussion here.

Logs

I've provided logs of quantization, perplexity, and llama-server below for reference.

Everything was rebuilt and run on updated ik_llama.cpp/main@4819257c.

Quantization Procedure

Quantization Recipe Script

#!/usr/bin/env bash

custom="
# Token embedding and output tensors
token_embd\.weight=q8_0_r8
output\.weight=q8_0_r8
output_norm\.weight=q8_0_r8

# First 3 dense layers (0-2)
blk\.[0-2]\..*=q8_0_r8

# All attention, norm weights, and bias tensors for MoE layers (3-60)
blk\.[3-9]\.attn_.*=q8_0_r8
blk\.[1-5][0-9]\.attn_.*=q8_0_r8
blk\.60\.attn_.*=q8_0_r8

blk\.[3-9]\.ffn_norm\.weight=q8_0_r8
blk\.[1-5][0-9]\.ffn_norm\.weight=q8_0_r8
blk\.60\.ffn_norm\.weight=q8_0_r8

blk\.[3-9]\.exp_probs_b\.bias=q8_0_r8
blk\.[1-5][0-9]\.exp_probs_b\.bias=q8_0_r8
blk\.60\.exp_probs_b\.bias=q8_0_r8

# Shared Experts (3-60)
blk\.[3-9]\.ffn_down_shexp\.weight=q8_0_r8
blk\.[1-5][0-9]\.ffn_down_shexp\.weight=q8_0_r8
blk\.60\.ffn_down_shexp\.weight=q8_0_r8

blk\.[3-9]\.ffn_(gate|up)_shexp\.weight=q8_0_r8
blk\.[1-5][0-9]\.ffn_(gate|up)_shexp\.weight=q8_0_r8
blk\.60\.ffn_(gate|up)_shexp\.weight=q8_0_r8

# MoE Experts (3-60)
blk\.[3-9]\.ffn_down_exps\.weight=iq5_k_r4
blk\.[1-5][0-9]\.ffn_down_exps\.weight=iq5_k_r4
blk\.60\.ffn_down_exps\.weight=iq5_k_r4

blk\.[3-9]\.ffn_(gate|up)_exps\.weight=iq4_k_r4
blk\.[1-5][0-9]\.ffn_(gate|up)_exps\.weight=iq4_k_r4
blk\.60\.ffn_(gate|up)_exps\.weight=iq4_k_r4
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --imatrix /mnt/raid/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324.imatrix \
    --token-embedding-type q8_0_r8 \
    --output-tensor-type q8_0_r8 \
    --custom-q "$custom" \
    /mnt/raid/models/deepseek-ai/DeepSeek-V3-0324-bf16-GGUF/DeepSeek-256x21B-V3-0324-BF16-00001-of-00030.gguf \
    /mnt/raid/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-CPU-IQ4_K_R4.gguf \
    IQ4_K_R4 \
    24
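
For reference, here is a minimal sketch of what the grep/sed step in the script does to the rule list, run on a three-rule sample. The sample is made up for illustration (token_embd is set to q8_0, following the UPDATE note at the top), and the pipeline relies on GNU sed for the -z option. I use printf instead of echo only to avoid differences in how shells treat backslashes.

```shell
# Small sample rule list in the same layout as the recipe script.
sample='
# Token embedding and output tensors
token_embd\.weight=q8_0
output\.weight=q8_0_r8

# First 3 dense layers (0-2)
blk\.[0-2]\..*=q8_0_r8
'

# Same pipeline as the recipe: drop comment lines, then treat the whole
# input as one record (-z), turn each run of newlines into a single comma,
# and trim the leading and trailing comma.
flat=$(
  printf '%s\n' "$sample" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

printf '%s\n' "$flat"
```

The result is a single comma-separated string of pattern=type pairs, which is the form passed to --custom-q in the command above.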

Output Logs

main: build = 3613 (4819257c)
main: built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
main: quantizing '/mnt/raid/models/deepseek-ai/DeepSeek-V3-0324-bf16-GGUF/DeepSeek-256x21B-V3-0324-BF16-00001-of-00030.gguf' to '/mnt/raid/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-CPU-IQ4_K_R4.gguf' as IQ4_K_R4 using 24 threads
llama_model_loader: additional 29 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 49 key-value pairs and 1147 tensors from /mnt/raid/models/deepseek-ai/DeepSeek-V3-0324-bf16-GGUF/DeepSeek-256x21B-V3-0324-BF16-00001-of-00030.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek V3 0324
llama_model_loader: - kv   3:                            general.version str              = V3-0324
llama_model_loader: - kv   4:                           general.basename str              = DeepSeek
llama_model_loader: - kv   5:                         general.size_label str              = 256x21B
llama_model_loader: - kv   6:                            general.license str              = mit
llama_model_loader: - kv   7:                      deepseek2.block_count u32              = 61
llama_model_loader: - kv   8:                   deepseek2.context_length u32              = 163840
llama_model_loader: - kv   9:                 deepseek2.embedding_length u32              = 7168
llama_model_loader: - kv  10:              deepseek2.feed_forward_length u32              = 18432
llama_model_loader: - kv  11:             deepseek2.attention.head_count u32              = 128
llama_model_loader: - kv  12:          deepseek2.attention.head_count_kv u32              = 128
llama_model_loader: - kv  13:                   deepseek2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  14: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  15:                deepseek2.expert_used_count u32              = 8
llama_model_loader: - kv  16:                          general.file_type u32              = 32
llama_model_loader: - kv  17:        deepseek2.leading_dense_block_count u32              = 3
llama_model_loader: - kv  18:                       deepseek2.vocab_size u32              = 129280
llama_model_loader: - kv  19:            deepseek2.attention.q_lora_rank u32              = 1536
llama_model_loader: - kv  20:           deepseek2.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  21:             deepseek2.attention.key_length u32              = 192
llama_model_loader: - kv  22:           deepseek2.attention.value_length u32              = 128
llama_model_loader: - kv  23:       deepseek2.expert_feed_forward_length u32              = 2048
llama_model_loader: - kv  24:                     deepseek2.expert_count u32              = 256
llama_model_loader: - kv  25:              deepseek2.expert_shared_count u32              = 1
llama_model_loader: - kv  26:             deepseek2.expert_weights_scale f32              = 2.500000
llama_model_loader: - kv  27:              deepseek2.expert_weights_norm bool             = true
llama_model_loader: - kv  28:               deepseek2.expert_gating_func u32              = 2
llama_model_loader: - kv  29:             deepseek2.rope.dimension_count u32              = 64
llama_model_loader: - kv  30:                deepseek2.rope.scaling.type str              = yarn
llama_model_loader: - kv  31:              deepseek2.rope.scaling.factor f32              = 40.000000
llama_model_loader: - kv  32: deepseek2.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  33: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.100000
llama_model_loader: - kv  34:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  35:                         tokenizer.ggml.pre str              = deepseek-v3
llama_model_loader: - kv  36:                      tokenizer.ggml.tokens arr[str,129280]  = ["
llama_model_loader: - kv  37:                  tokenizer.ggml.token_type arr[i32,129280]  = [3
llama_model_loader: - kv  38:                      tokenizer.ggml.merges arr[str,127741]  = ["
llama_model_loader: - kv  39:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  40:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  41:            tokenizer.ggml.padding_token_id u32              = 1
llama_model_loader: - kv  42:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  43:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  44:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  45:               general.quantization_version u32              = 2
llama_model_loader: - kv  46:                                   split.no u16              = 0
llama_model_loader: - kv  47:                                split.count u16              = 30
llama_model_loader: - kv  48:                        split.tensors.count i32              = 1147
llama_model_loader: - type  f32:  361 tensors
llama_model_loader: - type bf16:  786 tensors
================================ Have weights data with 720 entries
[   1/1147]                    token_embd.weight - [ 7168, 129280,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor token_embd.weight

====== llama_model_quantize_internal: did not find weights for token_embd.weight
converting to q8_0_r8 .. Adding custom rule token_embd\.weight -> q8_0_r8
Adding custom rule output\.weight -> q8_0_r8
Adding custom rule output_norm\.weight -> q8_0_r8
Adding custom rule blk\.[0-2]\..* -> q8_0_r8
Adding custom rule blk\.[3-9]\.attn_.* -> q8_0_r8
Adding custom rule blk\.[1-5][0-9]\.attn_.* -> q8_0_r8
Adding custom rule blk\.60\.attn_.* -> q8_0_r8
Adding custom rule blk\.[3-9]\.ffn_norm\.weight -> q8_0_r8
Adding custom rule blk\.[1-5][0-9]\.ffn_norm\.weight -> q8_0_r8
Adding custom rule blk\.60\.ffn_norm\.weight -> q8_0_r8
Adding custom rule blk\.[3-9]\.exp_probs_b\.bias -> q8_0_r8
Adding custom rule blk\.[1-5][0-9]\.exp_probs_b\.bias -> q8_0_r8
Adding custom rule blk\.60\.exp_probs_b\.bias -> q8_0_r8
Adding custom rule blk\.[3-9]\.ffn_down_shexp\.weight -> q8_0_r8
Adding custom rule blk\.[1-5][0-9]\.ffn_down_shexp\.weight -> q8_0_r8
Adding custom rule blk\.60\.ffn_down_shexp\.weight -> q8_0_r8
Adding custom rule blk\.[3-9]\.ffn_(gate|up)_shexp\.weight -> q8_0_r8
Adding custom rule blk\.[1-5][0-9]\.ffn_(gate|up)_shexp\.weight -> q8_0_r8
Adding custom rule blk\.60\.ffn_(gate|up)_shexp\.weight -> q8_0_r8
Adding custom rule blk\.[3-9]\.ffn_down_exps\.weight -> iq5_k_r4
Adding custom rule blk\.[1-5][0-9]\.ffn_down_exps\.weight -> iq5_k_r4
Adding custom rule blk\.60\.ffn_down_exps\.weight -> iq5_k_r4
Adding custom rule blk\.[3-9]\.ffn_(gate|up)_exps\.weight -> iq4_k_r4
Adding custom rule blk\.[1-5][0-9]\.ffn_(gate|up)_exps\.weight -> iq4_k_r4
Adding custom rule blk\.60\.ffn_(gate|up)_exps\.weight -> iq4_k_r4
load_imatrix: imatrix dataset='calibration_data_v5_rc.txt'
load_imatrix: loaded 720 importance matrix entries from /mnt/raid/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324.imatrix computed on 213 chunks
prepare_imatrix: have 720 importance matrix entries
size =  1767.50 MiB ->   938.98 MiB
[   2/1147]               blk.0.attn_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
[   3/1147]                blk.0.ffn_down.weight - [18432,  7168,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.0.ffn_down.weight
converting to q8_0_r8 .. size =   252.00 MiB ->   133.88 MiB
[   4/1147]                blk.0.ffn_gate.weight - [ 7168, 18432,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.0.ffn_gate.weight
converting to q8_0_r8 .. size =   252.00 MiB ->   133.88 MiB
[   5/1147]                  blk.0.ffn_up.weight - [ 7168, 18432,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.0.ffn_up.weight
converting to q8_0_r8 .. size =   252.00 MiB ->   133.88 MiB
[   6/1147]                blk.0.ffn_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
[   7/1147]          blk.0.attn_kv_a_norm.weight - [  512,     1,     1,     1], type =    f32, size =    0.002 MB
[   8/1147]           blk.0.attn_kv_a_mqa.weight - [ 7168,   576,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.0.attn_kv_a_mqa.weight
converting to q8_0_r8 .. size =     7.88 MiB ->     4.18 MiB
[   9/1147]               blk.0.attn_kv_b.weight - [  512, 32768,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.0.attn_kv_b.weight
converting to q8_0_r8 .. size =    32.00 MiB ->    17.00 MiB
[  10/1147]                blk.0.attn_k_b.weight - [  128, 65536,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.0.attn_k_b.weight

====== llama_model_quantize_internal: did not find weights for blk.0.attn_k_b.weight
converting to q8_0_r8 .. size =    16.00 MiB ->     8.50 MiB
[  11/1147]                blk.0.attn_v_b.weight - [  512, 16384,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.0.attn_v_b.weight
converting to q8_0_r8 .. size =    16.00 MiB ->     8.50 MiB
[  12/1147]             blk.0.attn_output.weight - [16384,  7168,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.0.attn_output.weight
converting to q8_0_r8 .. size =   224.00 MiB ->   119.00 MiB
[  13/1147]           blk.0.attn_q_a_norm.weight - [ 1536,     1,     1,     1], type =    f32, size =    0.006 MB
[  14/1147]                blk.0.attn_q_a.weight - [ 7168,  1536,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.0.attn_q_a.weight
converting to q8_0_r8 .. size =    21.00 MiB ->    11.16 MiB
[  15/1147]                blk.0.attn_q_b.weight - [ 1536, 24576,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.0.attn_q_b.weight
converting to q8_0_r8 .. size =    72.00 MiB ->    38.25 MiB
[  16/1147]               blk.1.attn_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
[  17/1147]                blk.1.ffn_down.weight - [18432,  7168,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.1.ffn_down.weight
converting to q8_0_r8 .. size =   252.00 MiB ->   133.88 MiB
[  18/1147]                blk.1.ffn_gate.weight - [ 7168, 18432,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.1.ffn_gate.weight
converting to q8_0_r8 .. size =   252.00 MiB ->   133.88 MiB
[  19/1147]                  blk.1.ffn_up.weight - [ 7168, 18432,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.1.ffn_up.weight
converting to q8_0_r8 .. size =   252.00 MiB ->   133.88 MiB
[  20/1147]                blk.1.ffn_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
[  21/1147]          blk.1.attn_kv_a_norm.weight - [  512,     1,     1,     1], type =    f32, size =    0.002 MB
[  22/1147]           blk.1.attn_kv_a_mqa.weight - [ 7168,   576,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.1.attn_kv_a_mqa.weight
converting to q8_0_r8 .. size =     7.88 MiB ->     4.18 MiB
[  23/1147]               blk.1.attn_kv_b.weight - [  512, 32768,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.1.attn_kv_b.weight
converting to q8_0_r8 .. size =    32.00 MiB ->    17.00 MiB
[  24/1147]                blk.1.attn_k_b.weight - [  128, 65536,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.1.attn_k_b.weight

====== llama_model_quantize_internal: did not find weights for blk.1.attn_k_b.weight
converting to q8_0_r8 .. size =    16.00 MiB ->     8.50 MiB
[  25/1147]                blk.1.attn_v_b.weight - [  512, 16384,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.1.attn_v_b.weight
converting to q8_0_r8 .. size =    16.00 MiB ->     8.50 MiB
[  26/1147]             blk.1.attn_output.weight - [16384,  7168,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.1.attn_output.weight
converting to q8_0_r8 .. size =   224.00 MiB ->   119.00 MiB
[  27/1147]           blk.1.attn_q_a_norm.weight - [ 1536,     1,     1,     1], type =    f32, size =    0.006 MB
[  28/1147]                blk.1.attn_q_a.weight - [ 7168,  1536,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.1.attn_q_a.weight
converting to q8_0_r8 .. size =    21.00 MiB ->    11.16 MiB
[  29/1147]                blk.1.attn_q_b.weight - [ 1536, 24576,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.1.attn_q_b.weight
converting to q8_0_r8 .. size =    72.00 MiB ->    38.25 MiB
[  30/1147]               blk.2.attn_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
[  31/1147]                blk.2.ffn_down.weight - [18432,  7168,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.2.ffn_down.weight
converting to q8_0_r8 .. size =   252.00 MiB ->   133.88 MiB
[  32/1147]                blk.2.ffn_gate.weight - [ 7168, 18432,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.2.ffn_gate.weight
converting to q8_0_r8 .. size =   252.00 MiB ->   133.88 MiB
[  33/1147]                  blk.2.ffn_up.weight - [ 7168, 18432,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.2.ffn_up.weight
converting to q8_0_r8 .. size =   252.00 MiB ->   133.88 MiB
[  34/1147]                blk.2.ffn_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
[  35/1147]          blk.2.attn_kv_a_norm.weight - [  512,     1,     1,     1], type =    f32, size =    0.002 MB
[  36/1147]           blk.2.attn_kv_a_mqa.weight - [ 7168,   576,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.2.attn_kv_a_mqa.weight
converting to q8_0_r8 .. size =     7.88 MiB ->     4.18 MiB
[  37/1147]               blk.2.attn_kv_b.weight - [  512, 32768,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.2.attn_kv_b.weight
converting to q8_0_r8 .. size =    32.00 MiB ->    17.00 MiB
[  38/1147]                blk.2.attn_k_b.weight - [  128, 65536,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.2.attn_k_b.weight

====== llama_model_quantize_internal: did not find weights for blk.2.attn_k_b.weight
converting to q8_0_r8 .. size =    16.00 MiB ->     8.50 MiB
[  39/1147]                blk.2.attn_v_b.weight - [  512, 16384,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.2.attn_v_b.weight
converting to q8_0_r8 .. size =    16.00 MiB ->     8.50 MiB
[  40/1147]             blk.2.attn_output.weight - [16384,  7168,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.2.attn_output.weight
converting to q8_0_r8 .. size =   224.00 MiB ->   119.00 MiB
[  41/1147]           blk.2.attn_q_a_norm.weight - [ 1536,     1,     1,     1], type =    f32, size =    0.006 MB
[  42/1147]                blk.2.attn_q_a.weight - [ 7168,  1536,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.2.attn_q_a.weight
converting to q8_0_r8 .. size =    21.00 MiB ->    11.16 MiB
[  43/1147]                blk.2.attn_q_b.weight - [ 1536, 24576,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.2.attn_q_b.weight
converting to q8_0_r8 .. size =    72.00 MiB ->    38.25 MiB
[  44/1147]               blk.3.exp_probs_b.bias - [  256,     1,     1,     1], type =    f32, size =    0.001 MB
[  45/1147]            blk.3.ffn_gate_inp.weight - [ 7168,   256,     1,     1], type =    f32, size =    7.000 MB
[  46/1147]          blk.3.ffn_down_shexp.weight - [ 2048,  7168,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.3.ffn_down_shexp.weight
converting to q8_0_r8 .. size =    28.00 MiB ->    14.88 MiB
[  47/1147]          blk.3.ffn_gate_shexp.weight - [ 7168,  2048,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.3.ffn_gate_shexp.weight
converting to q8_0_r8 .. size =    28.00 MiB ->    14.88 MiB
[  48/1147]            blk.3.ffn_up_shexp.weight - [ 7168,  2048,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.3.ffn_up_shexp.weight
converting to q8_0_r8 .. size =    28.00 MiB ->    14.88 MiB
[  49/1147]          blk.3.attn_kv_a_norm.weight - [  512,     1,     1,     1], type =    f32, size =    0.002 MB
[  50/1147]           blk.3.attn_kv_a_mqa.weight - [ 7168,   576,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.3.attn_kv_a_mqa.weight
converting to q8_0_r8 .. size =     7.88 MiB ->     4.18 MiB
[  51/1147]               blk.3.attn_kv_b.weight - [  512, 32768,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.3.attn_kv_b.weight
converting to q8_0_r8 .. size =    32.00 MiB ->    17.00 MiB
[  52/1147]                blk.3.attn_k_b.weight - [  128, 65536,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.3.attn_k_b.weight

====== llama_model_quantize_internal: did not find weights for blk.3.attn_k_b.weight
converting to q8_0_r8 .. size =    16.00 MiB ->     8.50 MiB
[  53/1147]                blk.3.attn_v_b.weight - [  512, 16384,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.3.attn_v_b.weight
converting to q8_0_r8 .. size =    16.00 MiB ->     8.50 MiB
[  54/1147]             blk.3.attn_output.weight - [16384,  7168,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.3.attn_output.weight
converting to q8_0_r8 .. size =   224.00 MiB ->   119.00 MiB
[  55/1147]           blk.3.attn_q_a_norm.weight - [ 1536,     1,     1,     1], type =    f32, size =    0.006 MB
[  56/1147]                blk.3.attn_q_a.weight - [ 7168,  1536,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.3.attn_q_a.weight
converting to q8_0_r8 .. size =    21.00 MiB ->    11.16 MiB
[  57/1147]                blk.3.attn_q_b.weight - [ 1536, 24576,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.3.attn_q_b.weight
converting to q8_0_r8 .. size =    72.00 MiB ->    38.25 MiB
[  58/1147]               blk.3.attn_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
[  59/1147]           blk.3.ffn_down_exps.weight - [ 2048,  7168,   256,     1], type =   bf16, Using custom type iq5_k_r4 for tensor blk.3.ffn_down_exps.weight
converting to iq5_k_r4 .. size =  7168.00 MiB ->  2464.00 MiB
[  60/1147]           blk.3.ffn_gate_exps.weight - [ 7168,  2048,   256,     1], type =   bf16, Using custom type iq4_k_r4 for tensor blk.3.ffn_gate_exps.weight
converting to iq4_k_r4 .. size =  7168.00 MiB ->  2016.00 MiB
[  61/1147]             blk.3.ffn_up_exps.weight - [ 7168,  2048,   256,     1], type =   bf16, Using custom type iq4_k_r4 for tensor blk.3.ffn_up_exps.weight
converting to iq4_k_r4 .. size =  7168.00 MiB ->  2016.00 MiB
[  62/1147]                blk.3.ffn_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
[  63/1147]               blk.4.exp_probs_b.bias - [  256,     1,     1,     1], type =    f32, size =    0.001 MB
[  64/1147]            blk.4.ffn_gate_inp.weight - [ 7168,   256,     1,     1], type =    f32, size =    7.000 MB
[  65/1147]          blk.4.ffn_down_shexp.weight - [ 2048,  7168,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.4.ffn_down_shexp.weight
converting to q8_0_r8 .. size =    28.00 MiB ->    14.88 MiB
[  66/1147]          blk.4.ffn_gate_shexp.weight - [ 7168,  2048,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.4.ffn_gate_shexp.weight
converting to q8_0_r8 .. size =    28.00 MiB ->    14.88 MiB
[  67/1147]            blk.4.ffn_up_shexp.weight - [ 7168,  2048,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.4.ffn_up_shexp.weight
converting to q8_0_r8 .. size =    28.00 MiB ->    14.88 MiB
[  68/1147]          blk.4.attn_kv_a_norm.weight - [  512,     1,     1,     1], type =    f32, size =    0.002 MB
[  69/1147]           blk.4.attn_kv_a_mqa.weight - [ 7168,   576,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.4.attn_kv_a_mqa.weight
converting to q8_0_r8 .. size =     7.88 MiB ->     4.18 MiB
[  70/1147]               blk.4.attn_kv_b.weight - [  512, 32768,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.4.attn_kv_b.weight
converting to q8_0_r8 .. size =    32.00 MiB ->    17.00 MiB
[  71/1147]                blk.4.attn_k_b.weight - [  128, 65536,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.4.attn_k_b.weight

====== llama_model_quantize_internal: did not find weights for blk.4.attn_k_b.weight
converting to q8_0_r8 .. size =    16.00 MiB ->     8.50 MiB
[  72/1147]                blk.4.attn_v_b.weight - [  512, 16384,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.4.attn_v_b.weight
converting to q8_0_r8 .. size =    16.00 MiB ->     8.50 MiB
[  73/1147]             blk.4.attn_output.weight - [16384,  7168,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.4.attn_output.weight
converting to q8_0_r8 .. size =   224.00 MiB ->   119.00 MiB
[  74/1147]           blk.4.attn_q_a_norm.weight - [ 1536,     1,     1,     1], type =    f32, size =    0.006 MB
[  75/1147]                blk.4.attn_q_a.weight - [ 7168,  1536,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.4.attn_q_a.weight
converting to q8_0_r8 .. size =    21.00 MiB ->    11.16 MiB
[  76/1147]                blk.4.attn_q_b.weight - [ 1536, 24576,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.4.attn_q_b.weight
converting to q8_0_r8 .. size =    72.00 MiB ->    38.25 MiB
[  77/1147]               blk.4.attn_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
[  78/1147]           blk.4.ffn_down_exps.weight - [ 2048,  7168,   256,     1], type =   bf16, Using custom type iq5_k_r4 for tensor blk.4.ffn_down_exps.weight
converting to iq5_k_r4 .. size =  7168.00 MiB ->  2464.00 MiB
[  79/1147]           blk.4.ffn_gate_exps.weight - [ 7168,  2048,   256,     1], type =   bf16, Using custom type iq4_k_r4 for tensor blk.4.ffn_gate_exps.weight
converting to iq4_k_r4 .. size =  7168.00 MiB ->  2016.00 MiB
[  80/1147]             blk.4.ffn_up_exps.weight - [ 7168,  2048,   256,     1], type =   bf16, Using custom type iq4_k_r4 for tensor blk.4.ffn_up_exps.weight
converting to iq4_k_r4 .. size =  7168.00 MiB ->  2016.00 MiB
[  81/1147]                blk.4.ffn_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
[  82/1147]          blk.5.attn_kv_a_norm.weight - [  512,     1,     1,     1], type =    f32, size =    0.002 MB
[  83/1147]           blk.5.attn_kv_a_mqa.weight - [ 7168,   576,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.5.attn_kv_a_mqa.weight
converting to q8_0_r8 .. size =     7.88 MiB ->     4.18 MiB
[  84/1147]               blk.5.attn_kv_b.weight - [  512, 32768,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.5.attn_kv_b.weight
converting to q8_0_r8 .. size =    32.00 MiB ->    17.00 MiB
[  85/1147]                blk.5.attn_k_b.weight - [  128, 65536,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.5.attn_k_b.weight

====== llama_model_quantize_internal: did not find weights for blk.5.attn_k_b.weight

# SNIP text was too long for github issues

====== llama_model_quantize_internal: did not find weights for blk.59.attn_k_b.weight
converting to q8_0_r8 .. size =    16.00 MiB ->     8.50 MiB
[1117/1147]               blk.59.attn_v_b.weight - [  512, 16384,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.59.attn_v_b.weight
converting to q8_0_r8 .. size =    16.00 MiB ->     8.50 MiB
[1118/1147]            blk.59.attn_output.weight - [16384,  7168,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.59.attn_output.weight
converting to q8_0_r8 .. size =   224.00 MiB ->   119.00 MiB
[1119/1147]          blk.59.attn_q_a_norm.weight - [ 1536,     1,     1,     1], type =    f32, size =    0.006 MB
[1120/1147]               blk.59.attn_q_a.weight - [ 7168,  1536,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.59.attn_q_a.weight
converting to q8_0_r8 .. size =    21.00 MiB ->    11.16 MiB
[1121/1147]               blk.59.attn_q_b.weight - [ 1536, 24576,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.59.attn_q_b.weight
converting to q8_0_r8 .. size =    72.00 MiB ->    38.25 MiB
[1122/1147]              blk.59.attn_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
[1123/1147]          blk.59.ffn_down_exps.weight - [ 2048,  7168,   256,     1], type =   bf16, Using custom type iq5_k_r4 for tensor blk.59.ffn_down_exps.weight
converting to iq5_k_r4 .. size =  7168.00 MiB ->  2464.00 MiB
[1124/1147]          blk.59.ffn_gate_exps.weight - [ 7168,  2048,   256,     1], type =   bf16, Using custom type iq4_k_r4 for tensor blk.59.ffn_gate_exps.weight
converting to iq4_k_r4 .. size =  7168.00 MiB ->  2016.00 MiB
[1125/1147]            blk.59.ffn_up_exps.weight - [ 7168,  2048,   256,     1], type =   bf16, Using custom type iq4_k_r4 for tensor blk.59.ffn_up_exps.weight
converting to iq4_k_r4 .. size =  7168.00 MiB ->  2016.00 MiB
[1126/1147]               blk.59.ffn_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
[1127/1147]              blk.60.exp_probs_b.bias - [  256,     1,     1,     1], type =    f32, size =    0.001 MB
[1128/1147]           blk.60.ffn_gate_inp.weight - [ 7168,   256,     1,     1], type =    f32, size =    7.000 MB
[1129/1147]         blk.60.ffn_down_shexp.weight - [ 2048,  7168,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.60.ffn_down_shexp.weight
converting to q8_0_r8 .. size =    28.00 MiB ->    14.88 MiB
[1130/1147]         blk.60.ffn_gate_shexp.weight - [ 7168,  2048,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.60.ffn_gate_shexp.weight
converting to q8_0_r8 .. size =    28.00 MiB ->    14.88 MiB
[1131/1147]           blk.60.ffn_up_shexp.weight - [ 7168,  2048,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.60.ffn_up_shexp.weight
converting to q8_0_r8 .. size =    28.00 MiB ->    14.88 MiB
[1132/1147]         blk.60.attn_kv_a_norm.weight - [  512,     1,     1,     1], type =    f32, size =    0.002 MB
[1133/1147]          blk.60.attn_kv_a_mqa.weight - [ 7168,   576,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.60.attn_kv_a_mqa.weight
converting to q8_0_r8 .. size =     7.88 MiB ->     4.18 MiB
[1134/1147]              blk.60.attn_kv_b.weight - [  512, 32768,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.60.attn_kv_b.weight
converting to q8_0_r8 .. size =    32.00 MiB ->    17.00 MiB
[1135/1147]               blk.60.attn_k_b.weight - [  128, 65536,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.60.attn_k_b.weight

====== llama_model_quantize_internal: did not find weights for blk.60.attn_k_b.weight
converting to q8_0_r8 .. size =    16.00 MiB ->     8.50 MiB
[1136/1147]               blk.60.attn_v_b.weight - [  512, 16384,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.60.attn_v_b.weight
converting to q8_0_r8 .. size =    16.00 MiB ->     8.50 MiB
[1137/1147]            blk.60.attn_output.weight - [16384,  7168,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.60.attn_output.weight
converting to q8_0_r8 .. size =   224.00 MiB ->   119.00 MiB
[1138/1147]          blk.60.attn_q_a_norm.weight - [ 1536,     1,     1,     1], type =    f32, size =    0.006 MB
[1139/1147]               blk.60.attn_q_a.weight - [ 7168,  1536,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.60.attn_q_a.weight
converting to q8_0_r8 .. size =    21.00 MiB ->    11.16 MiB
[1140/1147]               blk.60.attn_q_b.weight - [ 1536, 24576,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor blk.60.attn_q_b.weight
converting to q8_0_r8 .. size =    72.00 MiB ->    38.25 MiB
[1141/1147]                        output.weight - [ 7168, 129280,     1,     1], type =   bf16, Using custom type q8_0_r8 for tensor output.weight

====== llama_model_quantize_internal: did not find weights for output.weight
converting to q8_0_r8 .. size =  1767.50 MiB ->   938.98 MiB
[1142/1147]              blk.60.attn_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
[1143/1147]          blk.60.ffn_down_exps.weight - [ 2048,  7168,   256,     1], type =   bf16, Using custom type iq5_k_r4 for tensor blk.60.ffn_down_exps.weight
converting to iq5_k_r4 .. size =  7168.00 MiB ->  2464.00 MiB
[1144/1147]          blk.60.ffn_gate_exps.weight - [ 7168,  2048,   256,     1], type =   bf16, Using custom type iq4_k_r4 for tensor blk.60.ffn_gate_exps.weight
converting to iq4_k_r4 .. size =  7168.00 MiB ->  2016.00 MiB
[1145/1147]            blk.60.ffn_up_exps.weight - [ 7168,  2048,   256,     1], type =   bf16, Using custom type iq4_k_r4 for tensor blk.60.ffn_up_exps.weight
converting to iq4_k_r4 .. size =  7168.00 MiB ->  2016.00 MiB
[1146/1147]               blk.60.ffn_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
[1147/1147]                   output_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
llama_model_quantize_internal: model size  = 1282038.27 MB
llama_model_quantize_internal: quant size  = 395450.97 MB

main: quantize time = 5308904.06 ms
main:    total time = 5308904.06 ms
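
To double-check which quant types were actually applied and how much each one contributes, the quantize log can be tallied per type. This is only a minimal sketch: the helper name `tally_quant_log` and the `sample` string are made up for illustration, and it assumes the log keeps the `converting to <type> .. size = A MiB -> B MiB` line format shown above.

```python
import re
from collections import defaultdict

# Matches lines like:
#   converting to q8_0_r8 .. size =    28.00 MiB ->    14.88 MiB
LINE_RE = re.compile(
    r"converting to (\S+) \.\. size =\s*([\d.]+) MiB ->\s*([\d.]+) MiB"
)

def tally_quant_log(text):
    """Return {type: (tensor_count, MiB_before, MiB_after)} for a quantize log."""
    totals = defaultdict(lambda: [0, 0.0, 0.0])
    for match in LINE_RE.finditer(text):
        qtype = match.group(1)
        entry = totals[qtype]
        entry[0] += 1                        # number of tensors of this type
        entry[1] += float(match.group(2))    # size before quantization
        entry[2] += float(match.group(3))    # size after quantization
    return {qtype: tuple(values) for qtype, values in totals.items()}

# A few lines copied from the log above (hypothetical sample, not the full log).
sample = """\
converting to q8_0_r8 .. size =    28.00 MiB ->    14.88 MiB
converting to iq5_k_r4 .. size =  7168.00 MiB ->  2464.00 MiB
converting to iq4_k_r4 .. size =  7168.00 MiB ->  2016.00 MiB
converting to iq4_k_r4 .. size =  7168.00 MiB ->  2016.00 MiB
"""

for qtype, (count, before, after) in sorted(tally_quant_log(sample).items()):
    print(f"{qtype:>10}: {count} tensors, {before:.2f} -> {after:.2f} MiB")
```

Running it over the full quantize log should give per-type tensor counts that can be compared against the `llama_model_loader: - type ...` lines printed at load time.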
Perplexity Procedure

Output Logs

$ numactl -N 1 -m 1 \
./build/bin/llama-perplexity \
    --model /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-CPU-IQ4_K_R4.gguf \
    -ctk q8_0 \
    -mla 3 -fa \
    -amb 512 \
    -fmoe \
    --ctx-size 512 \
    --ubatch-size 512 \
    -f wiki.test.raw \
    --seed 1337 \
    --numa numactl \
    --threads 128

main: build = 3613 (4819257c)
main: built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
main: seed  = 1337
llama_model_loader: loaded meta data with 50 key-value pairs and 1147 tensors from /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-CPU-IQ4_K_R4.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek V3 0324
llama_model_loader: - kv   3:                            general.version str              = V3-0324
llama_model_loader: - kv   4:                           general.basename str              = DeepSeek
llama_model_loader: - kv   5:                         general.size_label str              = 256x21B
llama_model_loader: - kv   6:                            general.license str              = mit
llama_model_loader: - kv   7:                      deepseek2.block_count u32              = 61
llama_model_loader: - kv   8:                   deepseek2.context_length u32              = 163840
llama_model_loader: - kv   9:                 deepseek2.embedding_length u32              = 7168
llama_model_loader: - kv  10:              deepseek2.feed_forward_length u32              = 18432
llama_model_loader: - kv  11:             deepseek2.attention.head_count u32              = 128
llama_model_loader: - kv  12:          deepseek2.attention.head_count_kv u32              = 128
llama_model_loader: - kv  13:                   deepseek2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  14: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  15:                deepseek2.expert_used_count u32              = 8
llama_model_loader: - kv  16:                          general.file_type u32              = 340
llama_model_loader: - kv  17:        deepseek2.leading_dense_block_count u32              = 3
llama_model_loader: - kv  18:                       deepseek2.vocab_size u32              = 129280
llama_model_loader: - kv  19:            deepseek2.attention.q_lora_rank u32              = 1536
llama_model_loader: - kv  20:           deepseek2.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  21:             deepseek2.attention.key_length u32              = 192
llama_model_loader: - kv  22:           deepseek2.attention.value_length u32              = 128
llama_model_loader: - kv  23:       deepseek2.expert_feed_forward_length u32              = 2048
llama_model_loader: - kv  24:                     deepseek2.expert_count u32              = 256
llama_model_loader: - kv  25:              deepseek2.expert_shared_count u32              = 1
llama_model_loader: - kv  26:             deepseek2.expert_weights_scale f32              = 2.500000
llama_model_loader: - kv  27:              deepseek2.expert_weights_norm bool             = true
llama_model_loader: - kv  28:               deepseek2.expert_gating_func u32              = 2
llama_model_loader: - kv  29:             deepseek2.rope.dimension_count u32              = 64
llama_model_loader: - kv  30:                deepseek2.rope.scaling.type str              = yarn
llama_model_loader: - kv  31:              deepseek2.rope.scaling.factor f32              = 40.000000
llama_model_loader: - kv  32: deepseek2.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  33: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.100000
llama_model_loader: - kv  34:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  35:                         tokenizer.ggml.pre str              = deepseek-v3
llama_model_loader: - kv  36:                      tokenizer.ggml.tokens arr[str,129280]  = ["
llama_model_loader: - kv  37:                  tokenizer.ggml.token_type arr[i32,129280]  = [3
llama_model_loader: - kv  38:                      tokenizer.ggml.merges arr[str,127741]  = ["
llama_model_loader: - kv  39:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  40:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  41:            tokenizer.ggml.padding_token_id u32              = 1
llama_model_loader: - kv  42:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  43:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  44:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  45:               general.quantization_version u32              = 2
llama_model_loader: - kv  46:                      quantize.imatrix.file str              = /mnt/raid/models/ubergarm/DeepSeek-V3...
llama_model_loader: - kv  47:                   quantize.imatrix.dataset str              = calibration_data_v5_rc.txt
llama_model_loader: - kv  48:             quantize.imatrix.entries_count i32              = 720
llama_model_loader: - kv  49:              quantize.imatrix.chunks_count i32              = 213
llama_model_loader: - type  f32:  361 tensors
llama_model_loader: - type q8_0_r8:  612 tensors
llama_model_loader: - type iq4_k_r4:  116 tensors
llama_model_loader: - type iq5_k_r4:   58 tensors
llm_load_vocab: special tokens cache size = 818
llm_load_vocab: token to piece cache size = 0.8223 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = deepseek2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 129280
llm_load_print_meta: n_merges         = 127741
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 163840
llm_load_print_meta: n_embd           = 7168
llm_load_print_meta: n_layer          = 61
llm_load_print_meta: n_head           = 128
llm_load_print_meta: n_head_kv        = 128
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 192
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 24576
llm_load_print_meta: n_embd_v_gqa     = 16384
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 18432
llm_load_print_meta: n_expert         = 256
llm_load_print_meta: n_expert_used    = 8
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = yarn
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 0.025
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 671B
llm_load_print_meta: model ftype      = IQ4_K_R4 - 4.5 bpw
llm_load_print_meta: model params     = 672.050 B
llm_load_print_meta: model size       = 386.183 GiB (4.936 BPW)
llm_load_print_meta: repeating layers = 384.349 GiB (4.926 BPW, 670.196 B parameters)
llm_load_print_meta: general.name     = DeepSeek V3 0324
llm_load_print_meta: BOS token        = 0 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token        = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token        = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token         = 131 'Ä'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_layer_dense_lead   = 3
llm_load_print_meta: n_lora_q             = 1536
llm_load_print_meta: n_lora_kv            = 512
llm_load_print_meta: n_ff_exp             = 2048
llm_load_print_meta: n_expert_shared      = 1
llm_load_print_meta: expert_weights_scale = 2.5
llm_load_print_meta: expert_weights_norm  = 1
llm_load_print_meta: expert_gating_func   = sigmoid
llm_load_print_meta: rope_yarn_log_mul    = 0.1000
llm_load_tensors: ggml ctx size =    0.47 MiB
llm_load_tensors:        CPU buffer size = 395450.97 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn   = 3
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe  = 1
llama_new_context_with_model: ser        = -1, 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init: layer 0: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 1: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 2: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 3: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 4: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 5: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 6: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 7: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 8: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 9: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 10: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 11: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 12: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 13: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 14: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 15: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 16: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 17: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 18: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 19: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 20: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 21: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 22: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 23: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 24: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 25: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 26: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 27: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 28: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 29: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 30: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 31: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 32: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 33: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 34: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 35: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 36: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 37: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 38: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 39: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 40: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 41: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 42: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 43: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 44: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 45: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 46: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 47: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 48: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 49: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 50: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 51: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 52: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 53: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 54: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 55: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 56: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 57: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 58: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 59: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 60: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init:        CPU KV buffer size =    72.91 MiB
llama_new_context_with_model: KV self size  =   72.91 MiB, c^KV (q8_0):   72.91 MiB, kv^T: not used
llama_new_context_with_model:        CPU  output buffer size =     1.97 MiB
llama_new_context_with_model:        CPU compute buffer size =   450.01 MiB
llama_new_context_with_model: graph nodes  = 3487
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 128 / 512 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 928.692 ms
perplexity: calculating perplexity over 561 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 15.08 seconds per pass - ETA 35.23 minutes
[1]621042.4845,[2]480288.4154,[3]384849.5504,[4]411291.6749,[5]342382.0527,[6]347496.7446,[7]338598.0612,[8]338938.1630,[9]343341.0863,[10]329407.7871
,[11]328794.0950,[12]349036.2429,[13]339812.6162,[14]327127.2843,[15]318294.3349,[16]320629.0762,[17]318911.2283,[18]306946.2653,[19]320742.9747,[20]3
20520.4166,[21]323369.9752,[22]321108.7583,[23]320950.8245,[24]323537.1597,[25]313530.9380,[26]307858.8254,[27]305584.6174,[28]304930.6946,[29]319325.
7633,[30]316463.6020,[31]318028.8556,[32]323730.7568,[33]336376.3859,[34]338644.6368,[35]341295.5596,[36]346582.1772,[37]343638.6921,[38]346920.0126,[
39]346553.2755,[40]339975.1907,[41]338080.6482,[42]341607.0511,[43]342165.4351,[44]343495.4481,[45]341683.0497,[46]341841.5203,[47]341968.2578,[48]341
018.8794,[49]337906.3680,[50]340880.4017,[51]343264.7780,[52]341172.5260,[53]341895.8030,[54]342362.6716,[55]339077.7577,[56]337472.3629,[57]338597.41
79,[58]338840.4233,[59]340391.7068,[60]341329.9617,[61]338907.2644,[62]338654.8390,[63]340597.6581,[64]341464.0272,[65]339761.5866,[66]337473.1508,[67
]334628.2254,[68]335027.1919,[69]336085.7135,[70]334748.0318,[71]334310.4754,[72]332610.8172,[73]331121.5117,[74]331604.3876,[75]331320.1529,[76]33491
0.8814,[77]336051.4006,[78]335753.6115,[79]337362.5269,[80]335564.3466,[81]332456.8750,[82]331609.4385,[83]333316.4520,[84]335084.6156,[85]334711.4110
,[86]334160.7888,[87]332126.7278,[88]331597.7024,[89]331461.8908,[90]330703.9912,[91]331143.7667,[92]328566.8218,[93]327220.3991,[94]327306.2202,[95]3
28760.6069,[96]331831.1512,[97]331100.4377,[98]331676.2039,[99]331115.3237,[100]332922.5225,[101]330521.2050,[102]330638.9063,[103]330508.2943,[104]33
3336.3249,[105]332252.4134,[106]331511.8882,[107]331478.9005,[108]330800.7499,[109]331643.0452,[110]332295.2747,[111]331716.4016,[112]333145.4543,[113
]332446.6042,[114]332605.4088,[115]334144.7878,[116]334062.6775,[117]334795.9300,[118]335185.6388,[119]336442.8975,[120]336288.3524,[121]337854.3067,[
122]342121.8593,[123]342443.4687,[124]343659.0524,[125]344785.3775,[126]345809.3526,[127]347207.6305,[128]348210.4479,[129]349672.3288,[130]350221.461
2,[131]350215.0059,[132]352167.2450,[133]351660.6672,[134]353361.5754,[135]354848.8108,[136]353175.7897,[137]353870.5511,[138]355061.4101,[139]355874.
4197,[140]356669.3123,[141]355293.1474,[142]354584.2063,[143]353505.6443,[144]354011.7258,[145]352950.0290,[146]352775.3758,[147]350332.0398,[148]3489
19.1460,[149]348589.1782,[150]348457.2881,[151]347884.5859,[152]347551.9711,[153]346394.1977,[154]345076.4034,[155]342799.4862,[156]342481.4941,[157]3
42472.8007,[158]341437.5809,[159]341069.4855,[160]340176.4801,[161]340547.0153,[162]341245.8648,[163]340449.0528,[164]339162.6069,[165]339049.6867,[16
6]340108.0202,[167]338993.8220,[168]338633.1774,[169]337653.7408,[170]337330.2507,[171]337964.2748,[172]336817.5461,[173]335656.4557,[174]335356.9395,
[175]335636.9791,[176]336962.6238,[177]336571.5140,[178]336611.6326,[179]336169.1428,[180]337152.8681,[181]336928.3568,[182]337374.7017,[183]336574.88
30,[184]336549.1612,[185]336890.1861,[186]336270.8240,[187]336033.7314,[188]336260.7362,[189]336337.6063,[190]335905.2686,[191]335671.5326,[192]336063
.9825,[193]336254.3945,[194]336390.3271,[195]336058.7223,[196]336123.5871,[197]336272.6905,[198]336581.7609,[199]336125.9311,[200]336175.1478,[201]335
261.2004,[202]335722.4991,[203]335732.0036,[204]336010.6380,[205]336554.9746,[206]336870.3485,[207]337512.5650,[208]337800.7907,[209]337957.8198,[210]
339006.8855,[211]339536.3558,[212]339771.6654,[213]339820.9878,[214]340649.4873,[215]340871.1208,[216]341088.6222,[217]340871.9526,[218]340944.1487,[2
19]341612.6012,[220]342518.8541,[221]342988.1971,[222]342574.7840,[223]343481.4894,[224]343029.3821,[225]343295.2932,[226]343032.9993,[227]343704.6932
,[228]345175.9576,[229]345567.2666,[230]346984.2971,[231]347891.9790,[232]348421.3554,[233]347906.3728,[234]348105.3882,[235]347709.6448,[236]347865.7
097,[237]347051.5113,[238]347476.0560,[239]348607.8464,[240]347950.9243,[241]348175.2049,[242]348260.1216,[243]348118.1121,[244]349105.7627,[245]35034
3.6532,[246]351018.4541,[247]349972.1138,[248]349626.9985,[249]349815.8200,[250]349784.0491,[251]349044.6743,[252]348851.4149,[253]347922.8042,[254]34
7737.7496,[255]347553.6986,[256]347998.6214,[257]348681.4274,[258]348605.3748,[259]347746.3318,[260]347249.1009,[261]347208.6900,[262]346804.7642,[263
]346325.7216,[264]345906.9311,[265]345908.3860,[266]345701.0113,[267]345709.4001,[268]345912.5002,[269]346098.0048,[270]345980.1661,[271]345810.4070,[
272]345554.0991,[273]345337.1543,[274]344923.7055,[275]344460.3920,[276]343342.6230,[277]343576.3771,[278]342718.8707,[279]342988.6333,[280]343045.420
5,[281]342954.1471,[282]343121.6664,[283]343447.0750,[284]343345.1687,[285]343518.5285,[286]343098.9947,[287]342822.1719,[288]342853.3967,[289]343641.
2162,[290]343374.6100,[291]343746.9794,[292]343718.3872,[293]343928.4375,[294]344298.2272,[295]344357.2789,[296]344897.7471,[297]343889.5777,[298]3443
89.0557,[299]345317.8505,[300]344843.8735,[301]345089.1796,[302]345391.7513,[303]344981.9309,[304]345274.1943,[305]345361.9946,[306]344615.1515,[307]3
44191.7641,[308]344244.3699,[309]343919.6349,[310]344199.1177,[311]344405.9163,[312]344450.0979,[313]344439.8224,[314]344141.4730,[315]342825.3627,[31
6]341433.4296,[317]340663.0907,[318]339582.1865,[319]338423.3959,[320]338431.9492,[321]338115.6464,[322]337707.7252,[323]337509.5115,[324]337143.1945,
[325]336863.2449,[326]336823.7532,[327]336944.8010,[328]336631.8671,[329]335992.6150,[330]335818.9447,[331]335230.9186,[332]335293.0504,[333]334905.10
22,[334]335016.8497,[335]334882.2233,[336]335010.3878,[337]334898.4524,[338]334669.4391,[339]334527.0858,[340]334121.5989,[341]333836.9861,[342]334106
.1635,[343]334063.7962,[344]334203.4633,[345]334543.9787,[346]334077.9966,[347]334284.0650,[348]334445.7269,[349]334827.9118,[350]334821.3506,[351]334
479.8770,[352]334176.5657,[353]334025.4542,[354]333939.9035,[355]333898.6704,[356]333624.9149,[357]333237.7507,[358]333661.4850,[359]334098.6600,[360]
334318.0128,[361]334045.3073,[362]333919.0924,[363]333648.6163,[364]334117.8579,[365]334137.6652,[366]334344.9832,[367]334292.8768,[368]334416.0816,[3
69]334236.0430,[370]334155.9937,[371]333734.8777,[372]334073.4287,[373]333972.2325,[374]333610.6319,[375]333627.4234,[376]333967.3869,[377]334455.1315
,[378]334648.7305,[379]334723.9790,[380]334915.8106,[381]334783.0520,[382]334792.9807,[383]334292.3066,[384]334761.0592,[385]334650.0049,[386]334250.9
363,[387]334130.7030,[388]334962.6261,[389]335103.6648,[390]334964.4796,[391]335155.0150,[392]335258.2591,[393]335715.2107,[394]336216.3549,[395]33678
4.9280,[396]336825.6375,[397]336514.6311,[398]336291.0403,[399]335938.5148,[400]335934.1942,[401]336392.6242,[402]335974.0197,[403]336289.9238,[404]33
6379.4946,[405]336555.6353,[406]336369.9217,[407]336264.4100,[408]336306.2972,[409]336062.0189,[410]336218.9131,[411]335872.2278,[412]335754.9736,[413
]335586.0973,[414]335124.5066,[415]335378.1566,[416]335487.5042,[417]335712.7851,[418]335428.0417,[419]335734.1041,[420]336284.5707,[421]336296.1309,[
422]335716.1559,[423]335819.8443,[424]335746.8833,[425]335446.8556,[426]335455.4698,[427]335421.7328,[428]335308.4573,[429]335308.3605,[430]335634.427
1,[431]335941.7238,[432]335805.4835,[433]335864.1890,[434]335795.2289,[435]335790.3390,[436]336183.7092,[437]336053.6280,[438]336412.7182,[439]336779.
1893,[440]336638.0088,[441]336696.3587,[442]336693.5864,[443]336947.3901,[444]337364.4074,[445]337188.6797,[446]336960.3097,[447]336982.3581,[448]3367
40.4896,[449]336800.7335,[450]337456.5018,[451]337628.6795,[452]338075.0179,[453]338217.9506,[454]338563.8328,[455]338449.4376,[456]338244.9696,[457]3
38254.9905,[458]337899.0490,[459]338065.0851,[460]338084.4375,[461]338013.8557,[462]337774.4167,[463]338030.2594,[464]337997.7621,[465]338313.0132,[46
6]338480.3486,[467]338553.1094,[468]338698.8431,[469]338961.8873,[470]339099.5448,[471]339529.5247,[472]339518.9106,[473]339533.8010,[474]339280.8227,
[475]339337.3000,[476]339614.2696,[477]339436.1779,[478]339499.3813,[479]339569.9636,[480]339304.3727,[481]339458.5688,[482]339531.7829,[483]339698.45
70,[484]339156.1393,[485]339477.7685,[486]340238.3424,[487]340379.7815,[488]340655.9210,[489]340516.3203,[490]340570.0327,[491]340506.7411,[492]340278
.8962,[493]340258.7227,[494]340450.1686,[495]339995.1085,[496]340057.2055,[497]340209.0422,[498]339943.5230,[499]339784.5338,[500]339990.5147,[501]339
970.8131,[502]340371.5679,[503]340059.3617,[504]339792.6366,[505]339453.2254,[506]339424.0224,[507]339627.8620,[508]339683.1626,[509]339688.5786,[510]
339971.3743,[511]340134.1403,[512]340558.5657,[513]340734.9633,[514]341007.3962,[515]341043.8739,[516]341339.0372,[517]341604.4826,[518]341228.6644,[5
19]340909.3084,[520]340917.5889,[521]340871.2405,[522]340629.4603,[523]340600.1478,[524]340494.6514,[525]339985.5894,[526]339798.1336,[527]339423.1168
,[528]339574.7999,[529]338999.3788,[530]338866.6454,[531]339064.2290,[532]338175.7611,[533]338193.8181,[534]338591.1751,[535]338794.1938,[536]338815.3
925,[537]338854.7276,[538]338997.8122,[539]339560.6960,[540]339563.1839,[541]339606.7486,[542]339558.3348,[543]339493.1708,[544]339729.4373,[545]34020
8.8763,[546]340231.7345,[547]340359.0196,[548]340906.6126,[549]341063.1162,[550]341158.9496,[551]341645.1513,[552]341690.2990,[553]341566.8309,[554]34
1969.4067,[555]341819.3313,[556]341737.7033,[557]341893.9760,[558]341486.6305,[559]341186.3327,[560]340936.6909,[561]340925.0560,
llama_print_timings:        load time =    2238.45 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time = 2214337.48 ms / 287232 tokens (    7.71 ms per token,   129.71 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time = 2264778.41 ms / 287233 tokens

Final estimate: PPL = 340925.0560 +/- 2519.12041
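
For scale: perplexity is the exponential of the mean per-token negative log-likelihood, so a model guessing uniformly over the vocabulary would score a perplexity equal to `n_vocab` (129280 here). The sketch below compares the final estimate against that baseline. It uses only numbers from the log above, and the variable names are illustrative.

```python
import math

n_vocab = 129280          # from "llm_load_print_meta: n_vocab"
final_ppl = 340925.0560   # from "Final estimate: PPL = ..."

# Perplexity = exp(mean negative log-likelihood per token), measured in nats.
mean_nll = math.log(final_ppl)
uniform_nll = math.log(n_vocab)  # NLL of a uniform guess over the vocabulary

print(f"mean NLL of this run : {mean_nll:.3f} nats/token")
print(f"uniform-guess NLL    : {uniform_nll:.3f} nats/token")
print(f"worse than uniform?  : {final_ppl > n_vocab}")
```

A result above the uniform baseline means the model is doing worse than random guessing, which would be more consistent with a broken tensor or kernel than with ordinary quantization loss.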
llama-server response to chat client looks wrong

I tried various combinations of server configs, and all of them yielded the same wrong-looking responses in the client.

Start Server

#### First attempt (also tried -mla 2 in place of -mla 3)
numactl -N 0 -m 0 \
./build/bin/llama-server \
    --model /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-CPU-IQ4_K_R4.gguf \
    --alias ubergarm/DeepSeek-V3-0324-CPU-IQ4_K_R4 \
    --ctx-size 8192 \
    -ctk q8_0 \
    -mla 3 -fa \
    -amb 2048 \
    -fmoe \
    --temp 0.3 \
    --parallel 1 \
    --threads 128 \
    --numa numactl \
    --host 127.0.0.1 \
    --port 8080

#### Second attempt
numactl -N 0 -m 0 \
./build/bin/llama-server \
    --model /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-CPU-IQ4_K_R4.gguf \
    --alias ubergarm/DeepSeek-V3-0324-CPU-IQ4_K_R4 \
    --ctx-size 8192 \
    --parallel 1 \
    --threads 128 \
    --numa numactl \
    --host 127.0.0.1 \
    --port 8080

Start Client

$ python dchat.py
Input prompt then press Ctrl+D twice (or once on empty line) to send.
Ctrl+C to cancel response or twice to exit.

>>> User:

Count from 1 to 10 in French.

>>> Assistant:

AlrightAlrightAlrightAlright
>>> User:

^C^C

Exiting...
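
When retrying across several server configs, a quick heuristic can flag this kind of degenerate reply automatically. The following is a rough sketch only: `shortest_repeating_unit` is a hypothetical helper, not part of `dchat.py` or llama-server, and it only catches replies that consist entirely of one repeated chunk.

```python
def shortest_repeating_unit(text, min_repeats=3):
    """Return the shortest substring whose repetition makes up all of `text`.

    Returns None if `text` is not built from at least `min_repeats` copies
    of a single chunk.
    """
    n = len(text)
    for size in range(1, n // min_repeats + 1):
        # Only chunk sizes that divide the length evenly can tile the text.
        if n % size == 0 and text[:size] * (n // size) == text:
            return text[:size]
    return None

reply = "AlrightAlrightAlrightAlright"   # the response seen above
print(shortest_repeating_unit(reply))
```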
