-
I trawled through some of the PRs you linked to me and pulled together this rough guide as my notes for getting started with ik_llama.cpp. The biggest hurdle so far is needing a custom quant for MLA support; I'll work on that another time as I'm using the og unsloth quants. My initial impression is that with the right settings it can get faster prompt processing than ktransformers and about the same token generation. Looking forward to trying it with an MLA-supported quant.
-
Thank you for these results.
#259 should remove this hurdle. With this PR, models prepared with mainline llama.cpp can be used with MLA enabled.
-
Just thought you'd want to know this; manually notifying you as edits don't trigger notifications.
-
None that still do that haven't been mentioned in the conversation already; there was an issue with IQ1_S_R4, but that was fixed here: #194
Everything looks reasonable to me (especially since you were thorough and tried a bunch of valid combinations, and any valid combination shouldn't NaN on perplexity, but since all of them do that might help narrow down where the problem lies).
Nice.
-
I don't think it's the size that is the issue; iq2_bn_r4 is a bitnet quant. I briefly tested an IQ1_S_R4, which didn't even have the benefit of going to q8_0 for the non-expert tensors like you did, and I still got FAR more reasonable perplexity numbers (exact values here, with the quant log here). If you are still experimenting with quant types, you might be able to improve on your Q2_K_R4 at around the same size by replacing the q2_k_r4 and q3_k_r4 (which are k-quants) with similar-sized i-quants or iqk-quants. This PR #85 has a really nice chart focusing on that quant range (caveat: IQ3_KL is not a quant type, it is a quant recipe), and shows how the three different quant types (i, k, and iqk) stack up.
-
They are actually great. But they are Bitnet quants, i.e. quants for a model that has been trained such that the model weights take one of 3 possible values (-1, 0, 1). Hence, they absolutely cannot be used for normal models trained using actual floats. But that does not make them not great. The ternary quants in this repo (e.g., IQ2_BN) are great when used for actual ternary models.
-
@ubergarm huge thanks for this guide! Any chance you could publish the DeepSeek-R1_Q2_K_R4 quant described here?

First of all, thanks for doing all the research on running DeepSeek-R1 locally and publishing high quality technical details. Your posts on level1techs and reddit are currently the only good sources of information available on the subject. My internet searches related to purchasing decisions for running DSR1 always end up on one of your posts!

I started with a 7975wx system for CPU-only inference, and overclocked the memory controller based on your benchmarking on level1techs. Then, based on this guide, I ended up shelling out for an RTX 5090. Switching from CPU-only inference with ollama to CPU+GPU inference with ik_llama resulted in a 5x inference speedup. The speed improvements are more pronounced for longer contexts; I am able to get roughly 10 tps inference on a 40k context with the unsloth/DeepSeek-R1-UD-Q2_K_XL quant. Since the 5090 has more memory, I offloaded all the small layers onto the GPU with the --override-tensor option.
Would love to get my hands on the DeepSeek-R1_Q2_K_R4 quant!
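For illustration, here is a minimal sketch of the kind of -ot layout being described: routed experts stay on the CPU, while the experts of a few layers are pinned to the GPU to use the extra VRAM of a 32GB card. The model path, layer indices, thread count, and context size are placeholders, and it assumes earlier -ot rules take precedence over later ones; flag choices follow the settings discussed elsewhere in this thread.

```bash
# Sketch only: pin the routed experts of layers 3-5 (arbitrary choice) to the
# GPU and keep the remaining experts on CPU. Paths and counts are placeholders.
./build/bin/llama-server \
    --model /models/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf \
    --ctx-size 40960 \
    -mla 3 -fa -fmoe -amb 512 \
    --n-gpu-layers 63 \
    -ot "blk\.(3|4|5)\.ffn_.*_exps=CUDA0" \
    -ot exps=CPU \
    --threads 24 \
    --host 127.0.0.1 --port 8080
```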
-
Heya @anikiforovopensource, I appreciate the feedback; it's been great working with tools provided by the great developers to push the envelope! Glad you have found some of this useful.

I updated the guide with a link to the hugging face repo that contains a couple of quants: https://huggingface.co/ubergarm/DeepSeek-V3-0324-GGUF Sorry it is difficult to piece together all the bread crumbs across so many sites, but it sounds like you are having good success.

The 5090 with 32GB VRAM is actually a pretty great size for the quants I made. Use the CPU+GPU example on the model card; that is what I would recommend.

I'd love to see any benchmark results; you can see how to run the benchmarks above. Cheers and good luck, sounds like you have a great rig to experiment with!
-
Where are my 136k stars 😃
-
Has something changed with how llama-quantize wants the custom quant types specified? Specifically, it gives me an error for some of them.
-
There have been no changes related to custom quants. Can you post your full command?
-
Sure! I arrived at:
It also doesn't like q6_k, but is ok with q4_0. I dug around a little, but couldn't figure out why.
-
Oh, this is Kawrakow-style usability at its best! The "K" in k-quants needs to be capitalized. So, q6_K rather than q6_k. This applies only to k-quants (which is why q4_0 works fine in lowercase).
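For example, a rule set along these lines would be accepted, whereas writing the k-quant types in lowercase would not. This is only a sketch with made-up paths and tensor regexes, and the exact syntax of the custom-quant option should be double-checked against llama-quantize --help in your build.

```bash
# Sketch only: k-quant type names need the capital K (q8_0 and other legacy
# types stay lowercase). Paths and regexes are placeholders, not a recipe.
./build/bin/llama-quantize \
    --custom-q "token_embd\.weight=q8_0,attn_.*=q6_K,ffn_down_exps.*=q5_K,ffn_.*_exps.*=q4_K" \
    /models/DeepSeek-R1-BF16-00001-of-00030.gguf \
    /models/DeepSeek-R1-CUSTOM.gguf \
    Q4_K_M
```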
-
@ubergarm I incorporated some of your suggestions and re-ran the benchmark.
I ran llama-sweep-bench with a few different configurations.
From my tests,
I prefer to run R1 instead of V3, so I currently don't have the quant to utilize more RAM. I can run benchmarks on your quants.

Benchmark results (system: 7975wx with FCLK=2100, RAM at 5600 MHz, RTX 5090):
Partial benchmark logs:

GPU: -ctk f16 -ctv f16, --override-tensor all_but_3_exps (VRAM: 30G, RAM: 216G)
./build/bin/llama-sweep-bench
main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 200, n_threads = 32, n_threads_batch = 32

GPU (best so far): -ctk f16 -ctv f16, --override-tensor down_exps=CPU,gate_exps=CPU,up_exps=CPU (VRAM: 18.5G, RAM: 228G)
./build/bin/llama-sweep-bench
main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 200, n_threads = 32, n_threads_batch = 32

GPU: -ctk q8_0 (VRAM: 17.5G, RAM: 228G)
./build/bin/llama-sweep-bench
main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 200, n_threads = 32, n_threads_batch = 32

CPU with ctk=f16
./build/bin/llama-sweep-bench
-
This is only true when attention is computed on the GPU.
-
Hello, I have a question. I'm using a laptop 2060 and I'm trying to speed up partial offloading for Gemma 3 12B. I've compiled your build of llama.cpp with CUDA and AVX2 to see if there's any improvement compared to mainline; however, it was noticeably slower. In the readme it is mentioned that for CUDA you need to offload the token embeddings tensors to the GPU, but nowhere can I see the command to do that. I think it's --override-tensor, but I don't know the specific pattern. I tried ffn_down_exps=CUDA0, which resulted in a speedup almost on par with main, but using that together with ffn_up_exps=CUDA0,gate_exps=CUDA0 results in a performance loss again (although I think the latter is only for MoE models?). What is the command for doing that? Thank you!
-
Alright. I want to put down some baseline numbers. I've built a system with an EPYC 9175F and 768 GB @ 5600, with 2x RTX 6000 Ada Generation for 96 GB VRAM. Due to my dumb ass and inexperience with this kind of hardware, I'm running without GPUs and with the RAM configured at 3600 for the time being. Pulled down ubergarm/DeepSeek-V3-0324-IQ4_K_R4 and running it with ik_llama.cpp on master with my config flags. RTR seems to have a huge impact. Overall things are about 66% faster than mainline llama.cpp with the unsloth 4-bit quant. I'm actually okay with this TG, but I gotta get my PP up 😜; my use case requires trawling through a lot of context. I'll check back in when I get the GPUs working and the RAM at the expected speed.
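For anyone wanting a concrete starting point for a CPU-only run like this, here is a minimal sketch of the flags this thread generally converges on. The model path, thread count, and context size are placeholders, and it is not necessarily the exact configuration used in the comment above.

```bash
# Sketch of a CPU-only run. -rtr repacks tensors on the fly and can be dropped
# if the quant is already a repacked *_R4 one. Tune --threads to your cores.
./build/bin/llama-server \
    --model /models/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf \
    --ctx-size 32768 \
    -mla 3 -fa \
    -amb 512 \
    -fmoe \
    -rtr \
    --threads 16 \
    --host 127.0.0.1 --port 8080
```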
-
Can you please add llama-sweep-bench to the guide?
-
Thanks for putting this guide together! I have to say ik_llama.cpp has been a great experience so far for me:
I'm already very happy with the tokens/s I'm getting from ik_llama.cpp when using DeepSeek-R1-UD-Q2_K_XL:
What I'd like to try to optimize now is the context size. Specs of the machine:
Current maximum context size I managed to get so far was 41000. Full ik_llama.cpp run arguments:
Is there any way to squeeze a larger context size out of this hardware, while maintaining reasonable tokens/s (>15 tps)? Thanks for any help and for working on this!
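Not an authoritative answer, but the context-saving knobs discussed elsewhere in this thread are MLA with flash attention, a quantized KV cache (-ctk q8_0), and capping the attention compute buffer with -amb. A sketch of the kind of invocation meant, with placeholder paths, thread count, and layer count:

```bash
# Sketch combining the context-related options from this thread:
#   -mla 3 -fa   MLA + flash attention (much smaller KV cache for DeepSeek)
#   -ctk q8_0    quantized KV cache
#   -amb 512     cap the attention compute buffer at ~512 MiB
# Paths, threads, and the context target are placeholders.
./build/bin/llama-server \
    --model /models/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf \
    --ctx-size 65536 \
    -mla 3 -fa \
    -ctk q8_0 \
    -amb 512 \
    -fmoe \
    --n-gpu-layers 63 \
    --override-tensor exps=CPU \
    --threads 16 \
    --host 127.0.0.1 --port 8080
```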
-
Hi Everyone,

Great thread on the subject; it was very helpful for me to optimize the oldish hardware I currently have to play with this. I wanted to share some of the results of my experiments after reading everything here, and see if anyone has any further suggestions on how to make things faster for CPU only.

I'm using 2 Xeon Gold (Skylake) CPUs with 1TB of RAM. If I enable sub-NUMA clustering and leave interleaving disabled, the 2 CPUs present 4 NUMA nodes. With sub-NUMA clustering disabled and interleaving disabled, I get 1 node per CPU. And finally, with NUMA disabled and interleaving enabled, I get a single node for both CPUs.

Using the Intel mlc tool, the maximum bandwidth is achieved with 1 NUMA node per CPU, around 100 GB/s each. Having a single node for both CPUs gives me around 130 GB/s. In theory, going with 2 nodes should be faster, but in reality it seems like having everything consolidated under a single NUMA node is the fastest option (around 30% faster). I'm using Windows; perhaps the results would be better on Linux?

Best result I got so far:
G:\ik_llama>llama-bench.exe --model "G:\Qwen3-235B-A22B-128K-Q8_0-00001-of-00006.gguf" -mla 3 -fa 1 -t 28 --run-time-repack 1
Any suggestions are appreciated! :-)
-
What's the easiest method to produce a file that simply applies the --run-time-repack transformation to an existing GGUF? I can run DeepSeek at Q8_0, but the startup time is a killer.
-
Hi everyone,

First, I want to sincerely thank @ikawrakow for this amazing repo (definitely deserves much more attention!), and @ubergarm for his excellent guides, insights, and quants. Big appreciation also goes out to unsloth and bartowski.

I'm currently building a new AI/LLM machine. Although it's still a WIP (with some cooling issues), I couldn't resist running some tests. The final setup will run Proxmox and will have multiple GPUs, but for now it is an AMD Epyc 9355 with 768 GB RAM and a single RTX 4090 running Windows. Without much expertise, I managed to compile the library with:

cmake -B build -G Ninja ^
-DCMAKE_BUILD_TYPE=Release ^
-DLLAMA_CURL=OFF ^
-DGGML_CUDA=ON ^
-DGGML_BLAS=OFF ^
-DGGML_AVX512=ON ^
-DGGML_AVX512_VNNI=ON ^
-DGGML_AVX512_BF16=OFF ^
-DCMAKE_CUDA_ARCHITECTURES=89
cmake --build build --config Release -j $env:NUMBER_OF_PROCESSORS

Honestly, I'm unsure if I'm losing performance by disabling GGML_AVX512_BF16. With ik-llama finally running, I tested DeepSeek-V3 quants with various params, and ended up with these:
Results

Observations and Thoughts
Logs - ubergarm
Logs - unsloth
Logs - bartowski
I have NPS0 set in BIOS, and "LLC as NUMA domain (ACPI SRAT L3 Cache as NUMA domain)" ENABLED. It might be worth re-testing with this option DISABLED. I will test smaller and larger quants, too, but downloads take ages 😃. Anyway, just wanted to say "thanks" and share my excitement 💯.
-
Thank you for the kind words!
Please post the compilation errors you get with GGML_AVX512_BF16 enabled. There are places where I have added GEMM/GEMV implementations optimized for bf16, so you may be leaving some performance on the table.
-
I have a dual EPYC 9355 system which normally has 768 GB of RAM across 24 channels and scores roughly 720 GB/s memory bandwidth on the stream triad test. At the moment I have an RDIMM failure, so I'm down a stick and only have 23 channels and 736 GB of system RAM. I also have a Blackwell 6000 Pro on this system. I run with NPS4 set in the system BIOS, so I have 8 NUMA domains. I typically run DeepSeek-V3-0324 671b:Q4_K_XL, so that's the model I'll be showing benchmarks for here.

I run this before every llama server startup:

echo 0 | sudo tee /proc/sys/kernel/numa_balancing
echo 3 | sudo tee /proc/sys/vm/drop_caches

Using llama-batched-bench:

./build/bin/llama-batched-bench \
--model /data/DeepSeek-V3-0324-GGUF-UD/UD-Q4_K_XL/DeepSeek-V3-0324-UD-Q4_K_XL-00001-of-00008.gguf \
--numa numactl \
--threads 32 \
--ctx-size 163840 \
--n-gpu-layers 62 \
-ot ".ffn_.*_exps.=CPU" \
--seed 3407 \
--prio 3 \
--temp 0.3 \
--min-p 0.0 \
--flash-attn \
-npp 512 -ntg 128 -npl 1
main: n_kv_max = 163840, n_batch = 2048, n_ubatch = 512, flash_attn = 1, is_pp_shared = 0, n_gpu_layers = 62, n_threads = 32, n_threads_batch = 32
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
| 512 | 128 | 1 | 640 | 24.441 | 20.95 | 5.973 | 21.43 | 30.414 | 21.04 |

With llama-sweep-bench:

./build/bin/llama-sweep-bench \
--model /data/DeepSeek-V3-0324-GGUF-UD/UD-Q4_K_XL/DeepSeek-V3-0324-UD-Q4_K_XL-00001-of-00008.gguf \
--alias DeepSeek-V3-0324:671b-q4_k_xl \
--numa numactl \
--threads 32 \
--ctx-size 163840 \
--n-gpu-layers 62 \
-ot ".ffn_.*_exps.=CPU" \
--seed 3407 \
--temp 0.3 \
--min-p 0.0 \
--flash-attn \
--host 0.0.0.0 \
-mla 3 \
-fmoe \
-rtr \
--port 11434
main: n_kv_max = 163840, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 62, n_threads = 32, n_threads_batch = 32
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 512 | 128 | 0 | 3.862 | 132.56 | 15.186 | 8.43 |
| 512 | 128 | 512 | 3.851 | 132.94 | 15.240 | 8.40 |
| 512 | 128 | 1024 | 3.873 | 132.19 | 15.232 | 8.40 |
| 512 | 128 | 1536 | 3.925 | 130.45 | 15.253 | 8.39 |

I'm just curious: why is generation tok/s so much lower in llama-sweep-bench? Thanks!
-
I think you are observing a difference in GPU offload policy. In
which for DeepSeek-R1/V3 translates to 1024 tokens. So, basically, in this benchmark you are not using the GPU at all, everything runs on the CPU when using
will give a nice table with PP and TG performance for 0...32k tokens in the KV cache. I think in

Another comment related to the NUMA situation: I don't have access to a NUMA system myself, but people report that, sadly, on dual socket systems they get the best performance by disabling NUMA in the BIOS and running on a single CPU. @ubergarm has done quite a few experiments in that regard. I haven't followed what is happening in
-
Just to let you know guys, I did some benchmarks of ik_llama.cpp on my setup (192GB RAM + 208GB VRAM) with DeepSeek V3/R1/Chimera at Q2_K_XL, IQ3_XXS, IQ3_KS, Q3_K_XL and IQ4_XS on reddit, if you want to take a look! https://www.reddit.com/r/LocalLLaMA/comments/1lwnj5x/performance_benchmarks_on_deepseek/ The performance of ik_llama.cpp for these kinds of setups is really impressive!
-
@ikawrakow here it is with NPS0:

mla 3

./build/bin/llama-sweep-bench \
--model /data/DeepSeek-V3-0324-GGUF-UD/UD-Q4_K_XL/DeepSeek-V3-0324-UD-Q4_K_XL-00001-of-00008.gguf \
--alias DeepSeek-V3-0324:671b-q4_k_xl \
--threads 32 \
--ctx-size 163840 \
--n-gpu-layers 62 \
-ot ".ffn_.*_exps.=CPU" \
--seed 3407 \
--temp 0.3 \
--min-p 0.0 \
--flash-attn \
--host 0.0.0.0 \
-mla 3 \
-fmoe \
-rtr \
--port 11434
main: n_kv_max = 163840, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 62, n_threads = 32, n_threads_batch = 32
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 512 | 128 | 0 | 3.677 | 139.23 | 12.996 | 9.85 |
| 512 | 128 | 512 | 3.994 | 128.19 | 13.160 | 9.73 |
| 512 | 128 | 1024 | 4.020 | 127.37 | 13.161 | 9.73 |
| 512 | 128 | 1536 | 4.279 | 119.65 | 13.426 | 9.53 |
| 512 | 128 | 2048 | 4.193 | 122.11 | 13.596 | 9.41 |
| 512 | 128 | 2560 | 3.868 | 132.38 | 12.987 | 9.86 |
| 512 | 128 | 3072 | 4.655 | 109.98 | 13.682 | 9.36 |
| 512 | 128 | 3584 | 4.291 | 119.31 | 13.344 | 9.59 |
| 512 | 128 | 4096 | 4.287 | 119.44 | 12.890 | 9.93 |
| 512 | 128 | 4608 | 4.221 | 121.29 | 12.835 | 9.97 |

mla 2

./build/bin/llama-sweep-bench \
--model /data/DeepSeek-V3-0324-GGUF-UD/UD-Q4_K_XL/DeepSeek-V3-0324-UD-Q4_K_XL-00001-of-00008.gguf \
--alias DeepSeek-V3-0324:671b-q4_k_xl \
--threads 32 \
--ctx-size 163840 \
--n-gpu-layers 62 \
-ot ".ffn_.*_exps.=CPU" \
--seed 3407 \
--temp 0.3 \
--min-p 0.0 \
--flash-attn \
--host 0.0.0.0 \
-mla 2 \
-fmoe \
-rtr \
--port 11434
main: n_kv_max = 163840, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 62, n_threads = 32, n_threads_batch = 32
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 512 | 128 | 0 | 3.766 | 135.95 | 12.805 | 10.00 |
| 512 | 128 | 512 | 3.774 | 135.66 | 12.753 | 10.04 |
| 512 | 128 | 1024 | 3.833 | 133.59 | 13.051 | 9.81 |
| 512 | 128 | 1536 | 4.051 | 126.38 | 13.200 | 9.70 |
| 512 | 128 | 2048 | 3.882 | 131.89 | 13.089 | 9.78 |
| 512 | 128 | 2560 | 3.887 | 131.71 | 13.085 | 9.78 |
| 512 | 128 | 3072 | 3.993 | 128.24 | 13.275 | 9.64 |
| 512 | 128 | 3584 | 4.380 | 116.89 | 13.879 | 9.22 |
| 512 | 128 | 4096 | 4.273 | 119.82 | 13.199 | 9.70 |
| 512 | 128 | 4608 | 4.115 | 124.41 | 12.996 | 9.85 |

Doesn't seem to make much difference, mla 2 vs 3. PP speed does continue to rise past 32 threads though, which is surprising:

./build/bin/llama-sweep-bench \
--model /data/DeepSeek-V3-0324-GGUF-UD/UD-Q4_K_XL/DeepSeek-V3-0324-UD-Q4_K_XL-00001-of-00008.gguf \
--alias DeepSeek-V3-0324:671b-q4_k_xl \
--threads 61 \
--ctx-size 163840 \
--n-gpu-layers 62 \
-ot ".ffn_.*_exps.=CPU" \
--seed 3407 \
--temp 0.3 \
--min-p 0.0 \
--flash-attn \
--host 0.0.0.0 \
-mla 2 \
-fmoe \
-rtr \
--port 11434
main: n_kv_max = 163840, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 62, n_threads = 61, n_threads_batch = 61
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 512 | 128 | 0 | 3.274 | 156.36 | 12.792 | 10.01 |
| 512 | 128 | 512 | 3.174 | 161.33 | 12.924 | 9.90 |
| 512 | 128 | 1024 | 3.099 | 165.22 | 13.011 | 9.84 |
| 512 | 128 | 1536 | 3.204 | 159.83 | 13.140 | 9.74 |
| 512 | 128 | 2048 | 3.196 | 160.22 | 13.131 | 9.75 |
| 512 | 128 | 2560 | 3.093 | 165.54 | 13.327 | 9.60 |
| 512 | 128 | 3072 | 3.443 | 148.70 | 13.393 | 9.56 |
| 512 | 128 | 3584 | 3.369 | 151.97 | 13.454 | 9.51 |
| 512 | 128 | 4096 | 3.413 | 150.02 | 13.577 | 9.43 |
-
Transferring from kvcache-ai/ktransformers#1417. Short story -- I would like to switch to ik_llama.cpp from ktransformers (ktransformers is having huge stability problems). I would like to know how I can run DeepSeek R1/V3 with 128k context and more. ktransformers uses the matrix absorption trick (https://docs.flashinfer.ai/api/mla.html, https://github.com/madsys-dev/deepseekv2-profile/blob/main/workspace/blog/optimizing-mla.md) -- that is, flashinfer allows a single 24GB GPU to prefill up to 128k context (I never tried more because I didn't know DeepSeek supports 163k). So what can be done currently in my case to support large context? I have various machines, mostly with a Threadripper Pro 3995wx (incl. Lenovo-locked), overclocked Samsung ECC RAM up to 3200 MT/s, and currently up to 3 RTX 3090 FE GPUs per workstation with p2p enabled:
Currently researching what @ubergarm suggested and actually trying to fix the bug in ktransformers. Please advise what can be done. [EDIT]: Currently doing this:
It's running well on a single GPU but it's only 41k context. [EDIT2]: it seems that lots of people have trouble using flashinfer instead of flash attention. For example: https://github.com/turboderp-org/exllamav3
The same thing goes for ik_llama.cpp etc. -- the matrix absorption trick in flashinfer is not available in flash attention, hence for the full context in ik_llama.cpp it's required to have at least 48 GB VRAM, which is not ideal.
-
ik_llama.cpp
Last Updated: Tue May 13 03:52:20 PM EDT 2025 (still needs more updates, can't keep up, check through comments below)
NEW: Two new custom quants great for CPU+GPU or CPU-only inferencing, fitting 32k+ context in under 24GB VRAM, here on huggingface: ubergarm/DeepSeek-V3-0324-GGUF! Or start out with the quant you already have to kick the tires on ik_llama.cpp.
tl;dr;
ik_llama.cpp is a custom fork of llama.cpp introducing many interesting optimizations for MoEs like DeepSeek-R1 671B. The new SOTA quant types can repack your existing GGUFs on the fly, or you can roll your own to maximize quality and speed for your exact system VRAM and RAM availability.

I highly recommend you give ik_llama.cpp a try, especially for CUDA+CPU or pure CPU inferencing. It has all the very similar ergonomics as the vanilla llama-server that you already know and love.

Install
Features
Quick Start
Existing DeepSeek-R1 671B GGUF
Get 64k context with a single 24GB VRAM GPU using your existing unsloth quants like unsloth/DeepSeek-R1-UD-Q2_K_XL!
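The exact command depends on your hardware, but a minimal sketch of the kind of CPU+GPU invocation meant here looks roughly like the following; the model path, thread count, and context size are placeholders to adapt to your system.

```bash
# Sketch: attention, shared experts, and embeddings go to the GPU, routed
# experts stay on CPU, and -rtr repacks the CPU tensors on the fly (this
# disables mmap). Paths and counts are placeholders.
./build/bin/llama-server \
    --model /models/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf \
    --ctx-size 65536 \
    -ctk q8_0 \
    -mla 2 -fa \
    -amb 512 \
    -fmoe \
    -rtr \
    --n-gpu-layers 63 \
    --override-tensor exps=CPU \
    --parallel 1 \
    --threads 16 \
    --host 127.0.0.1 --port 8080
```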
Custom Quant
I rolled my own custom quant to improve quality while still fitting 32k context in under 24GB VRAM. No need to use -rtr as this quant is already repacked, so you can still use mmap(), allowing you to run on systems without enough RAM by paging the disk cache. This quant has lower perplexity than UD-Q2_K_XL while only being slightly larger/slower. Good size for 256GB RAM systems where Q4_K_M doesn't fit.

Custom Quants
👇
Click here for how to make your own custom quants including repacking
☝️
Benchmarking
Test Rig
mlc memory read bandwidth:

Linux TR24 6.13.0-061300-generic #202501302155 SMP PREEMPT_DYNAMIC Sat Feb 8 09:06:55 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
llama-bench
Note: ik_llama.cpp's llama-bench doesn't seem to iterate over all variables, so fix these manually for test cases:

- -fmoe 0,1
- -rtr 0,1
- -ot (probably; I didn't test this specifically as I'm always using exps=CPU for this rig)

It does seem to iterate over variables for fa, mla, and amb.
Perplexity
Even more perplexity logs
There is a lot going on here. There may be some issues with nan and "numerical instability" depending on exact quants and llama.cpp forks in use. So this is still evolving.

I made the above png graph using the first 35 chunks for easy comparison, as generally nan didn't appear too early for most quants.

I also haven't compared perplexity across ik_llama.cpp with different settings (e.g. mla etc.) vs vanilla llama.cpp, and CPU vs CUDA backends, etc.

The following exact detailed log results are not included yet in the graph above.
Q8_0
I ran the unsloth Q8_0 on that Intel 6980P CPU-only backend with vanilla llama.cpp/main@b1b132ef for a baseline. Note there is no MLA etc. yet in this case.

ubergarm Q2_K_R4
This is a custom quant I rolled with q8_0 for all attention/shared experts/embeddings loaded on GPU. The rest of the MoE down exps are q3_k_r4 and gate/up exps are q2_k_r4, which gives a fast quant that fits nicely into under 256GB RAM and 24GB VRAM with about 32k context without sacrificing much perplexity.

This was run on ik_llama.cpp@127c6ee6

ubergarm Q2_K_R4 with various -ser N,1
Testing the same quant and config as above but with -ser 4,1 etc. to get a feel for quality vs speed tradeoffs.

These were run on ik_llama.cpp@127c6ee6

ubergarm IQ2_BN_R4
This is an experimental quant I rolled with q8_0 for all attention/shared experts/embeddings loaded on GPU. The rest of the MoE down exps are iq2_xs_r4 and gate/up exps are iq2_bn_r4. However, perplexity looks pretty bad, so I'll likely aim for a larger model with higher quality quants and make up the speed/accuracy trade-off by exploring -ser instead of going with very small quants.

Looking back on it with advice from the team: bitnet quants are very fast to compute, but only good quality for models trained specifically as a ternary bit-net. So this is not the correct use-case.

This was run on ik_llama.cpp@127c6ee6

ubergarm IQ2_K_R4
Another experimental quant with q8_0 for all GPU layers (with room for 32k context still), and down=iq3_k_r4 and gate/up=iq2_k_r4 for the -ot exps=CPU CPU offload.

Debugging Crashes

Usually no need to do this, as any asserts will print the line number directly.
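If you ever do need more than the assert's line number, a debug build plus a gdb backtrace is the usual route; a generic sketch (binary name, model path, and flags are placeholders):

```bash
# Sketch: rebuild with debug info, then reproduce the crash under gdb and
# grab a backtrace.
cmake -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo -DGGML_CUDA=ON
cmake --build build -j "$(nproc)"

gdb --args ./build/bin/llama-server --model /models/your-model.gguf -mla 3 -fa
# inside gdb:
#   (gdb) run
#   (gdb) bt    # print the backtrace once it crashes
```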
TODO
References