WIP Compute per layer LIM Scores during imatrix #326
Conversation
*WARNING*: This is mostly vibe code. Hope I'm not wasting y'all's time.

### Compute Layer Importance Modification (LIM) Scores

The goal of this PR is to rank the layers of a given tensor in order of sensitivity to quantization error. Given that it is now possible to use `llama-quantize --custom-q ...` regex, it may be possible to use these LIM Scores to decide which layers of a given tensor to quantize more or less, in an attempt to preserve generation quality (e.g. low perplexity) while reducing memory footprint compared to using the same quant size across all layers of a given tensor.

This experimental PR was motivated by this comment and PR: ggml-org/llama.cpp#12718 (comment) (EDIT: fixed link directly to comment)

I may force-push this after more testing and experimenting to see if it is actually doing the right thing and whether the output is actually useful for improving quantization quality, e.g. PPL per GiB... This may just be a big mistake, lol.

This is built on the existing imatrix computation and assumes that the values of `x[j]` are the "activations" coming right in/out of the given tensor layer. I don't know GGML and generally work in Python or vanilla C, not so much C++. So a lot of this was vibe coded while running the [ubergarm/DeepSeek-V3-0324-GGUF IQ4_K_R4 quant](https://huggingface.co/ubergarm/DeepSeek-V3-0324-GGUF/tree/main/DeepSeek-V3-0324-IQ4_K_R4). So this is partially an experiment in actually trying to use an LLM instead of just enjoying the meta of manual quantization min-maxing.

### TODO

- [ ] `Qwen/CodeQwen1.5-7B-Chat-GGUF` `q8_0`
- [ ] `ubergarm/DeepSeek-V3-0324-GGUF` `q8_0`
- [ ] `--custom-q` regex and compare PPL per GiB

### Reference

```
@misc{dumitru2024layerwisequantizationpragmaticeffective,
      title={Layer-Wise Quantization: A Pragmatic and Effective Method for Quantizing LLMs Beyond Integer Bit-Levels},
      author={Razvan-Gabriel Dumitru and Vikas Yadav and Rishabh Maheshwary and Paul-Ioan Clotan and Sathwik Tejaswi Madhusudhan and Mihai Surdeanu},
      year={2024},
      eprint={2406.17415},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.17415},
      code={https://github.com/RazvanDu/LayerwiseQuant/},
}
```

### Logs

- llama-imatrix run printing out what hopefully are actually LIM scores
- Raw LIM Scores for all tensors and layers of `DeepSeek-V3-0324` `q8_0` GGUF
- Normalized LIM Scores for all tensors and layers of `DeepSeek-V3-0324` `q8_0` GGUF
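For context, the LIM score in the cited paper is the negated cosine similarity between a layer's input and output activations, so a layer that changes its input more is ranked as more important. A minimal sketch of that formula (illustration only, not code from this PR):

```cpp
#include <cmath>
#include <cstddef>

// LIM score per Dumitru et al. (2024): negated cosine similarity between a
// layer's input activations and its output activations. The more the layer
// changes its input, the higher the score (i.e. the more "important" the layer).
static float lim_score(const float * in, const float * out, size_t n) {
    double dot = 0.0, norm_in = 0.0, norm_out = 0.0;
    for (size_t j = 0; j < n; ++j) {
        dot      += (double)in[j]  * (double)out[j];
        norm_in  += (double)in[j]  * (double)in[j];
        norm_out += (double)out[j] * (double)out[j];
    }
    const double denom = std::sqrt(norm_in) * std::sqrt(norm_out);
    return denom > 0.0 ? (float)(-dot / denom) : 0.0f;
}
```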
Do I understand the results in the quoted PR correctly? I didn't go read the blog post, but why would cosine similarity between the inputs of two subsequent layers measure layer importance?
```diff
@@ -198,6 +205,7 @@ bool IMatrixCollector::collect_imatrix(struct ggml_tensor * t, bool ask, void *
     for (int row = 0; row < (int)(src1->ne[1]*src1->ne[2]); ++row) {
         const float * x = data + row * src1->ne[0];
         for (int j = 0; j < (int)src1->ne[0]; ++j) {
+            e.activations[j] = x[j];
```
So, `activations` gets overwritten each time we get called with a new set of activations. It also gets overwritten as we go over the rows of the activation matrix. At the end of the run, the `compute_lim()` function gets called. Which means that we get the LIM computed with just the very last token processed in the imatrix run, not an actual statistical evaluation of cosine similarities between inputs to tensors of the same type in subsequent layers.
Correct. The rest of that PR thread, including the specific comment by @compilade, points out issues with that initial experiment and suggests it may be possible to implement the cosine similarity estimate of relative layer importance in `llama-imatrix`.
The paper that suggests using cosine similarity says:
I'll hack around some more to see if I can fix the implementation to possibly do a "running cosine similarity", given the naive first attempt is not properly doing a statistical evaluation across all the input tokens. The paper suggests another possible method of measuring relative layer sensitivity that I didn't try. Maybe one could calculate the "condition numbers" or "max stretch" for each layer's tensor and rank them, just wildly spit-balling beyond my pay grade xD... Really appreciate your time, thanks!
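A minimal sketch of what such a running cosine similarity accumulator could look like (assumed structure and naming, not the code in this PR or in PR#328): accumulate per-token cosine similarities between the activations seen by the same tensor type in two consecutive layers, then average at the end of the run instead of keeping only the last row.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical accumulator: mean cosine similarity over all tokens between the
// activations fed to the same tensor type in two consecutive layers.
struct RunningCosSim {
    double sum_cos  = 0.0;  // sum of per-token cosine similarities
    size_t n_tokens = 0;

    void add(const float * prev_layer_x, const float * curr_layer_x, int n_embd) {
        double dot = 0.0, n1 = 0.0, n2 = 0.0;
        for (int j = 0; j < n_embd; ++j) {
            dot += (double)prev_layer_x[j] * curr_layer_x[j];
            n1  += (double)prev_layer_x[j] * prev_layer_x[j];
            n2  += (double)curr_layer_x[j] * curr_layer_x[j];
        }
        const double denom = std::sqrt(n1) * std::sqrt(n2);
        if (denom > 0.0) {
            sum_cos += dot / denom;
            ++n_tokens;
        }
    }

    double mean() const { return n_tokens ? sum_cos / n_tokens : 0.0; }
};
```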
Sure. But the activations did not change due to that tensor only, they changed due to all tensors in the preceding layer. Or more precisely, activations changed due to the tensor we are considering, plus all tensors with their linear and non-linear operations that followed, before arriving at the same tensor type in the next layer. If the changes in the activations were trivially predictable, people wouldn't be doing complicated networks, and wouldn't be experimenting around with GELU's, RELU's, SILU's, variations of RoPE, different combinations of activation normalizations, and all that jazz. I can see looking at the activation change between whole layers to derive an estimate of how important the entire layer was, but claiming that the difference in activation input to a specific tensor type between two consecutive layers is a measure of how important this specific tensor type is? That's pushing it.
I agree with @ikawrakow, comparing across layers for a particular tensor seems like it would have non-intuitive results which might not necessarily be linked to the relative importance of the tensors. I think what is calculated here is the cosine similarity between the inputs of consecutive layers for each linear operation in the model(s). It's not particularly clear how this information can be used.
@ubergarm What I meant by this was to calculate LIM scores with the input and output within each linear operation (i.e. what ...)
Can you be more specific about how you want to calculate the impact of a linear operation from the input activations and the result of the linear operation? I have used this to derive corrections for a quantized model (have not published, it is in a private repository where I experiment with stuff). But I don't really see how one can derive tensor importance scores from that.
@ikawrakow I might not have thought this through properly. I was thinking of directly calculating a dot product between the input and output of each matmul (and normalizing) to get LIM scores by negating that, but this would only work for square matrices (where the input and output have the same shape).
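A minimal sketch of the shape constraint described above (hypothetical helper, not code from this PR): the negated, normalized dot product between a matmul's input and output is only defined when the two vectors have the same length, i.e. for square weight matrices.

```cpp
#include <cmath>
#include <optional>

// Returns -cos(input, output) for a single matmul, or nullopt when the
// operation is not square (input and output dimensions differ).
static std::optional<float> matmul_lim_score(const float * in,  int n_in,
                                             const float * out, int n_out) {
    if (n_in != n_out) {
        return std::nullopt;  // rectangular matmul: score not defined this way
    }
    double dot = 0.0, ni = 0.0, no = 0.0;
    for (int j = 0; j < n_in; ++j) {
        dot += (double)in[j]  * (double)out[j];
        ni  += (double)in[j]  * (double)in[j];
        no  += (double)out[j] * (double)out[j];
    }
    const double denom = std::sqrt(ni) * std::sqrt(no);
    if (denom == 0.0) {
        return std::nullopt;
    }
    return (float)(-dot / denom);
}
```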
Closing this in favor of the implementation in PR#328.

### Experiment

Still more experimentation to do, and sorry, no visual graphs as I'm away from my desk, but I did a quick A/B test comparing two quants. Finally, I provide the full procedure and logs below.

### tl;dr

Using PR#328 ... While it is within the noise, there may be room for further improvement by applying the scores to the attention tensors' quantization as well, which I didn't do for this experiment. In retrospect, I probably should have used the layer importance scores from ...

### Procedure

Compute the imatrix and layer similarity scores using `V3-0324` `q8_0`:

```
$ numactl -N 1 -m 1 \
./build/bin/llama-imatrix \
--verbosity 1 \
--layer-similarity \
-m /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-Q8_0.gguf \
-f calibration_data_v5_rc.txt \
-o /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-$(git rev-parse --short HEAD).dat \
--ctx-size 512 \
--numa numactl \
--threads 128
llama_model_loader: loaded meta data with 46 key-value pairs and 1147 tensors from /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = deepseek2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = DeepSeek V3 0324
llama_model_loader: - kv 3: general.version str = V3-0324
llama_model_loader: - kv 4: general.basename str = DeepSeek
llama_model_loader: - kv 5: general.size_label str = 256x21B
llama_model_loader: - kv 6: general.license str = mit
llama_model_loader: - kv 7: deepseek2.block_count u32 = 61
llama_model_loader: - kv 8: deepseek2.context_length u32 = 163840
llama_model_loader: - kv 9: deepseek2.embedding_length u32 = 7168
llama_model_loader: - kv 10: deepseek2.feed_forward_length u32 = 18432
llama_model_loader: - kv 11: deepseek2.attention.head_count u32 = 128
llama_model_loader: - kv 12: deepseek2.attention.head_count_kv u32 = 128
llama_model_loader: - kv 13: deepseek2.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 14: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 15: deepseek2.expert_used_count u32 = 8
llama_model_loader: - kv 16: general.file_type u32 = 7
llama_model_loader: - kv 17: deepseek2.leading_dense_block_count u32 = 3
llama_model_loader: - kv 18: deepseek2.vocab_size u32 = 129280
llama_model_loader: - kv 19: deepseek2.attention.q_lora_rank u32 = 1536
llama_model_loader: - kv 20: deepseek2.attention.kv_lora_rank u32 = 512
llama_model_loader: - kv 21: deepseek2.attention.key_length u32 = 192
llama_model_loader: - kv 22: deepseek2.attention.value_length u32 = 128
llama_model_loader: - kv 23: deepseek2.expert_feed_forward_length u32 = 2048
llama_model_loader: - kv 24: deepseek2.expert_count u32 = 256
llama_model_loader: - kv 25: deepseek2.expert_shared_count u32 = 1
llama_model_loader: - kv 26: deepseek2.expert_weights_scale f32 = 2.500000
llama_model_loader: - kv 27: deepseek2.expert_weights_norm bool = true
llama_model_loader: - kv 28: deepseek2.expert_gating_func u32 = 2
llama_model_loader: - kv 29: deepseek2.rope.dimension_count u32 = 64
llama_model_loader: - kv 30: deepseek2.rope.scaling.type str = yarn
llama_model_loader: - kv 31: deepseek2.rope.scaling.factor f32 = 40.000000
llama_model_loader: - kv 32: deepseek2.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 33: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
llama_model_loader: - kv 34: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 35: tokenizer.ggml.pre str = deepseek-v3
llama_model_loader: - kv 36: tokenizer.ggml.tokens arr[str,129280] = ["
llama_model_loader: - kv 37: tokenizer.ggml.token_type arr[i32,129280] = [3
llama_model_loader: - kv 38: tokenizer.ggml.merges arr[str,127741] = ["
llama_model_loader: - kv 39: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 40: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv 41: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 42: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 43: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 44: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 45: general.quantization_version u32 = 2
llama_model_loader: - type f32: 361 tensors
llama_model_loader: - type q8_0: 786 tensors
llm_load_vocab: special tokens cache size = 818
llm_load_vocab: token to piece cache size = 0.8223 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = deepseek2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 129280
llm_load_print_meta: n_merges = 127741
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 163840
llm_load_print_meta: n_embd = 7168
llm_load_print_meta: n_layer = 61
llm_load_print_meta: n_head = 128
llm_load_print_meta: n_head_kv = 128
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 192
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 24576
llm_load_print_meta: n_embd_v_gqa = 16384
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 18432
llm_load_print_meta: n_expert = 256
llm_load_print_meta: n_expert_used = 8
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = yarn
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 0.025
llm_load_print_meta: n_ctx_orig_yarn = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 671B
llm_load_print_meta: model ftype = Q8_0
llm_load_print_meta: model params = 672.050 B
llm_load_print_meta: model size = 665.308 GiB (8.504 BPW)
llm_load_print_meta: repeating layers = 663.474 GiB (8.504 BPW, 670.196 B parameters)
llm_load_print_meta: general.name = DeepSeek V3 0324
llm_load_print_meta: BOS token = 0 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token = 131 'Ä'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_layer_dense_lead = 3
llm_load_print_meta: n_lora_q = 1536
llm_load_print_meta: n_lora_kv = 512
llm_load_print_meta: n_ff_exp = 2048
llm_load_print_meta: n_expert_shared = 1
llm_load_print_meta: expert_weights_scale = 2.5
llm_load_print_meta: expert_weights_norm = 1
llm_load_print_meta: expert_gating_func = sigmoid
llm_load_print_meta: rope_yarn_log_mul = 0.1000
llm_load_tensors: ggml ctx size = 0.47 MiB
llm_load_tensors: CPU buffer size = 681274.97 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe = 0
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init: CPU KV buffer size = 2440.00 MiB
llama_new_context_with_model: KV self size = 2440.00 MiB, K (f16): 1464.00 MiB, V (f16): 976.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: CPU compute buffer size = 283.01 MiB
llama_new_context_with_model: graph nodes = 3724
llama_new_context_with_model: graph splits = 1
system_info: n_threads = 128 / 512 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
compute_imatrix: tokenizing the input ..
compute_imatrix: tokenization took 309.837 ms
compute_imatrix: computing over 213 chunks with batch_size 512
compute_imatrix: 37.90 seconds per pass - ETA 2 hours 14.55 minutes
[1]60.9619,[2]10.7701,[3]5.8724,[4]3.7883,[5]2.9691,[6]2.5089,[7]2.2199,[8]2.0199,[9]1.9095,
save_imatrix: entry ' blk.60.ffn_down_exps.weight' has partial data (99.61%) 1 out of 256 experts are missing data Storing **but be aware**
save_imatrix: entry ' blk.60.ffn_gate_exps.weight' has partial data (99.61%) 1 out of 256 experts are missing data Storing **but be aware**
save_imatrix: entry ' blk.60.ffn_up_exps.weight' has partial data (99.61%) 1 out of 256 experts are missing data Storing **but be aware**
save_imatrix: entry ' blk.25.ffn_down_exps.weight' has partial data (99.61%) 1 out of 256 experts are missing data Storing **but be aware**
save_imatrix: entry ' blk.26.ffn_down_exps.weight' has partial data (99.61%) 1 out of 256 experts are missing data Storing **but be aware**
save_imatrix: entry ' blk.25.ffn_up_exps.weight' has partial data (99.61%) 1 out of 256 experts are missing data Storing **but be aware**
save_imatrix: entry ' blk.25.ffn_gate_exps.weight' has partial data (99.61%) 1 out of 256 experts are missing data Storing **but be aware**
save_imatrix: entry ' blk.26.ffn_gate_exps.weight' has partial data (99.61%) 1 out of 256 experts are missing data Storing **but be aware**
save_imatrix: entry ' blk.26.ffn_up_exps.weight' has partial data (99.61%) 1 out of 256 experts are missing data Storing **but be aware**
save_imatrix: stored collected data after 10 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-f7c5a94e.dat
[10]1.8219,[11]2.0296,[12]2.0839,[13]2.0978,[14]2.1403,[15]2.0365,[16]1.9492,[17]1.8786,[18]1.8160,[19]1.7743,
save_imatrix: stored collected data after 20 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-f7c5a94e.dat
[20]1.7315,[21]1.6986,[22]1.6609,[23]1.6319,[24]1.6201,[25]1.6080,[26]1.5822,[27]1.6812,[28]1.7547,[29]1.8204,
save_imatrix: stored collected data after 30 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-f7c5a94e.dat
[30]1.8188,[31]1.8323,[32]1.8317,[33]1.8091,[34]1.8457,[35]1.8217,[36]1.8215,[37]1.8106,[38]1.8208,[39]1.8070,
save_imatrix: stored collected data after 40 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-f7c5a94e.dat
[40]1.7838,[41]1.7606,[42]1.7410,[43]1.7291,[44]1.7157,[45]1.7023,[46]1.6981,[47]1.6919,[48]1.6811,[49]1.6707,
save_imatrix: stored collected data after 50 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-f7c5a94e.dat
[50]1.6650,[51]1.6623,[52]1.6625,[53]1.6672,[54]1.6812,[55]1.6781,[56]1.6683,[57]1.6764,[58]1.6796,[59]1.6906,
save_imatrix: stored collected data after 60 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-f7c5a94e.dat
[60]1.6855,[61]1.7243,[62]1.7565,[63]1.7884,[64]1.8197,[65]1.8677,[66]1.8802,[67]1.9148,[68]1.9442,[69]1.9996,
save_imatrix: stored collected data after 70 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-f7c5a94e.dat
[70]2.0525,[71]2.0832,[72]2.1136,[73]2.1258,[74]2.1407,[75]2.1702,[76]2.2011,[77]2.2185,[78]2.2164,[79]2.2313,
save_imatrix: stored collected data after 80 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-f7c5a94e.dat
[80]2.2543,[81]2.2904,[82]2.3238,[83]2.3342,[84]2.3650,[85]2.3733,[86]2.3730,[87]2.4024,[88]2.4344,[89]2.4899,
save_imatrix: stored collected data after 90 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-f7c5a94e.dat
[90]2.5102,[91]2.5125,[92]2.5192,[93]2.5349,[94]2.5452,[95]2.5779,[96]2.5670,[97]2.6058,[98]2.6319,[99]2.6214,
save_imatrix: stored collected data after 100 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-f7c5a94e.dat
[100]2.6537,[101]2.7008,[102]2.7326,[103]2.7740,[104]2.8020,[105]2.8310,[106]2.8682,[107]2.8605,[108]2.8789,[109]2.8849,
save_imatrix: stored collected data after 110 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-f7c5a94e.dat
[110]2.8910,[111]2.8878,[112]2.9177,[113]2.9435,[114]2.9520,[115]2.9363,[116]2.9104,[117]2.9044,[118]2.9147,[119]2.9003,
save_imatrix: stored collected data after 120 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-f7c5a94e.dat
[120]2.8773,[121]2.8737,[122]2.8738,[123]2.8819,[124]2.8872,[125]2.8942,[126]2.9018,[127]2.9043,[128]2.9343,[129]2.9484,
save_imatrix: stored collected data after 130 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-f7c5a94e.dat
[130]2.9241,[131]2.9003,[132]2.8771,[133]2.8544,[134]2.8563,[135]2.8567,[136]2.8828,[137]2.9150,[138]2.9340,[139]2.9389,
save_imatrix: stored collected data after 140 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-f7c5a94e.dat
[140]2.9637,[141]2.9866,[142]3.0151,[143]3.0354,[144]3.0569,[145]3.0766,[146]3.0972,[147]3.1154,[148]3.1266,[149]3.1351,
save_imatrix: stored collected data after 150 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-f7c5a94e.dat
[150]3.1395,[151]3.1572,[152]3.1761,[153]3.1759,[154]3.1834,[155]3.1945,[156]3.2035,[157]3.2148,[158]3.2209,[159]3.2300,
save_imatrix: stored collected data after 160 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-f7c5a94e.dat
[160]3.2442,[161]3.2498,[162]3.2525,[163]3.2595,[164]3.2704,[165]3.2724,[166]3.2737,[167]3.2912,[168]3.3010,[169]3.3082,
save_imatrix: stored collected data after 170 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-f7c5a94e.dat
[170]3.3258,[171]3.3403,[172]3.3354,[173]3.3417,[174]3.3424,[175]3.3575,[176]3.3691,[177]3.3818,[178]3.3768,[179]3.3734,
save_imatrix: stored collected data after 180 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-f7c5a94e.dat
[180]3.3682,[181]3.3635,[182]3.3578,[183]3.3531,[184]3.3472,[185]3.3600,[186]3.3887,[187]3.4121,[188]3.4336,[189]3.4550,
save_imatrix: stored collected data after 190 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-f7c5a94e.dat
[190]3.4850,[191]3.4990,[192]3.5134,[193]3.5036,[194]3.5210,[195]3.5145,[196]3.4953,[197]3.4747,[198]3.4946,[199]3.5110,
save_imatrix: stored collected data after 200 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-f7c5a94e.dat
[200]3.5207,[201]3.5290,[202]3.5447,[203]3.5621,[204]3.5748,[205]3.5874,[206]3.6021,[207]3.5989,[208]3.5771,[209]3.5556,
save_imatrix: stored collected data after 210 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-f7c5a94e.dat
[210]3.5342,[211]3.5134,[212]3.4930,[213]3.4727,
save_imatrix: stored collected data after 213 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-f7c5a94e.dat
Final estimate: PPL = 3.4727 +/- 0.03300
llama_print_timings: load time = 38826.79 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 7699212.14 ms / 109056 tokens ( 70.60 ms per token, 14.16 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 7777812.63 ms / 109057 tokens
======================== sorted layer importances
0: Layer 0, <cos_sim> = 0.517453
1: Layer 60, <cos_sim> = 0.59436
2: Layer 8, <cos_sim> = 0.857555
3: Layer 3, <cos_sim> = 0.858137
4: Layer 1, <cos_sim> = 0.869657
5: Layer 59, <cos_sim> = 0.875667
6: Layer 57, <cos_sim> = 0.888417
7: Layer 5, <cos_sim> = 0.906457
8: Layer 58, <cos_sim> = 0.911674
9: Layer 7, <cos_sim> = 0.921961
10: Layer 53, <cos_sim> = 0.926514
11: Layer 22, <cos_sim> = 0.932632
12: Layer 17, <cos_sim> = 0.936935
13: Layer 24, <cos_sim> = 0.93742
14: Layer 23, <cos_sim> = 0.939419
15: Layer 4, <cos_sim> = 0.941044
16: Layer 15, <cos_sim> = 0.945621
17: Layer 25, <cos_sim> = 0.94563
18: Layer 6, <cos_sim> = 0.946055
# NOTE: I prioritized the above 17 routed expert layers [3-60] for more bpw quantization (the first layers 0-2 are dense)
19: Layer 21, <cos_sim> = 0.946446
20: Layer 16, <cos_sim> = 0.947423
21: Layer 27, <cos_sim> = 0.947699
22: Layer 18, <cos_sim> = 0.948201
23: Layer 10, <cos_sim> = 0.949096
24: Layer 54, <cos_sim> = 0.949141
25: Layer 2, <cos_sim> = 0.949452
26: Layer 20, <cos_sim> = 0.949668
27: Layer 30, <cos_sim> = 0.949811
28: Layer 26, <cos_sim> = 0.951796
29: Layer 13, <cos_sim> = 0.951903
30: Layer 14, <cos_sim> = 0.952166
31: Layer 9, <cos_sim> = 0.952194
32: Layer 44, <cos_sim> = 0.952973
33: Layer 35, <cos_sim> = 0.953037
34: Layer 45, <cos_sim> = 0.953128
35: Layer 29, <cos_sim> = 0.954667
36: Layer 28, <cos_sim> = 0.954742
37: Layer 31, <cos_sim> = 0.954809
38: Layer 56, <cos_sim> = 0.955925
39: Layer 43, <cos_sim> = 0.956722
40: Layer 50, <cos_sim> = 0.958269
41: Layer 19, <cos_sim> = 0.959386
42: Layer 33, <cos_sim> = 0.95975
43: Layer 32, <cos_sim> = 0.960649
44: Layer 55, <cos_sim> = 0.960837
45: Layer 11, <cos_sim> = 0.961299
46: Layer 34, <cos_sim> = 0.961852
47: Layer 12, <cos_sim> = 0.962011
48: Layer 46, <cos_sim> = 0.962943
49: Layer 49, <cos_sim> = 0.965045
50: Layer 39, <cos_sim> = 0.96526
51: Layer 40, <cos_sim> = 0.96575
52: Layer 37, <cos_sim> = 0.967049
53: Layer 36, <cos_sim> = 0.96716
54: Layer 52, <cos_sim> = 0.967574
55: Layer 38, <cos_sim> = 0.968262
56: Layer 41, <cos_sim> = 0.968457
57: Layer 48, <cos_sim> = 0.968755
58: Layer 51, <cos_sim> = 0.968768
59: Layer 47, <cos_sim> = 0.968788
60: Layer 42, <cos_sim> = 0.971662
======================== sorted attention importances
0: Layer 0, <cos_sim> = 0.13174
1: Layer 8, <cos_sim> = 0.516951
2: Layer 11, <cos_sim> = 0.61188
3: Layer 10, <cos_sim> = 0.612091
4: Layer 12, <cos_sim> = 0.612348
5: Layer 18, <cos_sim> = 0.616718
6: Layer 16, <cos_sim> = 0.61912
7: Layer 9, <cos_sim> = 0.655522
8: Layer 13, <cos_sim> = 0.665296
9: Layer 22, <cos_sim> = 0.672061
10: Layer 6, <cos_sim> = 0.699289
11: Layer 19, <cos_sim> = 0.700966
12: Layer 20, <cos_sim> = 0.704575
13: Layer 7, <cos_sim> = 0.71001
14: Layer 14, <cos_sim> = 0.725971
15: Layer 23, <cos_sim> = 0.740926
16: Layer 25, <cos_sim> = 0.747222
17: Layer 17, <cos_sim> = 0.749419
18: Layer 15, <cos_sim> = 0.754558
19: Layer 21, <cos_sim> = 0.761675
20: Layer 24, <cos_sim> = 0.761882
21: Layer 5, <cos_sim> = 0.766086
22: Layer 2, <cos_sim> = 0.767046
23: Layer 30, <cos_sim> = 0.772412
24: Layer 1, <cos_sim> = 0.772533
25: Layer 44, <cos_sim> = 0.777696
26: Layer 29, <cos_sim> = 0.779458
27: Layer 28, <cos_sim> = 0.779721
28: Layer 37, <cos_sim> = 0.780809
29: Layer 26, <cos_sim> = 0.781589
30: Layer 4, <cos_sim> = 0.786884
31: Layer 34, <cos_sim> = 0.787128
32: Layer 36, <cos_sim> = 0.78846
33: Layer 27, <cos_sim> = 0.791454
34: Layer 31, <cos_sim> = 0.805225
35: Layer 33, <cos_sim> = 0.806554
36: Layer 57, <cos_sim> = 0.809911
37: Layer 32, <cos_sim> = 0.811714
38: Layer 38, <cos_sim> = 0.81192
39: Layer 35, <cos_sim> = 0.816966
40: Layer 41, <cos_sim> = 0.820029
41: Layer 40, <cos_sim> = 0.833644
42: Layer 3, <cos_sim> = 0.83367
43: Layer 39, <cos_sim> = 0.835849
44: Layer 42, <cos_sim> = 0.841079
45: Layer 60, <cos_sim> = 0.853526
46: Layer 45, <cos_sim> = 0.857364
47: Layer 56, <cos_sim> = 0.859897
48: Layer 59, <cos_sim> = 0.861441
49: Layer 53, <cos_sim> = 0.864087
50: Layer 46, <cos_sim> = 0.864727
51: Layer 43, <cos_sim> = 0.864848
52: Layer 51, <cos_sim> = 0.872346
53: Layer 48, <cos_sim> = 0.87434
54: Layer 52, <cos_sim> = 0.874649
55: Layer 47, <cos_sim> = 0.878183
56: Layer 58, <cos_sim> = 0.879985
57: Layer 49, <cos_sim> = 0.880846
58: Layer 55, <cos_sim> = 0.885206
59: Layer 50, <cos_sim> = 0.897436
60: Layer 54, <cos_sim> = 0.921917
======================== sorted ffn importances
0: Layer 7, <cos_sim> = 0.571293
1: Layer 10, <cos_sim> = 0.590428
2: Layer 11, <cos_sim> = 0.591834
3: Layer 17, <cos_sim> = 0.608386
4: Layer 15, <cos_sim> = 0.620593
5: Layer 0, <cos_sim> = 0.632572
6: Layer 9, <cos_sim> = 0.643826
7: Layer 12, <cos_sim> = 0.64739
8: Layer 8, <cos_sim> = 0.649753
9: Layer 21, <cos_sim> = 0.67168
10: Layer 18, <cos_sim> = 0.679443
11: Layer 19, <cos_sim> = 0.701283
12: Layer 60, <cos_sim> = 0.701407
13: Layer 13, <cos_sim> = 0.712941
14: Layer 16, <cos_sim> = 0.722858
15: Layer 24, <cos_sim> = 0.725591
16: Layer 14, <cos_sim> = 0.727539
17: Layer 22, <cos_sim> = 0.728219
18: Layer 20, <cos_sim> = 0.736531
19: Layer 6, <cos_sim> = 0.744335
20: Layer 23, <cos_sim> = 0.749712
21: Layer 29, <cos_sim> = 0.757133
22: Layer 25, <cos_sim> = 0.758496
23: Layer 5, <cos_sim> = 0.759015
24: Layer 27, <cos_sim> = 0.759242
25: Layer 28, <cos_sim> = 0.76237
26: Layer 43, <cos_sim> = 0.764705
27: Layer 36, <cos_sim> = 0.766839
28: Layer 35, <cos_sim> = 0.773264
29: Layer 26, <cos_sim> = 0.775702
30: Layer 33, <cos_sim> = 0.778872
31: Layer 32, <cos_sim> = 0.790364
32: Layer 3, <cos_sim> = 0.790503
33: Layer 30, <cos_sim> = 0.792984
34: Layer 31, <cos_sim> = 0.79496
35: Layer 37, <cos_sim> = 0.795521
36: Layer 34, <cos_sim> = 0.796573
37: Layer 56, <cos_sim> = 0.804781
38: Layer 40, <cos_sim> = 0.806738
39: Layer 59, <cos_sim> = 0.808235
40: Layer 4, <cos_sim> = 0.809825
41: Layer 1, <cos_sim> = 0.819665
42: Layer 38, <cos_sim> = 0.820409
43: Layer 39, <cos_sim> = 0.820894
44: Layer 41, <cos_sim> = 0.824874
45: Layer 44, <cos_sim> = 0.846473
46: Layer 52, <cos_sim> = 0.849335
47: Layer 42, <cos_sim> = 0.850524
48: Layer 45, <cos_sim> = 0.851349
49: Layer 55, <cos_sim> = 0.852943
50: Layer 47, <cos_sim> = 0.85862
51: Layer 50, <cos_sim> = 0.858953
52: Layer 51, <cos_sim> = 0.861418
53: Layer 58, <cos_sim> = 0.861473
54: Layer 2, <cos_sim> = 0.862156
55: Layer 57, <cos_sim> = 0.86361
56: Layer 46, <cos_sim> = 0.864787
57: Layer 48, <cos_sim> = 0.867249
58: Layer 54, <cos_sim> = 0.876651
59: Layer 49, <cos_sim> = 0.883354
60: Layer 53, <cos_sim> = 0.90793
```
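For illustration only, a hedged sketch (hypothetical helper, not part of this PR or PR#328) of how the sorted `<cos_sim>` list above could be consumed: pick the layers with the lowest cosine similarity (the "most important" ones under this heuristic) as candidates for a higher-bpw quant type.

```cpp
#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>

// Given (layer, <cos_sim>) pairs from the imatrix run, return the n_keep layer
// indices with the lowest cosine similarity.
static std::vector<int> most_important_layers(std::vector<std::pair<int, float>> scores, size_t n_keep) {
    std::sort(scores.begin(), scores.end(),
              [](const auto & a, const auto & b) { return a.second < b.second; });
    std::vector<int> layers;
    for (size_t i = 0; i < scores.size() && i < n_keep; ++i) {
        layers.push_back(scores[i].first);
    }
    return layers;
}

int main() {
    // a few entries copied from the "sorted layer importances" output above
    std::vector<std::pair<int, float>> scores = {
        {0, 0.517453f}, {60, 0.594360f}, {8, 0.857555f}, {3, 0.858137f}, {42, 0.971662f},
    };
    for (int layer : most_important_layers(scores, 3)) {
        // these blk indices would then be matched by hand-written --custom-q regexes
        std::printf("bump blk.%d to a higher-bpw quant\n", layer);
    }
    return 0;
}
```

The exact `--custom-q` recipe is left out here; the point is only that the sorted ranking printed by the tool is easy to consume programmatically.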