mla matrix absorption #599
-
You may want to check #246, #260, #273. As far as I can tell, #246, which explains the basic idea of reducing the amount of multiply-adds when using MLA, precedes the linked doc by about a month and is surprisingly similar to what they wrote. #260 explains a further improvement, and #273 is the best MLA version in this repository. If you look at all merged PRs, you will see that it has been quite a journey to arrive at what we have today for doing fast DeepSeek inference.
-
A new model with MLA just dropped, only 1000B-A32B: https://huggingface.co/moonshotai/Kimi-K2-Instruct .... 😭 lol...
-
I just tried fp8_cast_bf16.py but got a VRAM OOM. I didn't think this would be a big challenge, but the first step is already getting tough. I will try with more VRAM, and perhaps try evshiron's llama.cpp too. Thanks a lot for the help; I'm just giving your recipes a try.
Hmm, this is what I was worried about and wanted to ask. Well, time to wake my Xeon box (it's too loud). BTW, isn't it possible to make an imatrix directly from BF16? Is making Q8_0 a must? Ha ha, it's a long way to go: FP8 -> BF16 -> Q8_0 -> imatrix -> Q2.
Edit: I'm trying evshiron's llama.cpp, which seems to have a direct conversion from FP8 to Q8_0.
Edit: Failed to get Q8_0. I don't know whether it needs 1T of RAM, but it doesn't seem to be a RAM problem (tried on 512 GB).
-
Thank you for the tip. Yeah, I have been tempted to overclock DDR4, and even DDR5, but I have to check whether my board allows it. Yes, RAM also needs cooling; my DDR5 gets hot when I use R1.
-
ATTN! Below is not a joke. It's an actual recent commit to flashinfer. Please pay attention:

    - return self.run_return_lse(q, paged_kv_cache, k_scale, v_scale)
    + return self.run_return_lse(q, paged_kv_cache, k_scale=k_scale, v_scale=v_scale)

Let's read the explanation:
MORE! The comments from the maintainer!!
I mean, this doesn't make sense. I am not really sure it's real.
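For context, a minimal Python sketch of why the keyword-argument change matters. The signature below is hypothetical, not flashinfer's actual one; it only illustrates how positional arguments can silently bind to the wrong parameters when optional parameters sit in between:

```python
# Hypothetical signature (NOT flashinfer's real one): optional parameters between
# the required arguments and k_scale/v_scale make positional calls bind wrongly.
def run_return_lse(q, paged_kv_cache, causal=True, sm_scale=None,
                   k_scale=None, v_scale=None):
    return {"causal": causal, "sm_scale": sm_scale,
            "k_scale": k_scale, "v_scale": v_scale}

# Old (positional) call: the scales end up in causal/sm_scale.
print(run_return_lse("q", "kv", 1.0, 1.0))
# -> {'causal': 1.0, 'sm_scale': 1.0, 'k_scale': None, 'v_scale': None}

# Fixed (keyword) call: the scales reach the intended parameters.
print(run_return_lse("q", "kv", k_scale=1.0, v_scale=1.0))
# -> {'causal': True, 'sm_scale': None, 'k_scale': 1.0, 'v_scale': 1.0}
```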
-
Regarding the prefill optimization for long context as implemented in ktransformers: I found some cool docs and will leave them here.
https://github.com/madsys-dev/deepseekv2-profile/blob/main/workspace/blog/optimizing-mla.md
DeepSeek R1's explanation:
The matrix absorption technique in DeepSeek-V2's MLA (Multi-head Latent Attention) mechanism is a clever mathematical optimization that avoids explicitly decompressing the compressed KV cache, significantly reducing computation and memory overhead. Here's a step-by-step explanation:
1. Core Problem
Traditional MLA implementations decompress the cached latent $c_t^{KV}$ back into full per-head keys $k_t^C$ and values $v_t$ (via $W^{UK}$ and $W^{UV}$) before running attention, which recreates exactly the large K/V tensors the compressed cache was meant to avoid.
2. Key Insight: Matrix Associativity
Matrix multiplication is associative: $A(Bx) = (AB)x$. Instead of decompressing KV, absorb the decompression matrices into adjacent operations, so attention works directly on the compressed latents.
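A tiny PyTorch sketch of that argument (shapes are illustrative, loosely matching one DeepSeek head; float64 only so the equality check is clean). Both groupings give the same scores, but one of them never builds the decompressed key matrix:

```python
import torch

# Illustrative shapes: per-head dim 128, latent dim 512, 4096 cached tokens.
W_UK = torch.randn(128, 512, dtype=torch.float64)   # key up-projection (one head)
c_KV = torch.randn(4096, 512, dtype=torch.float64)  # cached compressed latents
q    = torch.randn(128, dtype=torch.float64)        # non-RoPE query (one head)

# Grouping 1: decompress every cached key, then dot with the query.
scores_decompressed = (c_KV @ W_UK.T) @ q   # materializes a 4096x128 key matrix

# Grouping 2 (absorbed): fold W_UK into the query once, then dot with the latents.
scores_absorbed = c_KV @ (W_UK.T @ q)       # only a 512-vector intermediate

print(torch.allclose(scores_decompressed, scores_absorbed))  # True
```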
3. K-Absorption (for Attention Scores)
Original computation for the non-RoPE attention scores decompresses every cached key:
$$q^{C\top} k_t^C, \qquad k_t^C = W^{UK} c_t^{KV}$$
Absorbed version using associativity transforms the query once instead:
$$q^{C\top} k_t^C = q^{C\top} W^{UK} c_t^{KV} = \big((W^{UK})^{\top} q^C\big)^{\top} c_t^{KV}$$
Why this helps:
- Avoids materializing $k_t^C$ for every cached token; scores are taken directly against the cached latent $c_t^{KV}$.
- The extra multiply $(W^{UK})^{\top} q^C$ happens once per query token, not once per cached token.
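A minimal PyTorch sketch of the absorbed score path, with small illustrative shapes and my own einsum subscripts (not code from the linked blog; DeepSeek-V2 itself uses 128 heads, head dim 128, latent dim 512):

```python
import torch

n_heads, d_head, d_latent, seq = 8, 128, 512, 1024
W_UK   = torch.randn(n_heads, d_head, d_latent, dtype=torch.float64)  # per-head key up-projection
c_KV   = torch.randn(seq, d_latent, dtype=torch.float64)              # cached compressed latents
q_nope = torch.randn(n_heads, d_head, dtype=torch.float64)            # non-RoPE part of one query token

# Absorbed: pull the query into latent space once per head ...
q_latent = torch.einsum('hd,hdc->hc', q_nope, W_UK)   # (n_heads, d_latent)
# ... then score directly against the cached latents; k^C is never materialized.
scores = torch.einsum('hc,tc->ht', q_latent, c_KV)    # (n_heads, seq)

# Reference: explicit decompression k_t^C = W_UK c_t^KV (what absorption avoids).
k_c        = torch.einsum('hdc,tc->thd', W_UK, c_KV)
scores_ref = torch.einsum('hd,thd->ht', q_nope, k_c)
print(torch.allclose(scores, scores_ref))  # True
```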
4. V-Absorption (for Attention Output)
Original computation decompresses every cached value:
$$o = \sum_t a_t\, v_t, \qquad v_t = W^{UV} c_t^{KV}, \qquad u = W^O o$$
Absorbed version using Einstein summation applies $W^{UV}$ after the attention-weighted sum:
$$o' = \sum_t a_t\, c_t^{KV}, \qquad u = W^O \big(W^{UV} o'\big)$$
Why this helps:
- Avoids materializing $v_t$ (128× larger than $c_t^{KV}$); the weighted sum runs over the small latents, and $W^{UV}$ is applied once per query token instead of once per cached token.
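A matching sketch of the absorbed output path, again with illustrative shapes and my own einsum subscripts (the real model has 128 heads and model dim 5120):

```python
import torch

n_heads, d_head, d_latent, d_model, seq = 8, 128, 512, 1024, 1024
W_UV = torch.randn(n_heads, d_head, d_latent, dtype=torch.float64)  # per-head value up-projection
W_O  = torch.randn(d_model, n_heads, d_head, dtype=torch.float64)   # output projection
c_KV = torch.randn(seq, d_latent, dtype=torch.float64)              # cached compressed latents
attn = torch.softmax(torch.randn(n_heads, seq, dtype=torch.float64), dim=-1)

# Absorbed: weight-sum the small latents first, then apply W_UV and W_O per query token.
o_latent = torch.einsum('ht,tc->hc', attn, c_KV)       # (n_heads, d_latent)
o_head   = torch.einsum('hc,hdc->hd', o_latent, W_UV)  # (n_heads, d_head)
u        = torch.einsum('hd,Dhd->D', o_head, W_O)      # (d_model,)

# Reference: decompress v_t = W_UV c_t^KV for every cached token (what absorption avoids).
v     = torch.einsum('hdc,tc->thd', W_UV, c_KV)
u_ref = torch.einsum('hd,Dhd->D', torch.einsum('ht,thd->hd', attn, v), W_O)
print(torch.allclose(u, u_ref))  # True
```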
5. Move Elision Optimization
Additional problem: Original code concatenates RoPE/non-RoPE parts of Q/K, creating large temporary tensors.
Solution: Compute the attention scores in two parts and sum them: the RoPE part $q^{R\top} k_t^R$ plus the absorbed non-RoPE part $\big((W^{UK})^{\top} q^C\big)^{\top} c_t^{KV}$, so the concatenated Q/K tensors are never built.
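A short sketch of the two-part score computation, with hypothetical tensor names (RoPE head dim 64, as in DeepSeek-V2):

```python
import torch

n_heads, d_rope, d_latent, seq = 8, 64, 512, 1024
q_latent = torch.randn(n_heads, d_latent)  # non-RoPE query, already absorbed through W_UK
q_rope   = torch.randn(n_heads, d_rope)    # RoPE part of the query
c_KV     = torch.randn(seq, d_latent)      # cached compressed latents (shared by all heads)
k_rope   = torch.randn(seq, d_rope)        # cached RoPE keys (also shared by all heads in MLA)

# No concatenation of [q_nope; q_rope] and [k_nope; k_rope]: compute the two
# partial scores separately and add them.
scores = (torch.einsum('hc,tc->ht', q_latent, c_KV)
          + torch.einsum('hr,tr->ht', q_rope, k_rope))   # (n_heads, seq)
print(scores.shape)  # torch.Size([8, 1024])
```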
6. Why Not Pre-Absorb All Matrices?
Theoretically, you could precompute $W_{\text{new}}^{UQ} = (W^{UQ})^{\top} W^{UK}$ and $W_{\text{new}}^{O} = W^{O} W^{UV}$ once, offline.
But this is inefficient because:
- $W_{\text{new}}^{UQ}$ would be a large low-rank matrix (1536×512)
- $W_{\text{new}}^{O}$ would be massive (5120×512 per head)
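A quick back-of-the-envelope check (my own arithmetic, using the dimensions quoted above) of why the pre-absorbed matrices cost more memory traffic than keeping the factors separate:

```python
# Per head, with q_lora_rank 1536, kv_lora_rank 512, head dim 128, hidden size 5120.
W_UQ = 1536 * 128   # 196,608 parameters
W_UK = 512 * 128    #  65,536
W_UV = 512 * 128    #  65,536
W_O  = 128 * 5120   # 655,360

W_new_UQ = 1536 * 512   # 786,432   -- pre-absorbed (W_UQ)^T W_UK
W_new_O  = 5120 * 512   # 2,621,440 -- pre-absorbed W_O W_UV

print(W_new_UQ / (W_UQ + W_UK))  # 3.0  -> ~3x more weights to read per head
print(W_new_O / (W_O + W_UV))    # ~3.6 -> ~3.6x more weights to read per head
```

The product of two low-rank factors has more entries than the factors themselves once the outer dimensions exceed the rank, which is exactly the situation here.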
7. Performance Impact
Key wins: no decompressed K/V is ever materialized, the attention kernel reads only the compact latents from the cache, and the remaining work maps well onto GPU tensor cores.
Summary
Matrix absorption works by exploiting associativity to fold the K/V up-projection matrices ($W^{UK}$, $W^{UV}$) into the query transform and the output projection, so attention runs directly on the compressed latent $c_t^{KV}$ without ever reconstructing full keys and values.
This transforms MLA from a memory-bound problem into a compute-bound one, better utilizing modern GPU tensor cores while maintaining 98.6% KV cache compression.