mla matrix absorption #599
-
You may want to check #246, #260, #273. As far as I can tell, #246, which explains the basic idea of reducing the amount of multiply-adds when using MLA, precedes the linked doc by about a month and is surprisingly similar to what they wrote. #260 explains a further improvement, and #273 is the best MLA version in this repository. If you look at all merged PRs, you will see that it has been quite a journey to arrive at what we have today for doing fast DeepSeek inference.
-
A new model with MLA just dropped, only 1000B-A32B: https://huggingface.co/moonshotai/Kimi-K2-Instruct .... 😭 lol...
-
I just tried fp8_cast_bf16.py but got a VRAM OOM. I didn't think this would be a big challenge, but the first step is already getting tough. I will try with more VRAM, and perhaps try evshiron's llama.cpp too. Thanks a lot for the help; I'm just giving your recipes a try.
Hmm, this is what I was worried about and wanted to ask. Well, time to wake my Xeon box (it's too loud). BTW, isn't it possible to make an imatrix directly from BF16? Is making Q8_0 a must? Ha ha, it's a long way to go: FP8 -> BF16 -> Q8_0 -> imatrix -> Q2.
Edit: I'm trying evshiron's llama.cpp, which seems to have a direct conversion from FP8 to Q8_0.
Edit: Failed to get Q8_0. I don't know whether it needs 1T of RAM, but it doesn't seem to be a RAM problem (tried on 512 GB).
-
Thank you for the tip. Yeah, I have been tempted to overclock DDR4, and even DDR5, but I have to check whether my board allows it. Yes, RAM also needs cooling; my DDR5 gets hot when I use R1.
-
ATTN! Below is not a joke. It's an actual recent commit to flashinfer. Please pay attention:

    - return self.run_return_lse(q, paged_kv_cache, k_scale, v_scale)
    + return self.run_return_lse(q, paged_kv_cache, k_scale=k_scale, v_scale=v_scale)

Let's read the explanation:
MORE! The comments from the maintainer!!
I mean, this doesn't make sense. I am not really sure it's real.
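For context, a minimal Python sketch of why the keyword-argument change matters. The signature below is hypothetical, not flashinfer's actual one; it only illustrates how positional arguments can silently bind to the wrong parameters when optional parameters sit in between:

```python
# Hypothetical signature (NOT flashinfer's real one): optional parameters between
# the required arguments and k_scale/v_scale make positional calls bind wrongly.
def run_return_lse(q, paged_kv_cache, causal=True, sm_scale=None,
                   k_scale=None, v_scale=None):
    return {"causal": causal, "sm_scale": sm_scale,
            "k_scale": k_scale, "v_scale": v_scale}

# Old (positional) call: the scales end up in causal/sm_scale.
print(run_return_lse("q", "kv", 1.0, 1.0))
# -> {'causal': 1.0, 'sm_scale': 1.0, 'k_scale': None, 'v_scale': None}

# Fixed (keyword) call: the scales reach the intended parameters.
print(run_return_lse("q", "kv", k_scale=1.0, v_scale=1.0))
# -> {'causal': True, 'sm_scale': None, 'k_scale': 1.0, 'v_scale': 1.0}
```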
-
Regarding the prefill optimization for long context as implemented in ktransformers: I found some cool docs and will leave them here.
https://github.com/madsys-dev/deepseekv2-profile/blob/main/workspace/blog/optimizing-mla.md
DeepSeek R1's explanation:
The matrix absorption technique in DeepSeek-V2's MLA (Multi-head Latent Attention) mechanism is a clever mathematical optimization that avoids explicitly decompressing the compressed KV cache, significantly reducing computation and memory overhead. Here's a step-by-step explanation:
1. Core Problem
Traditional MLA implementations decompress the cached latent $c_t^{KV}$ back into full per-head keys $k_t^C$ and values $v_t$ (via $W^{UK}$ and $W^{UV}$) before running attention, which recreates exactly the large K/V tensors the compressed cache was meant to avoid.
2. Key Insight: Matrix Associativity
Matrix multiplication is associative: $A(Bx) = (AB)x$. Instead of decompressing KV, absorb the decompression matrices into adjacent operations, so attention works directly on the compressed latents.
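A tiny PyTorch sketch of that argument (shapes are illustrative, loosely matching one DeepSeek head; float64 only so the equality check is clean). Both groupings give the same scores, but one of them never builds the decompressed key matrix:

```python
import torch

# Illustrative shapes: per-head dim 128, latent dim 512, 4096 cached tokens.
W_UK = torch.randn(128, 512, dtype=torch.float64)   # key up-projection (one head)
c_KV = torch.randn(4096, 512, dtype=torch.float64)  # cached compressed latents
q    = torch.randn(128, dtype=torch.float64)        # non-RoPE query (one head)

# Grouping 1: decompress every cached key, then dot with the query.
scores_decompressed = (c_KV @ W_UK.T) @ q   # materializes a 4096x128 key matrix

# Grouping 2 (absorbed): fold W_UK into the query once, then dot with the latents.
scores_absorbed = c_KV @ (W_UK.T @ q)       # only a 512-vector intermediate

print(torch.allclose(scores_decompressed, scores_absorbed))  # True
```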
3. K-Absorption (for Attention Scores)
Original computation for the non-RoPE attention scores decompresses every cached key:
$$q^{C\top} k_t^C, \qquad k_t^C = W^{UK} c_t^{KV}$$
Absorbed version using associativity transforms the query once instead:
$$q^{C\top} k_t^C = q^{C\top} W^{UK} c_t^{KV} = \big((W^{UK})^{\top} q^C\big)^{\top} c_t^{KV}$$
Why this helps:
- Avoids materializing $k_t^C$ for every cached token; scores are taken directly against the cached latent $c_t^{KV}$.
- The extra multiply $(W^{UK})^{\top} q^C$ happens once per query token, not once per cached token.
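A minimal PyTorch sketch of the absorbed score path, with small illustrative shapes and my own einsum subscripts (not code from the linked blog; DeepSeek-V2 itself uses 128 heads, head dim 128, latent dim 512):

```python
import torch

n_heads, d_head, d_latent, seq = 8, 128, 512, 1024
W_UK   = torch.randn(n_heads, d_head, d_latent, dtype=torch.float64)  # per-head key up-projection
c_KV   = torch.randn(seq, d_latent, dtype=torch.float64)              # cached compressed latents
q_nope = torch.randn(n_heads, d_head, dtype=torch.float64)            # non-RoPE part of one query token

# Absorbed: pull the query into latent space once per head ...
q_latent = torch.einsum('hd,hdc->hc', q_nope, W_UK)   # (n_heads, d_latent)
# ... then score directly against the cached latents; k^C is never materialized.
scores = torch.einsum('hc,tc->ht', q_latent, c_KV)    # (n_heads, seq)

# Reference: explicit decompression k_t^C = W_UK c_t^KV (what absorption avoids).
k_c        = torch.einsum('hdc,tc->thd', W_UK, c_KV)
scores_ref = torch.einsum('hd,thd->ht', q_nope, k_c)
print(torch.allclose(scores, scores_ref))  # True
```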
4. V-Absorption (for Attention Output)
Original computation decompresses every cached value:
$$o = \sum_t a_t\, v_t, \qquad v_t = W^{UV} c_t^{KV}, \qquad u = W^O o$$
Absorbed version using Einstein summation applies $W^{UV}$ after the attention-weighted sum:
$$o' = \sum_t a_t\, c_t^{KV}, \qquad u = W^O \big(W^{UV} o'\big)$$
Why this helps:
- Avoids materializing $v_t$ (128× larger than $c_t^{KV}$); the weighted sum runs over the small latents, and $W^{UV}$ is applied once per query token instead of once per cached token.
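A matching sketch of the absorbed output path, again with illustrative shapes and my own einsum subscripts (the real model has 128 heads and model dim 5120):

```python
import torch

n_heads, d_head, d_latent, d_model, seq = 8, 128, 512, 1024, 1024
W_UV = torch.randn(n_heads, d_head, d_latent, dtype=torch.float64)  # per-head value up-projection
W_O  = torch.randn(d_model, n_heads, d_head, dtype=torch.float64)   # output projection
c_KV = torch.randn(seq, d_latent, dtype=torch.float64)              # cached compressed latents
attn = torch.softmax(torch.randn(n_heads, seq, dtype=torch.float64), dim=-1)

# Absorbed: weight-sum the small latents first, then apply W_UV and W_O per query token.
o_latent = torch.einsum('ht,tc->hc', attn, c_KV)       # (n_heads, d_latent)
o_head   = torch.einsum('hc,hdc->hd', o_latent, W_UV)  # (n_heads, d_head)
u        = torch.einsum('hd,Dhd->D', o_head, W_O)      # (d_model,)

# Reference: decompress v_t = W_UV c_t^KV for every cached token (what absorption avoids).
v     = torch.einsum('hdc,tc->thd', W_UV, c_KV)
u_ref = torch.einsum('hd,Dhd->D', torch.einsum('ht,thd->hd', attn, v), W_O)
print(torch.allclose(u, u_ref))  # True
```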
5. Move Elision Optimization
Additional problem: Original code concatenates RoPE/non-RoPE parts of Q/K, creating large temporary tensors.
Solution: Compute the attention scores in two parts and sum them: the RoPE part $q^{R\top} k_t^R$ plus the absorbed non-RoPE part $\big((W^{UK})^{\top} q^C\big)^{\top} c_t^{KV}$, so the concatenated Q/K tensors are never built.
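A short sketch of the two-part score computation, with hypothetical tensor names (RoPE head dim 64, as in DeepSeek-V2):

```python
import torch

n_heads, d_rope, d_latent, seq = 8, 64, 512, 1024
q_latent = torch.randn(n_heads, d_latent)  # non-RoPE query, already absorbed through W_UK
q_rope   = torch.randn(n_heads, d_rope)    # RoPE part of the query
c_KV     = torch.randn(seq, d_latent)      # cached compressed latents (shared by all heads)
k_rope   = torch.randn(seq, d_rope)        # cached RoPE keys (also shared by all heads in MLA)

# No concatenation of [q_nope; q_rope] and [k_nope; k_rope]: compute the two
# partial scores separately and add them.
scores = (torch.einsum('hc,tc->ht', q_latent, c_KV)
          + torch.einsum('hr,tr->ht', q_rope, k_rope))   # (n_heads, seq)
print(scores.shape)  # torch.Size([8, 1024])
```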
6. Why Not Pre-Absorb All Matrices?
Theoretically, you could precompute $W_{\text{new}}^{UQ} = (W^{UQ})^{\top} W^{UK}$ and $W_{\text{new}}^{O} = W^{O} W^{UV}$ once, offline.
But this is inefficient because:
- $W_{\text{new}}^{UQ}$ would be a large low-rank matrix (1536×512)
- $W_{\text{new}}^{O}$ would be massive (5120×512 per head)
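A quick back-of-the-envelope check (my own arithmetic, using the dimensions quoted above) of why the pre-absorbed matrices cost more memory traffic than keeping the factors separate:

```python
# Per head, with q_lora_rank 1536, kv_lora_rank 512, head dim 128, hidden size 5120.
W_UQ = 1536 * 128   # 196,608 parameters
W_UK = 512 * 128    #  65,536
W_UV = 512 * 128    #  65,536
W_O  = 128 * 5120   # 655,360

W_new_UQ = 1536 * 512   # 786,432   -- pre-absorbed (W_UQ)^T W_UK
W_new_O  = 5120 * 512   # 2,621,440 -- pre-absorbed W_O W_UV

print(W_new_UQ / (W_UQ + W_UK))  # 3.0  -> ~3x more weights to read per head
print(W_new_O / (W_O + W_UV))    # ~3.6 -> ~3.6x more weights to read per head
```

The product of two low-rank factors has more entries than the factors themselves once the outer dimensions exceed the rank, which is exactly the situation here.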
7. Performance Impact
Key wins: no decompressed K/V is ever materialized, the attention kernel reads only the compact latents from the cache, and the remaining work maps well onto GPU tensor cores.
Summary
Matrix absorption works by exploiting associativity to fold the K/V up-projection matrices ($W^{UK}$, $W^{UV}$) into the query transform and the output projection, so attention runs directly on the compressed latent $c_t^{KV}$ without ever reconstructing full keys and values.
This transforms MLA from a memory-bound problem into a compute-bound one, better utilizing modern GPU tensor cores while maintaining 98.6% KV cache compression.