This PR started with me adding the `-gp` option to `llama-bench` (as per ggml-org/llama.cpp#11126) because I wanted to test TG performance after a long prompt, to be able to compare against the MLA attention implementation in ggml-org/llama.cpp#11446.
But then I noticed that the repacked `Q8_0` and `Q4_0` quants do not work for row tensor sizes that are not a multiple of 128 (4 x block size of 32), which is the case for some of the tensors in Deepseek2-Lite that I used for testing, so I fixed that.
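
For illustration, here is a minimal sketch of the kind of loop structure involved. This is a hypothetical sketch, not the actual repacked kernels: `process_4_blocks`/`process_1_block` are placeholders, and only `QK8_0` and the block layout mirror the usual ggml definitions. The point is the main loop that consumes 4 blocks (128 values) per step plus a tail for the leftover blocks, which is what rows that are not a multiple of 128 need.

```c
// Hypothetical sketch, not the actual ik_llama.cpp code: process one row of
// Q8_0 blocks in groups of 4 (4 x 32 = 128 values) with a scalar tail for the
// leftover blocks. QK8_0 and block_q8_0 mirror the usual ggml layout;
// process_4_blocks/process_1_block stand in for the real repacked kernels.
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define QK8_0 32

typedef struct {
    uint16_t d;          // per-block scale (fp16 in ggml, opaque here)
    int8_t   qs[QK8_0];  // 32 quantized values
} block_q8_0;

static void process_4_blocks(const block_q8_0 *blocks) { (void)blocks; /* fast repacked path */ }
static void process_1_block (const block_q8_0 *block)  { (void)block;  /* generic fallback  */ }

// n_per_row is a multiple of QK8_0, but not necessarily of 4*QK8_0 = 128,
// which is the case for some Deepseek2-Lite tensors.
static void process_row(const block_q8_0 *row, int64_t n_per_row) {
    const int64_t nblock = n_per_row / QK8_0;
    int64_t ib = 0;
    for (; ib + 4 <= nblock; ib += 4) process_4_blocks(row + ib);  // 128 values per step
    for (; ib < nblock; ++ib)         process_1_block(row + ib);   // remaining 1..3 blocks
}

int main(void) {
    block_q8_0 row[6];                  // 6 blocks = 192 values, not a multiple of 128
    memset(row, 0, sizeof row);
    process_row(row, 6 * QK8_0);
    printf("processed %d values\n", 6 * QK8_0);
    return 0;
}
```

In a sketch like this, without the tail loop the last 1-3 blocks of such rows would either be skipped or read past the end of the row, depending on how the block count is rounded.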
And then, comparing performance after the fix on `Llama-3.2-1B`, I noticed that FA with a `Q8_0` K-cache does not work. `Llama-3.2-1B` has a head size of 64, and there was a comment in the code that `Q8_0` does not work for head sizes less than 128, so I fixed that as well.
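
As a rough illustration of why the head size matters (this is my reading of the constraint, not the actual FA kernel code): with a `Q8_0` K-cache, each K row holds `head_size` values, i.e. `head_size / 32` blocks, so a kernel that only walks K rows in steps of 4 blocks (128 values) handles a head size of 128 but has no way to cover the 2-block rows of a head size of 64.

```c
// Hypothetical illustration only (not the actual FA kernel): count how many
// Q8_0 blocks make up one K row for a few head sizes, and check whether a
// kernel that advances 4 blocks (128 values) per step can cover the row.
#include <stdio.h>

#define QK8_0 32   // values per Q8_0 block, as in ggml

int main(void) {
    const int head_sizes[] = {64, 96, 128, 256};
    for (size_t i = 0; i < sizeof head_sizes / sizeof head_sizes[0]; ++i) {
        const int hs     = head_sizes[i];
        const int nblock = hs / QK8_0;            // Q8_0 blocks per K row
        printf("head size %3d -> %d blocks per row, 128-value step %s\n",
               hs, nblock, nblock % 4 == 0 ? "fits" : "does not fit");
    }
    return 0;
}
```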