
Various #181

Merged
ikawrakow merged 7 commits into main on Jan 29, 2025

Conversation

ikawrakow (Owner)

The PR started with me adding the -gp option to llama-bench, as per ggml-org/llama.cpp#11126, because I wanted to test TG performance after a long prompt so I could compare to the MLA attention implementation in ggml-org/llama.cpp#11446.
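
(For reference, and assuming the option behaves the same way as in the upstream PR, a run such as `./bin/llama-bench -m some-model.gguf -gp 4096,32` measures TG speed for 32 tokens generated after a 4096-token prompt; the model path and the 4096,32 values here are purely illustrative.)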

But then I noticed that the repacked Q8_0 and Q4_0 quants do not work for tensor row sizes that are not a multiple of 128 (4 x the block size of 32), which is the case for some of the tensors in Deepseek2-Lite that I used for testing, so I fixed that.
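
To illustrate what the row-size restriction means in practice, here is a minimal sketch (with made-up type and function names, not the actual repacked kernel) of a dot product whose main loop consumes 4 blocks of 32 values per iteration and therefore needs an explicit tail loop once the row length is not a multiple of 128:

```cpp
#include <cstdint>

constexpr int kBlockSize = 32;       // block of 32 quantized values, like QK8_0

struct BlockQ8 {                     // simplified stand-in for a q8_0-style block
    float  d;                        // per-block scale
    int8_t qs[kBlockSize];           // quantized values
};

// Dot product of two quantized rows of length n_per_row (a multiple of 32).
float dot_row(const BlockQ8 * x, const BlockQ8 * y, int n_per_row) {
    const int nblocks = n_per_row / kBlockSize;
    float sum = 0.f;
    int ib = 0;
    // Main loop: 4 blocks (128 values) per iteration, as a SIMD kernel might do.
    for (; ib + 4 <= nblocks; ib += 4) {
        for (int k = 0; k < 4; ++k) {
            int32_t isum = 0;
            for (int j = 0; j < kBlockSize; ++j) isum += x[ib+k].qs[j] * y[ib+k].qs[j];
            sum += x[ib+k].d * y[ib+k].d * isum;
        }
    }
    // Tail loop: handle the 1..3 leftover blocks when n_per_row % 128 != 0.
    for (; ib < nblocks; ++ib) {
        int32_t isum = 0;
        for (int j = 0; j < kBlockSize; ++j) isum += x[ib].qs[j] * y[ib].qs[j];
        sum += x[ib].d * y[ib].d * isum;
    }
    return sum;
}
```

The point is only the loop structure: without the second loop, the trailing one to three blocks of a row whose length is not a multiple of 128 would simply be skipped.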

And then I was comparing performance after the fix on Llama-3.2-1B, and noticed that FA with a Q8_0 K-cache does not work. Llama-3.2-1B has a head size of 64, and there was a comment in the code that Q8_0 does not work for head sizes less than 128, so I fixed that as well.
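
(Assuming the usual llama-bench flags, which this PR does not change, that configuration corresponds to something like `-fa 1 -ctk q8_0`, i.e. flash attention enabled with the K-cache quantized to Q8_0; any model with a head size of 64, such as Llama-3.2-1B, hits the affected code path.)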

ikawrakow merged commit 4a73c25 into main on Jan 29, 2025
ikawrakow pushed a commit that referenced this pull request Jan 29, 2025
ikawrakow added a commit that referenced this pull request Jan 30, 2025
* Slightly faster AVX2 implementation for q4_k_r4

* Even better AVX2 implementation for q4_k_r4

We now arrive at PP-512 = 328 t/s for LLaMA-3.1-8B on a
Ryzen-5975WX CPU, up from 291 t/s when I last measured
on 3c5f872.
With FA and Q8_0 K-cache we get to 339.5 t/s.

* Fix llama-bench labels that I broke with #181

* Faster AVX2 implementation for q5_k_r4

We arrive at 302 t/s for LLaMA-3.1-8B on a Ryzen-5975WX CPU,
up from 273 t/s.

* Use AVX2 implementation of q4_k_r4 and q5_k_r4 also on Zen4

After the changes I made to AVX2, it ends up being slightly faster
than what I had for Zen4.

* Minor tweak

* Cleanup

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>