This PR started with me adding the `-gp` option to `llama-bench` (as per ggml-org/llama.cpp#11126) because I wanted to test TG performance after a long prompt, to be able to compare against the MLA attention implementation in ggml-org/llama.cpp#11446.
But then I noticed that the repacked `Q8_0` and `Q4_0` quants do not work for row tensor sizes that are not a multiple of 128 (4 x block size of 32), which is the case for some of the tensors in Deepseek2-Lite that I used for testing, so I fixed that.
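
For illustration, here is a minimal sketch of the kind of loop structure involved. This is a hypothetical sketch, not the actual repacked kernels: `process_4_blocks`/`process_1_block` are placeholders, and only `QK8_0` and the block layout mirror the usual ggml definitions. The point is the main loop that consumes 4 blocks (128 values) per step plus a tail for the leftover blocks, which is what rows that are not a multiple of 128 need.

```c
// Hypothetical sketch, not the actual ik_llama.cpp code: process one row of
// Q8_0 blocks in groups of 4 (4 x 32 = 128 values) with a scalar tail for the
// leftover blocks. QK8_0 and block_q8_0 mirror the usual ggml layout;
// process_4_blocks/process_1_block stand in for the real repacked kernels.
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define QK8_0 32

typedef struct {
    uint16_t d;          // per-block scale (fp16 in ggml, opaque here)
    int8_t   qs[QK8_0];  // 32 quantized values
} block_q8_0;

static void process_4_blocks(const block_q8_0 *blocks) { (void)blocks; /* fast repacked path */ }
static void process_1_block (const block_q8_0 *block)  { (void)block;  /* generic fallback  */ }

// n_per_row is a multiple of QK8_0, but not necessarily of 4*QK8_0 = 128,
// which is the case for some Deepseek2-Lite tensors.
static void process_row(const block_q8_0 *row, int64_t n_per_row) {
    const int64_t nblock = n_per_row / QK8_0;
    int64_t ib = 0;
    for (; ib + 4 <= nblock; ib += 4) process_4_blocks(row + ib);  // 128 values per step
    for (; ib < nblock; ++ib)         process_1_block(row + ib);   // remaining 1..3 blocks
}

int main(void) {
    block_q8_0 row[6];                  // 6 blocks = 192 values, not a multiple of 128
    memset(row, 0, sizeof row);
    process_row(row, 6 * QK8_0);
    printf("processed %d values\n", 6 * QK8_0);
    return 0;
}
```

In a sketch like this, without the tail loop the last 1-3 blocks of such rows would either be skipped or read past the end of the row, depending on how the block count is rounded.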
And then, comparing performance after the fix on `Llama-3.2-1B`, I noticed that FA with a `Q8_0` K-cache does not work. `Llama-3.2-1B` has a head size of 64, and there was a comment in the code that `Q8_0` does not work for head sizes less than 128, so I fixed that as well.
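
As a rough illustration of why the head size matters (this is my reading of the constraint, not the actual FA kernel code): with a `Q8_0` K-cache, each K row holds `head_size` values, i.e. `head_size / 32` blocks, so a kernel that only walks K rows in steps of 4 blocks (128 values) handles a head size of 128 but has no way to cover the 2-block rows of a head size of 64.

```c
// Hypothetical illustration only (not the actual FA kernel): count how many
// Q8_0 blocks make up one K row for a few head sizes, and check whether a
// kernel that advances 4 blocks (128 values) per step can cover the row.
#include <stdio.h>

#define QK8_0 32   // values per Q8_0 block, as in ggml

int main(void) {
    const int head_sizes[] = {64, 96, 128, 256};
    for (size_t i = 0; i < sizeof head_sizes / sizeof head_sizes[0]; ++i) {
        const int hs     = head_sizes[i];
        const int nblock = hs / QK8_0;            // Q8_0 blocks per K row
        printf("head size %3d -> %d blocks per row, 128-value step %s\n",
               hs, nblock, nblock % 4 == 0 ? "fits" : "does not fit");
    }
    return 0;
}
```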