q8_k_r8: experimental AVX512 version #610
Wow! 🚀 This little AMD 9950X can really rip with its "real" 512-bit instructions!!! The chart is getting too busy, but I left everything in to show how crazy it is to see faster PP on my 16-core gaming rig than on the 24-core Threadripper Pro! 😮🎉🥂 EDIT: the title is a bit misleading; that commit was used for the earlier tests. The actual commit used is shown in the legend in tiny, hard-to-read font. Thanks.

[Chart legend: Q8_K_R8 9950X 16x PR610 ik/q8_k_r8_avx512@c462c5bd]

The other data and info are over on #602.
OK, then, I'll create a way to select one of the two kernels at build time. Yes, the 9950X is really nice. I was tempted to upgrade when it came out, but in the end didn't because AMD didn't do anything for memory bandwidth.
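For illustration only, a build-time switch between the two kernels could look something like the sketch below; the macro name IQK_Q8_K_R8_AVX512 and the function names are assumptions made up for this example, not the project's actual build option or code.

#include <cstdio>

// Hypothetical compile-time kernel switch: building with
// -DIQK_Q8_K_R8_AVX512 (e.g. set from a CMake option) picks the 512-bit
// kernel, otherwise the 256-bit one is used. All names here are illustrative.
static void gemm_q8_k_r8_avx512() { std::puts("512-bit Q8_K_R8 kernel"); }
static void gemm_q8_k_r8_avx2  () { std::puts("256-bit Q8_K_R8 kernel"); }

static void gemm_q8_k_r8() {
#if defined(IQK_Q8_K_R8_AVX512)
    gemm_q8_k_r8_avx512();   // native 512-bit path (e.g. Zen5 / 9950X)
#else
    gemm_q8_k_r8_avx2();     // default 256-bit path (e.g. Zen4 / 7950X)
#endif
}

int main() { gemm_q8_k_r8(); }

The selection costs nothing at runtime, since the preprocessor fixes the path at compile time.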
Wish there were a way to use AVX-512 without the ML extensions. Or would it not provide any benefit over AVX2?
Do I need specific model quants to test it? I tried using anikifoss/Kimi-K2-Instruct-DQ4_K and bartowski/Qwen3-235-A22B-Q8_0 with …
If I understand correctly this only affects quants that use the q8_k_r8 path, so I don't think your Q8_0 would be affected, nor your q4_K/q6_K quants, which use different paths, as I tried to describe here in an older buried comment. I think this is the list of current quants that, if they are in your mix, might see a PP boost from this PR on a Zen5 CPU (a simplified sketch of what this dispatch means follows the list):

$ grep Q8_K_R8 ggml/src/iqk/iqk_mul_mat.cpp | grep type
case GGML_TYPE_IQ2_XXS: return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
case GGML_TYPE_IQ2_XS : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
case GGML_TYPE_IQ2_S : return nrc_y >= 16 ? GGML_TYPE_Q8_K_R8 : type;
case GGML_TYPE_IQ3_XXS: return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
case GGML_TYPE_IQ4_XS : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
case GGML_TYPE_IQ3_S : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
case GGML_TYPE_IQ1_S : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
case GGML_TYPE_IQ1_M : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
case GGML_TYPE_Q2_K : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
case GGML_TYPE_Q3_K : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
case GGML_TYPE_IQ2_KS : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
case GGML_TYPE_IQ2_K : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
case GGML_TYPE_IQ2_KL : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
case GGML_TYPE_IQ3_KS : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
case GGML_TYPE_IQ3_K : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
case GGML_TYPE_IQ4_KS : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
case GGML_TYPE_IQ4_K : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
case GGML_TYPE_IQ5_KS : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
case GGML_TYPE_IQ5_K : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
case GGML_TYPE_IQ6_K : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
case GGML_TYPE_Q2_K : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
case GGML_TYPE_Q3_K : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
case GGML_TYPE_IQ1_S : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
case GGML_TYPE_IQ1_M : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
case GGML_TYPE_IQ2_XXS: return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
case GGML_TYPE_IQ2_XS : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
case GGML_TYPE_IQ2_S : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
case GGML_TYPE_IQ3_XXS: return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
case GGML_TYPE_IQ3_S : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
case GGML_TYPE_IQ4_XS : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
case GGML_TYPE_IQ2_KS : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
case GGML_TYPE_IQ2_KL : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
case GGML_TYPE_IQ3_KS : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
case GGML_TYPE_IQ4_KS : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
case GGML_TYPE_IQ5_KS : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
case GGML_TYPE_IQ2_K : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
case GGML_TYPE_IQ3_K : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
case GGML_TYPE_IQ4_K : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
case GGML_TYPE_IQ5_K : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
case GGML_TYPE_IQ6_K : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;

I'm not sure how, but pretty sure your CPU should support it as it is Zen5; I have no idea about Windows compiling. On Linux I run … You could possibly give this quant a try as it is mostly quants from this list: https://huggingface.co/ubergarm/Qwen3-235B-A22B-Instruct-2507-GGUF#iq5_k-161722-gib-5909-bpw (I measured slightly better PPL than the larger DQ4_K, but I used an imatrix, so go with whatever you prefer).
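For illustration, the pattern in the grep output above boils down to a dispatch like the sketch below. The type names are abbreviated and the code is not the actual iqk_mul_mat.cpp; the reading that nrc_y grows with the number of tokens processed at once (so the switch only kicks in for large batches, i.e. prompt processing) is an assumption, not something stated in this thread.

// Simplified, hypothetical illustration only; not the real dispatch code.
enum class QType { Q2_K, IQ4_KS, IQ5_K, Q8_K_R8 };

// For the listed weight types, once enough activation rows are processed at
// once (nrc_y >= 32), the GEMM switches to the Q8_K_R8 path, which is where
// this PR's AVX512 kernel would apply.
static QType gemm_type(QType type, int nrc_y) {
    switch (type) {
        case QType::Q2_K:
        case QType::IQ4_KS:
        case QType::IQ5_K:
            return nrc_y >= 32 ? QType::Q8_K_R8 : type;
        default:
            return type;
    }
}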
Yes, pick one of the quantization types in the list provided by @ubergarm to see if it makes a difference on your Zen5 CPU.
Do not use …
@ubergarm This is specifically for your 9950X CPU.
On my 7950X this is ~10% slower than what we have on the main branch. The 7950X supports AVX512, but 512-bit instructions get executed as two 256-bit instructions. Hence, I'm expecting (hoping?) this Q8_K_R8 GEMM version to be significantly faster on a CPU with "real" 512-bit instructions such as the 9950X. Please benchmark it so I can decide if it is worth adding this to the main branch.
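For context only, the kind of work at stake is wide integer dot products like the minimal sketch below. This is an illustration, not the PR's actual Q8_K_R8 kernel: Zen4 splits such 512-bit operations into two 256-bit halves internally, while Zen5 desktop parts execute them at full width.

#include <immintrin.h>
#include <cstdint>

// Minimal AVX512-VNNI dot product of an unsigned-int8 row with a signed-int8
// row; illustration only. Assumes n is a multiple of 64 and the target
// supports AVX512F + AVX512_VNNI.
static int32_t dot_u8_i8(const uint8_t * a, const int8_t * b, int n) {
    __m512i acc = _mm512_setzero_si512();
    for (int i = 0; i < n; i += 64) {
        __m512i va = _mm512_loadu_si512(a + i);
        __m512i vb = _mm512_loadu_si512(b + i);
        acc = _mm512_dpbusd_epi32(acc, va, vb);  // 16 lanes of 4-way u8*i8 MACs
    }
    return _mm512_reduce_add_epi32(acc);
}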