-
Notifications
You must be signed in to change notification settings - Fork 1.3k
ggml-cpu : Add GGML_CPU_FFAST_MATH for sine autovectorization #1243
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
7c3223b
to
e3783fd
Compare
Does this have any advantages over AVX intrinsics such as |
_mm256_sin_ps is not available on any compiler besides Intel |
That's unfortunate. It looks like it is also available on MSVC, but that still would not be enough. I would be very wary about introducing an optimization that only works in a few compilers and requires a flag to enable, as it is going to increase the complexity of the code, and it would essentially become an obscure compilation flag that few people know about and take advantage of. Since these functions are not typically used in LLMs, I don't think it would be worth the maintenance burden. A more generic implementation e.g. using a lookup table would be better. |
I thought one ffast-math flag is simpler than manually using SIMD intrinsic. People who want a generic speedup turn it on application-wide.. The sine function is heavily used in text-to-speech LLMs. It is used by DAC (Dia, Parler, etc.), and Kokoro, among others LLMs. I previously used a lookup table that was modulo the sign and M_PI, leaving 1 exponent bit and 23 mantissa bits. However, the compiler failed to autovectorize it into _mm256_i32gather_ps. You suggested _mm256_sin_ps ((vector: 256) / (float: 32) = 8 elements), which is basically the same thing as the generated |
The complexity would come from introducing a different way to optimize functions. If somebody tomorrow wants to optimize this or other unary function using intrinsics, they will need to write a new path that will also need to be maintained. Maybe some day somebody will also figure that If |
Here is my plan:
Draftifying this until I get to rebasing the downstream fork. Using intrinsics requires more testing than compiler flags. |
End-user time savings of 12.99% was observed on a previous ggml version. This was upstreamed from my work at mmwillet#2 . I only need sine but cos is included for symmetry.
GCC supports this optimization. This is opt-in as libmvec's output differs from libm.
Before
ggml_compute_forward_sin wasn't vectorized at all. When calculating sine, it suffered the overhead of a call for each element. During profiling using an outdated fork, each GGML_OP_SIN was ~5–10 ms.
After
Now it works in blocks of 4 or 8 elements at a time and approximates more roughly. During profiling using an outdated fork, each GGML_OP_SIN was ~3 ms.