ggml-cpu : Add GGML_CPU_FFAST_MATH for sine autovectorization #1243


Draft · wants to merge 1 commit into master

Conversation

@danielzgtg (Contributor) commented May 28, 2025

End-user time savings of 12.99% were observed on a previous ggml version. This was upstreamed from my work at mmwillet#2 . I only need sine, but cosine is included for symmetry.

GCC supports this optimization. It is opt-in because libmvec's output differs from libm's.

Before

ggml_compute_forward_sin was not vectorized at all. When calculating sine, it suffered the overhead of a function call per element. During profiling with an outdated fork, each GGML_OP_SIN took ~5–10 ms.

$ nm libggml-cpu.so | grep sinf
                 U sinf@GLIBC_2.2.5

After

Now it processes blocks of 4 or 8 elements at a time, with a rougher approximation. During profiling with an outdated fork, each GGML_OP_SIN took ~3 ms.

$ nm libggml-cpu.so | grep sinf
                 U sinf@GLIBC_2.2.5
                 U _ZGVbN4v_sinf@GLIBC_2.22
                 U _ZGVdN8v_sinf@GLIBC_2.22

@danielzgtg danielzgtg changed the title ggml-cpu : Add GGML_OPENMP_FFAST_MATH for sine autovectorization ggml-cpu : Add GGML_CPU_FFAST_MATH for sine autovectorization May 28, 2025
@slaren (Member) commented May 31, 2025

Does this have any advantages over AVX intrinsics such as _mm256_sin_ps?

@danielzgtg (Contributor, Author)

_mm256_sin_ps is not available on any compiler besides Intel's.

@slaren (Member) commented May 31, 2025

That's unfortunate. It looks like it is also available on MSVC, but that still would not be enough. I would be very wary about introducing an optimization that only works on a few compilers and requires a flag to enable: it would increase the complexity of the code, and it would essentially become an obscure compilation flag that few people know about or take advantage of. Since these functions are not typically used in LLMs, I don't think it would be worth the maintenance burden. A more generic implementation, e.g. using a lookup table, would be better.

@danielzgtg (Contributor, Author) commented May 31, 2025

I thought a single ffast-math flag would be simpler than manually using SIMD intrinsics. People who want a generic speedup turn it on application-wide.

The sine function is heavily used in text-to-speech models. It is used by DAC (Dia, Parler, etc.) and Kokoro, among others.

I previously used a lookup table indexed modulo the sign and M_PI, leaving 1 exponent bit and 23 mantissa bits. However, the compiler failed to autovectorize it into _mm256_i32gather_ps.

You suggested _mm256_sin_ps ((vector: 256 bits) / (float: 32 bits) = 8 elements), which is basically the same thing as the generated _ZGVdN8v_sinf on Linux. I didn't do this at first because it would require splitting the vec_unary_op template. Should we try #define _mm256_sin_ps _ZGVdN8v_sinf under an #ifdef for GCC (and even Clang, where this define should also work)?

@slaren (Member) commented May 31, 2025

The complexity would come from introducing a different way to optimize functions. If somebody tomorrow wants to optimize this or another unary function using intrinsics, they will need to write a new path that will also need to be maintained. Maybe some day somebody will also figure out that #pragma omp simd works in some cases and add a new path for some functions. And very soon we will have completely unmaintainable code. Currently we use intrinsics and inline assembly to optimize code, so the obvious path with the least friction would be to continue doing it that way.

If _ZGVdN8v_sinf can be used as an alias of _mm256_sin_ps, I think that could work. I would hope that it would also work with clang, particularly on Windows.

@danielzgtg (Contributor, Author)

Here is my plan:

  • Use _ZGVdN8v_sinf with GCC and Clang on Linux. It is part of glibc, which is Linux-only.
  • Use _mm256_sin_ps with the Intel compiler on any platform.
  • Otherwise, use a lookup table if the flag (to be renamed GGML_CPU_FAST_TRIG) is still specified.
  • Undo my splitting into multiple files

#pragma omp simd is unnecessary, despite being where I got this idea from. Removing the pragma from the example and only adding ffast-math worked equally well.

Converting this to a draft until I get around to rebasing the downstream fork. Using intrinsics requires more testing than compiler flags do.

@danielzgtg danielzgtg marked this pull request as draft May 31, 2025 23:13