ggml-cpu : Add GGML_CPU_FFAST_MATH for sine autovectorization #1243


Draft · wants to merge 1 commit into master

Conversation

@danielzgtg (Contributor) commented May 28, 2025

End-user time savings of 12.99% were observed on a previous ggml version. This was upstreamed from my work at mmwillet#2 . I only need sine, but cosine is included for symmetry.

GCC supports this optimization. It is opt-in because libmvec's output differs from libm's.

Before

ggml_compute_forward_sin was not vectorized at all. When calculating sine, it suffered the overhead of a function call per element. During profiling with an outdated fork, each GGML_OP_SIN took ~5–10 ms.

$ nm libggml-cpu.so | grep sinf
                 U sinf@GLIBC_2.2.5

After

Now it processes blocks of 4 or 8 elements at a time, with a rougher approximation. During profiling with an outdated fork, each GGML_OP_SIN took ~3 ms.

$ nm libggml-cpu.so | grep sinf
                 U sinf@GLIBC_2.2.5
                 U _ZGVbN4v_sinf@GLIBC_2.22
                 U _ZGVdN8v_sinf@GLIBC_2.22

@danielzgtg danielzgtg changed the title ggml-cpu : Add GGML_OPENMP_FFAST_MATH for sine autovectorization ggml-cpu : Add GGML_CPU_FFAST_MATH for sine autovectorization May 28, 2025
@slaren (Member) commented May 31, 2025

Does this have any advantages over AVX intrinsics such as _mm256_sin_ps?

@danielzgtg (Contributor, Author)

_mm256_sin_ps is not available on any compiler besides Intel's.

@slaren (Member) commented May 31, 2025

That's unfortunate. It looks like it is also available on MSVC, but that still would not be enough. I would be very wary about introducing an optimization that only works on a few compilers and requires a flag to enable: it would increase the complexity of the code, and it would essentially become an obscure compilation flag that few people know about or take advantage of. Since these functions are not typically used in LLMs, I don't think it would be worth the maintenance burden. A more generic implementation, e.g. using a lookup table, would be better.

@danielzgtg (Contributor, Author) commented May 31, 2025

I thought a single ffast-math flag would be simpler than manually using SIMD intrinsics. People who want a generic speedup turn it on application-wide.

The sine function is heavily used in text-to-speech models. It is used by DAC (Dia, Parler, etc.) and Kokoro, among others.

I previously used a lookup table indexed modulo the sign and M_PI, leaving 1 exponent bit and 23 mantissa bits. However, the compiler failed to autovectorize it into _mm256_i32gather_ps.

You suggested _mm256_sin_ps ((vector: 256 bits) / (float: 32 bits) = 8 elements), which is basically the same thing as the generated _ZGVdN8v_sinf on Linux. I didn't do this at first because it would require splitting the vec_unary_op template. Should we try #define _mm256_sin_ps _ZGVdN8v_sinf under an #ifdef for GCC (and even Clang, where this define should also work)?

@slaren (Member) commented May 31, 2025

The complexity would come from introducing a different way to optimize functions. If somebody tomorrow wants to optimize this or another unary function using intrinsics, they will need to write a new path that will also need to be maintained. Maybe some day somebody will also figure out that #pragma omp simd works in some cases and add a new path for some functions. And very soon we will have completely unmaintainable code. Currently we use intrinsics and inline assembly to optimize code, so the obvious path with the least friction would be to continue doing it that way.

If _ZGVdN8v_sinf can be used as an alias of _mm256_sin_ps, I think that could work. I would hope that it would also work with clang, particularly on Windows.

@danielzgtg (Contributor, Author)

Here is my plan:

  • Use _ZGVdN8v_sinf with GCC and Clang on Linux. It is part of glibc, which is Linux-only.
  • Use _mm256_sin_ps with the Intel compiler on any platform.
  • Otherwise, use a lookup table if the flag (to be renamed GGML_CPU_FAST_TRIG) is still specified.
  • Undo my splitting into multiple files

#pragma omp simd is unnecessary, despite being where I got this idea from. Removing the pragma from the example and only adding ffast-math worked equally well.

Converting this to a draft until I get around to rebasing the downstream fork. Using intrinsics requires more testing than compiler flags do.

@danielzgtg danielzgtg marked this pull request as draft May 31, 2025 23:13