
redo sinpi and cospi kernel polynomial approximation #59031


Draft · nsajko wants to merge 6 commits into master from sinpi_cospi_kernel_polynomial_approximations

Conversation

@nsajko nsajko (Contributor) commented Jul 18, 2025

  • Make the Float32 variants use the same evaluation scheme already used for Float64, instead of computing in a wider arithmetic and then rounding back to the narrower arithmetic. This of course costs some accuracy, but it seems more appropriate not to rely on other arithmetics. Even though the accuracy is worse, the result still appears experimentally to be faithful for every argument in the kernel's domain, with the ULP error below 1. (A sketch contrasting the two schemes follows this list.)

  • Regenerate the polynomials for Float64, too, improving accuracy. The degrees of the polynomials stay the same.
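
For illustration, here is a minimal sketch of the two schemes, assuming a hypothetical degree-2 sinpi kernel whose coefficients are just the truncated Taylor series of sin(πx) (the actual Base kernels use higher-degree minimax polynomials):

# Old scheme: evaluate in Float64, then round the result back to Float32.
function sinpi_kernel_wide(x::Float32)
    y = Float64(x)
    Float32(y * evalpoly(y^2, (3.141592653589793, -5.16771278004997, 2.5501640398773455)))
end

# New scheme: evaluate directly in Float32, mirroring the Float64 code path.
function sinpi_kernel_native(x::Float32)
    x * evalpoly(x^2, (3.1415927f0, -5.167713f0, 2.550164f0))
end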

Max errors over the domain (0 to 0.25) of the kernels:

before
  func = cospi
    type = Float32
      (a, b) = (0.0, 0.25)
      max = 0.50000286f0
    type = Float64
      (a, b) = (0.0, 0.25)
      max = 0.90289307f0
  func = sinpi
    type = Float32
      (a, b) = (0.0, 0.25)
      max = 0.5000706f0
    type = Float64
      (a, b) = (0.0, 0.25)
      max = 0.6906576f0

after
  func = cospi
    type = Float32
      (a, b) = (0.0, 0.25)
      max = 0.8907261f0
    type = Float64
      (a, b) = (0.0, 0.25)
      max = 0.9042959f0
  func = sinpi
    type = Float32
      (a, b) = (0.0, 0.25)
      max = 0.9644079f0
    type = Float64
      (a, b) = (0.0, 0.25)
      max = 0.6577196f0

@nsajko nsajko added the maths Mathematical functions label Jul 18, 2025
@nsajko nsajko (Contributor, Author) commented Jul 18, 2025

My laptop is currently running a search for max ULP errors which will last several hours.

@oscardssmith (Member) commented:

@nsajko At least for Float16 and Float32, it shouldn't take nearly that long (especially if you use the x axis symmetry):

julia> function testf(f::F, T) where F
           # Exhaustively scan every positive normal value of T (negatives are
           # covered by symmetry), tracking the worst ULP error of f against a
           # reference computed in the next wider type.
           maxulp = widen(zero(T))
           argmaxulp = T(NaN)
           for i in reinterpret(Unsigned, floatmin(T)):reinterpret(Unsigned, floatmax(T))
               x = reinterpret(T, i)
               fx = f(x)            # value under test
               fxbig = f(widen(x))  # wider-precision reference
               ulp = T(abs(fxbig - fx)) / eps(max(abs(fx), floatmin(T)))
               if ulp > maxulp
                   maxulp = ulp
                   argmaxulp = x
               end
           end
           return (argmaxulp, maxulp)
       end
testf (generic function with 1 method)

julia> @time testf(cospi, Float32)
 34.722836 seconds (3 allocations: 48 bytes)
(0.42245033f0, 0.50007045f0)
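
For Float16 the same exhaustive scan is essentially instant, since there are only 2^16 bit patterns to try:

testf(cospi, Float16)  # scans all positive normal Float16 inputs in well under a second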

@oscardssmith (Member) commented:

Also, IMO doing doublefloat Float16 almost certainly isn't worth it. No hardware that I'm aware of has slow enough Float32 performance to make that worthwhile.

@vchuravy (Member) commented:

Also, IMO doing doublefloat Float16 almost certainly isn't worth it. No hardware that I'm aware of has slow enough Float32 performance to make that worthwhile.

:trollface: future GPU/TPUs

@oscardssmith oscardssmith (Member) commented Jul 18, 2025

@vchuravy even there, are there systems where FP16 throughput is >4x FP32? I can see the argument for minimizing FP32 usage in FP16 impls, but going to FP16x2 seems unlikely to be better.

@nsajko nsajko (Contributor, Author) commented Jul 20, 2025

for Float16, Float32, it shouldn't take nearly that long

My code is slower because I'm using BigFloat to be extra sure of the results.
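
For reference, a BigFloat-based check along these lines (a hypothetical sketch, not the actual script) would look like:

setprecision(BigFloat, 256)  # plenty of guard bits for the reference values

# Hypothetical helper: ULP error of f(x) measured against a BigFloat reference.
function ulp_error(f::F, x::T) where {F, T<:AbstractFloat}
    fx = f(x)             # value under test
    ref = f(BigFloat(x))  # 256-bit reference
    T(abs(ref - fx)) / eps(max(abs(fx), floatmin(T)))
end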

@nsajko nsajko (Contributor, Author) commented Jul 20, 2025

The accuracy for cospi(::Float64) seems like it may actually have regressed due to the coefficients being regenerated with Sollya, from 0.90289307f0 ULP to 0.9042959f0 ULP. That said, it may just be a sampling artifact, since the Float64 space cannot be searched exhaustively.
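
(Since an exhaustive Float64 scan is infeasible, any such search samples the kernel domain; a hypothetical sketch of a random-sampling estimate:)

# Hypothetical: estimate the max ULP error of a Float64 kernel over [0, 0.25]
# by random sampling against a BigFloat reference.
function sample_maxulp(f::F, n::Integer) where F
    maxulp = 0.0
    for _ in 1:n
        x = 0.25 * rand()  # kernel domain is [0, 0.25]
        fx = f(x)
        ulp = Float64(abs(f(BigFloat(x)) - fx)) / eps(max(abs(fx), floatmin(Float64)))
        maxulp = max(maxulp, ulp)
    end
    return maxulp
end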

@nsajko nsajko (Contributor, Author) commented Jul 20, 2025

A single test fails:

Error in testset math:
Test Failed at /cache/build/tester-amdci5-9/julialang/julia-master/julia-c968bd880c/share/julia/test/math.jl:634
  Expression: ≈(cosc(0.14), -0.4517331883801308, rtol = 1.0e-15)
   Evaluated: -0.4517331883801315 ≈ -0.4517331883801308 (rtol=1.0e-15)
ERROR: LoadError: Test run finished with errors
in expression starting at /cache/build/tester-amdci5-9/julialang/julia-master/julia-c968bd880c/share/julia/test/runtests.jl:99

It tests the accuracy of cosc(0.14) for Float64. The relevant cosc method is currently implemented in terms of cospi and sinpi:

julia/base/special/trig.jl

Lines 1129 to 1136 in c1c091e

# hard-code Float64/Float32 Taylor series, with coefficients
# Float64.([(-1)^n*big(pi)^(2n)/((2n+1)*factorial(2n-1)) for n = 1:6])
_cosc(x::Union{Float64,ComplexF64}) =
    fastabs(x) < 0.14 ? x*evalpoly(x^2, (-3.289868133696453, 3.2469697011334144, -1.1445109447325053, 0.2091827825412384, -0.023460810354558236, 0.001781145516372852)) :
    isinf_real(x) ? zero(x) : ((pi*x)*cospi(x)-sinpi(x))/((pi*x)*x)
_cosc(x::Union{Float32,ComplexF32}) =
    fastabs(x) < 0.26f0 ? x*evalpoly(x^2, (-3.289868f0, 3.2469697f0, -1.144511f0, 0.20918278f0)) :
    isinf_real(x) ? zero(x) : ((pi*x)*cospi(x)-sinpi(x))/((pi*x)*x)
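
At x = 0.14 the fastabs(x) < 0.14 branch is not taken, so the failing value comes from the closed form, and hence directly from cospi and sinpi. The Taylor branch exists because that closed form cancels catastrophically as x approaches 0; a quick hypothetical demonstration:

# For tiny x, (pi*x)*cospi(x) and sinpi(x) agree in almost every bit, so the
# subtraction wipes out nearly all significant digits before the division.
x = 1e-8
direct = ((pi*x)*cospi(x) - sinpi(x)) / ((pi*x)*x)  # mostly rounding noise
series = -3.289868133696453 * x                     # leading Taylor term, -pi^2/3 * x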

At x = 0.14 this PR indeed returns a slightly less accurate result than master for sinpi: the error goes from 0.49043456f0 ULP to 0.5095654f0 ULP.

It could be argued that the test is overly strict; however, I'll reimplement cosc for Float32 and Float64 instead of relaxing it.

EDIT: PR #59087

@giordano (Member) commented:

even there, are there systems where FP16 throughput is >4x FP32?

I believe that's the case already for TPUs?

@nsajko nsajko (Contributor, Author) commented Jul 24, 2025

Regarding Float16, I basically gave up on that because getting the error to below 1 ULP without reaching out to a wider type seems challenging.

nsajko added 6 commits July 24, 2025 21:21
* Make the `Float16` and `Float32` variants use the same evaluation
  scheme already used for `Float64`, instead of computing in a wider
  arithmetic and then rounding back to the narrower arithmetic. This of
  course costs some accuracy, but it seems more appropriate not to rely
  on other arithmetics. Even though the accuracy is worse, the result
  still appears experimentally to be faithful for each argument in the
  kernel's domain in every case except `sinpi(::Float16)`, which has a
  max error of about 1.4 ULP.

* Regenerate the polynomials for `Float64`, too, improving accuracy.
  The degrees of the polynomials stay the same.

Closes JuliaLang#58977
A future PR or a future commit in this PR could also regenerate the
polynomials for the wide kernels for `Float16` and `Float32`.
The old `sinpi` value had an error of 0.56462383 ULP, now it's just
0.4353762 ULP. Given that the result is more accurate now, I just
updated the doc test.
Relax the doc test, don't check exact floating-point equality.
@nsajko nsajko force-pushed the sinpi_cospi_kernel_polynomial_approximations branch from a8f1974 to 4af0f03 on July 24, 2025 at 19:22