Skip to content

WIP: The great pairwise reduction refactor #58418

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 57 commits into
base: master
Choose a base branch
from

Conversation

mbauman
Copy link
Member

@mbauman mbauman commented May 14, 2025

This is the grand culmination of (and core motivation for) all the reductions work I've slowly been chipping away at over the past year. It implements the following major changes:

  • All reduce-based reductions use a pairwise re-association by default. This includes seven different implementations:
    • Whole array reductions
      • IndexLinear (these were often pairwise already)
      • IndexCartesian (these were previously foldls without SIMD)
      • Iterators with Shape or Length (previously foldls without SIMD)
      • Iterators with unknown length (previously foldls without SIMD)
    • Dimensional array reductions
      • Contiguous column major dimensions (only IndexLinear were pairwise before)
      • Discontiguous column major dimensions
      • Row major dimensions
  • All of these implementations do their inner work through a singular mapreduce_kernel function (with specific signatures for arrays and iterators) after performing the pairwise splits and/or dimensional selection. This could eventually become a public API that should help alleviate some of the disruption to those who had been extending the internals before. See, for example, how LinearAlgebra replaced its internal and tricky _sum overloads with this method: JuliaLang/LinearAlgebra.jl@e70953e.
  • The dimensional array reductions no longer guess at initialization. This fully replaces my previous attempt in WIP: a more principled take on dimensional reduction inits #55318, and is at the core of what I want to address here — the guesswork here is what was responsible for many bugs.
  • All of those implementations consistently use init to start every reduction chain as we decided the documentation should direct implementations to do in Document that reduce-like functions use init one or more times #53945. They also consistently use mapreduce_first.

This currently stands atop the yet-to-be-merged #58374 and #55628.

As the successor to #55318, this carries all its tests and addresses all those issues in the dimensional reductions. Closes: #45566, closes #47231, closes #55213, closes #31427, closes #41054, closes #54920, closes #42259, closes #38660, closes #21097, closes #32366, closes #54875, closes #43731, closes #45562.

In addition, this further brings pairwise reassociations (and SIMD!) to the other class of reduction. Closes #29883, closes #52365, closes #30421, closes #52457, closes #39385.

There are other pull requests that this supersedes that were inspirational for my work here. In particular, this supersedes and would close #45651 (pairwise whole-array IndexCartesian), closes #52397 (pairwise whole-array Iterators.SizeUnknown). It also supersedes a few other PRs; would close #45822 (findmin/max for offset arrays; I've added those tests here), close #58241 (my PR for whole-array reductions init refactor that snowballed into this one).

Yet to do:

mbauman and others added 20 commits May 12, 2025 11:07
(cherry picked from commit 4bee191)
for accumulate inference

(cherry picked from commit 4fadd8f)
Previously, Julia tried to guess at initial values for empty dimensional reductions in a slightly different way than whole-array reductions. This change unifies those behaviors, such that `mapreduce_empty` is called for empty dimensional reductions, just like it is called for empty whole-array reductions.

Beyond the unification (which is valuable in its own right), this change enables some downstream simplifications to dimensional reductions and is likely to be the "most breaking" public behavior in a refactoring targeting more correctness improvements.

(cherry picked from commit 94f24b8)
they already supported axes through the default fallback that uses size to construct OneTos
and this was causing ambiguities
(cherry picked from commit 32fc16b)
@mbauman mbauman marked this pull request as draft May 14, 2025 23:03
@giordano giordano added needs nanosoldier run This PR should have benchmarks run on it fold sum, maximum, reduce, foldl, etc. needs tests Unit tests are required for this change labels May 14, 2025
@mbauman mbauman added needs news A NEWS entry is required for this change needs docs Documentation for this change is required labels May 14, 2025
@tecosaur
Copy link
Member

Fantastic stuff, this is really exciting to see! 🥳

length(t::ImmutableDict) = count(Returns(true), t)
# length is defined in terms of iteration; using a higher-order function to do the iteration
# is likely to call `length` again, so this is a manual for loop:
function length(t::ImmutableDict)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

won't this be horribly slow?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, wasn't it already doing this? This shouldn't be any worse. It's definitely faster than infinite recursion :)

The trouble is that ImmutableDict has an IteratorSize of HasLength. And so the new reduction machinery happily calls length to determine the pairwise splits. So if length itself uses reductions (like count) to calculate length, we're in trouble.

On nightly:

julia> using BenchmarkTools

julia> d = Base.ImmutableDict((i => i+1 for i in 1:200)...);

julia> @benchmark length($d)
BenchmarkTools.Trial: 10000 samples with 641 evaluations per sample.
 Range (min  max):  193.512 ns  392.225 ns  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     193.772 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   196.869 ns ±  11.655 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▁  ▁▃▂▄▂                                                     ▁
  ███████████▇██▇▇▆▇▆▆▆▅▅▇▆▆▅▆▅▅▆▅▅▅▅▅▅▅▄▄▅▅▅▅▂▄▃▃▄▄▄▄▃▂▂▂▃▄▂▅▄ █
  194 ns        Histogram: log(frequency) by time        241 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @eval Base function length(t::ImmutableDict)
           r = 0
           for _ in t
               r+=1
           end
           return r
       end
length (generic function with 95 methods)

julia> @benchmark length($d)
BenchmarkTools.Trial: 10000 samples with 631 evaluations per sample.
 Range (min  max):  193.542 ns  366.482 ns  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     193.740 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   196.829 ns ±  11.533 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █   ▁ ▄ ▂                                                     ▁
  ███████▇███▇██▇▇▇▇▆▇▆▆▅▆▆▆▇▆▆▆▅▅▅▅▅▅▄▅▆▅▄▅▅▅▅▄▄▅▄▄▄▅▅▄▄▃▄▄▃▃▄ █
  194 ns        Histogram: log(frequency) by time        238 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah that's unfortunate...

@jakobnissen
Copy link
Member

Worth seeing if it also closes #43501

@mbauman
Copy link
Member Author

mbauman commented May 27, 2025

OK, getting better. And there's still a good amount of low-hanging that should resolve more. I've fixed a few items, but some outstanding ones in the top ten listed (across the failure modes) that I haven't fully investigated or fixed or flagged upstream yet:

  • There are the failures from bad (non-neutral) inits from Whole-array reductions always use init to start each reduction chain #58241 (OnlineStats & AcceleratedKernels)
  • Calling norm(itr::ComponentArrays.ComponentVector{Float64, Vector{Float64}, Tuple{ComponentArrays.Axis{(p1 = 1, p2 = 2, p3 = 3, p4 = 4, p5 = 5)}}}) ends up indexing into the axes of the component vector in a way that seems to be unexpected and is leading to an ambiguity error: getindex(::ComponentArrays.CombinedAxis{ComponentArrays.Axis{(p1 = 1, p2 = 2, p3 = 3, p4 = 4, p5 = 5)}, Base.OneTo{Int64}}, ::UnitRange{Int64}) is ambiguous in the mapreduce_kernel where we do the inds[begin+1:end] loop.
  • ColorVectorSpace and YaoArrayRegister checks exact floating point equality in a reduction
  • MatrixNetworks has some perf tests and we're 3x slower than it expects at https://github.com/JuliaGraphs/MatrixNetworks.jl/blob/ff4ea518385509b83de24e34cde5eadb430bf5f4/test/diffusions_test.jl#L227-L230. Not sure why yet
  • DiskArrays and DiskArrayTools explicitly tests how many times it was indexed into during a dimensional sum (dims=1). It expects to be indexed into 10 times, but it's reporting 60. This is surely because the DiskArrays.jl specialized on Base._mapreduce to implement some sort of chunked optimization — that's now no longer taking effect.
  • Brillouin.jl gets a -0.0 out of cartesianize but expects a positive zero. The MWE there is in StaticArrays:
    julia> vert = [  0.6666666666666665, -0.33333333333333326,  0.49999999999999983];
    
    julia> basis = [ [6.283185307179586, 3.6275987284684357, 0.0], [0.0, 7.255197456936871, 0.0], [0.0, 0.0, 5.026548245743669]];
    
    julia> vert' * basis
    3-element Vector{Float64}:
     4.18879020478639
     0.0
     2.5132741228718336
    
    julia> SVector{3}(vert)' * SVector{3}(SVector{3}.(basis))
    3-element SVector{3, Float64} with indices SOneTo(3):
      4.18879020478639
     -5.5126462140537006e-17
      2.5132741228718336
    
    julia> sum(uu*vv for (uu, vv) in zip(SVector{3}(vert)', SVector{3}(SVector{3}.(basis))))
    3-element SVector{3, Float64} with indices SOneTo(3):
      4.18879020478639
     -5.5126462140537006e-17
      2.5132741228718336
  • ChainRulesCore has a NoTangent iterator that claims to have length (well, it's the default) but does not. Similarly for IterativeSolvers.IDRSIterable.
  • StatsBase is picking up the wrong Statistics?
  • FillArrays and BandedMatrices is an independent breakage on master (that's since been fixed, looks like)
  • Clustering is a SparseArrays problem, calling widelength on a subarray of a sparse in the new mapreduce kernel. Should be an easy fix there.
  • JuliaInterpreter expects to hit length(LinearIndices(a)) in the sum implementation, which we no longer call
  • Measurements.Derivatives uses count to determine its length
  • Strided.jl uses reducedim_init in ways that aren't supported by mapreduce_empty.

@giordano
Copy link
Member

That was based on the definition of length(::ImmutableDict), which you changed here. Could be changed there as well to mirror the new definition.

@mbauman
Copy link
Member Author

mbauman commented May 27, 2025

OK, the dot product of StaticArrays is a fascinating one. This is precisely the case I conjectured about above — the * here is getting marked contract and fusing with the reduction operator in the @simd loop. It's the difference between:

julia> x = [0.6666666666666665, -0.33333333333333326];

julia> y = [3.6275987284684357, 7.255197456936871];

julia> x[1]*y[1] + x[2]*y[2]
0.0

julia> fma(x[2], y[2], x[1]*y[1])
-5.5126462140537006e-17

julia> using LinearAlgebra: BLAS

julia> BLAS.dot(x,y)
-5.5126462140537006e-17

This is a classic example — most commonly seen in complex arithmetic. You definitely don't want complex * to use fma. Is this similar?

julia> sum(x->x*1.0, [-.1,.1])
0.0

julia> sum(x->x*0.1, [-.1,.1])
-8.326672684688674e-19

That's... kinda weird, right? The above is on this branch, but you can also observe this on master in slightly more contrived examples:

julia> sum(x->x*0.1, [-.1,.1])
0.0

julia> sum(x->x*0.1, [-.1; .1; zeros(13)])
0.0

julia> sum(x->x*0.1, [0.0; -.1; .1; zeros(13)])
-8.326672684688674e-19

I suppose the point is that — if you're fusing a * into a sum, then you definitionally have the above asymmetry in precision. But... BLASes have long done this in their dot-products, so perhaps I'm just being numerically naive.

@vtjnash
Copy link
Member

vtjnash commented May 27, 2025

StaticArrays explicitly requests the fastmath version of fma here: https://github.com/JuliaArrays/StaticArrays.jl/blob/563adebc264a23d38da97bf5d9e5d3c6333bdf26/src/matrix_multiply_add.jl#L428

@mbauman
Copy link
Member Author

mbauman commented May 27, 2025

I have benchmarks! They're... ok. A bit of a mixed bag.

The code
using Chairmarks: Chairmarks, @be
using CSV: CSV
using ProgressMeter: Progress, next!

struct LengthUnknown{T}
    itr::T
end
@inline Base.iterate(x::LengthUnknown) = iterate(x.itr)
@inline Base.iterate(x::LengthUnknown, s) = iterate(x.itr, s)
Base.IteratorSize(::LengthUnknown) = Base.SizeUnknown()
Base.IteratorEltype(::LengthUnknown) = Base.HasEltype()
Base.eltype(u::LengthUnknown) = Base.eltype(u.itr)

using Random: Random
const rng = Random.MersenneTwister(1155)
rand(sz...) = Random.rand(rng, sz...)
rand(::Type{T}, sz...) where {T} = Random.rand(rng, T, sz...)

haslength(x) = Iterators.take(x,length(x))
function cartesianvec(x)
    n = length(x)
    n1 = nextpow(2, sqrt(n))
    return @view reshape(x, n1, :)[begin:end, begin:end]
end

function bench_wholearray(Ns, Ts)
    results = (; style=String[], eltype=String[], init=Bool[], dims=Union{Missing,String}[], n=Int[], time=Float64[])
    P = Progress(4*4*2*length(Ts))
    for (xf,type) in [(identity, "IndexLinear"), (cartesianvec, "IndexCartesian"), (haslength, "HasLength"), (LengthUnknown, "SizeUnknown")],
        (xf,type) in [(xf, type),
                      (x->skipmissing(xf(Array{Union{Missing, eltype(x)}}(x))), "SkipMissing $type 0%"),
                      (x->skipmissing(xf(Array{Union{Missing, eltype(x)}}([rand() < .5 ? missing : x for x in x]))), "SkipMissing $type 50%"),
                      (x->skipmissing(xf(Array{Union{Missing, eltype(x)}}(fill(missing, size(x))))), "SkipMissing $type 100%"),],
        T in Ts,
        init in (nothing, zero(T))
            ts = [begin
                A = xf(rand(T, N÷sizeof(T)))
                if isempty(A) && isnothing(init) && try mapreduce(identity, +, A); false; catch; true; end
                    # Empty SkipMissings don't define reduce_empty; skip it.
                    [(;time = NaN,)]
                elseif isnothing(init)
                    @be(mapreduce(identity, +, $A))
                else
                    @be(mapreduce(identity, +, $A; init=$init))
                end
            end for N in Ns]
            append!(results.time, getproperty.(minimum.(ts), :time))
            append!(results.n, Ns)
            append!(results.dims, fill(missing, length(ts)))
            append!(results.eltype, fill(string(T), length(ts)))
            append!(results.style, fill(type, length(ts)))
            append!(results.init, fill(!isnothing(init), length(ts)))
            next!(P)
    end
    results
end

function bench_dims(Ns, Ts)
    results = (; style=String[], eltype=String[], init=Bool[], dims=Union{Missing,String}[], n=Int[], time=Float64[])
    P = Progress(2*8*2*length(Ts))
    for (xf,type) in [(identity, "IndexLinear"), (x->@view(x[1:end,1:end,1:end,1:end]), "IndexCartesian")],
        (szf, dims) in [((n,_,_)->(n÷4,4), 1), ((n,_,_)->(4,n÷4), 2),
                        ((n,n2,_)->(n2,n÷n2÷4,4), (1,2)),
                        ((n,n2,_)->(n2,4,n÷n2÷4), (1,3)),
                        ((n,n2,_)->(4,n2,n÷n2÷4), (2,3)),
                        ((n,_,n3)->(n3,n3,n÷n3÷n3÷4,4), (1,2,3)),
                        ((n,_,n3)->(n3,n3,4,n÷n3÷n3÷4), (1,2,4)),
                        ((n,_,n3)->(n3,4,n3,n÷n3÷n3÷4), (1,3,4)),
                        ((n,_,n3)->(4,n3,n3,n÷n3÷n3÷4), (2,3,4))],
        T in Ts,
        init in (nothing, zero(T))
                ts = [begin
                    n = N÷sizeof(T)
                    A = xf(rand(T, szf(n, nextpow(2, sqrt(n)), nextpow(2, cbrt(n)))))
                    if isnothing(init)
                        @be(mapreduce(identity, +, $A; dims=$dims))
                    else
                        @be(mapreduce(identity, +, $A; dims=$dims, init=$init))
                    end
                end for N in Ns]
                append!(results.time, getproperty.(minimum.(ts), :time))
                append!(results.n, Ns)
                append!(results.eltype, fill(string(T), length(ts)))
                append!(results.style, fill(type, length(ts)))
                append!(results.init, fill(!isnothing(init), length(ts)))
                append!(results.dims, fill(string(dims), length(ts)))
                next!(P)
    end
    return results
end

whole = bench_wholearray(2 .^ (3:22), (Int32,Int64,Float32,Float64))
dims = bench_dims(2 .^ (5:22), (Int32,Int64,Float32,Float64))
foreach(Base.splat(append!), zip(whole, dims))
CSV.write(Base.GIT_VERSION_INFO.commit_short * ".csv", whole)

This gives us 744e3fd976.csv vs c5df01807e3.csv (nightly from a few days ago).

So we can break this down over a wide variety of cases.

image image

Most of the slow cases here are the "small n":

image image

The variance is pretty high on SkipMissing — Julia's inference really doesn't like iteration's tuple of Union{Missing, ...}. The few cases with 10x+ slowdown are all instabilities of that sort.

@mbauman
Copy link
Member Author

mbauman commented May 28, 2025

OK, I've implemented specialized kernels for SkipMissing{<:AbstractArray} that mirror the old implementation. That's resolved the most egregious perf issues now; none of tested implementations are worse than a 5x regression (and quite a few are 5x+ faster). Still haven't tested Sparse perf, though, there's definitely more work to be done there.

Here are the latest benchmarks: 144c253358.csv. The remaining big regressions are in mid-size dimensional reductions, and in particulardims=(1,2,4).

@mbauman
Copy link
Member Author

mbauman commented May 28, 2025

OK, awesome, I see what's happening here. When we're doing a dimensional reduction over an array with a singleton dimension, we have a choice whether that dimension should be included in the reduction — it doesn't matter whether it was included in the dims or not! And it's advantageous to include it on leading dimensions (which I had already optimized) and advantageous to remove it on trailing dimensions. The biggest gaps to master right were all in cases where we have a singleton trailing dimension that was flagged as being part of the reduction. E.g., sum(rand(3,2,1), dims=(1,3)). That's the same as the simpler and more efficient dims=1! Easy enough fix.

That's why it was the "mid-size" reductions like (1,2,4) that were appearing as big regressions — those are the ones that have that trailing singleton dimension! Now all the tested benchmarks are within 3x, getting better...

30c8afbc5e.csv

image

There are 260 benchmarks with a 2x or worse regression compared to c5df01807e3.csv. Most (249/260) are dimensional reductions. The whole array ones are either small (n<=64) SizeUnknown iterables or very large (n>million) entirely missing (that is, empty) SkipMissing{<:Array}. The dimensional reductions are mostly IndexCartesian (190/249), and mostly discontiguous column major like (1,3) or (1,2,4) (150/249).

@mbauman mbauman force-pushed the mb/ReduceReuse♻ branch from 00c4f51 to 4807a12 Compare July 3, 2025 17:13
@mbauman
Copy link
Member Author

mbauman commented Jul 3, 2025

OK, I must confess I've never really processed how SIMD doesn't really re-associate things. It fundamentally changes the entire ordering. It goes far above and beyond what we promise reduce will do... of course all the SIMD-able operations are themselves (putatively) commutative, so it's not really observable outside of bit-exact floating point operations. I think it'd be very good to completely remove @simd from the implementations here — or at least get rid of it from the generic ones. Instead, we can introduce specific SIMD-like reorderings for commutative operations. This can actually lead to improved performance over @simd in cases where the data are not stored in a data structure that directly supports a SIMD gather. And explicitly encoding the re-ordering here would fully fix #29883 and #52457; mean(X, dims=1) would be exactly equal to mean(X', dims=2). And if we hard-code the re-ordering, it should completely remove architecture-dependencies from the results of these fundamental operations.

It's not uniformly a win, however. Hard-coding such a re-ordering means that we no longer let @simd pick the best one for the current CPU and element type. This PR now just picks one that works fairly well across the board. I've definitely taken a step backwards from the previous benchmarks. The worst offenders are the containers of Any and small Unions, and mostly where n (and the time itself) is very small. In terms of pure counts, there are 5124 benchmarks in my current battery (going over various sizes, array types, iterator types, dimensions, etc.). With 4807a12 (4807a12823.csv), I see:

  • 0.9x runtime (or less): 1830 benchmarks (36%)
  • 1.1x runtime (or less): 3196 benchmarks (62%)
  • 1.5x runtime (or less): 4031 benchmarks (79%)
  • 2x runtime (or less): 4579 benchmarks (89%)

A bunch of the 2x regressions are the 32-bit ones, because we're no longer saturating my M1's simd width. And those that are much worse than that are mostly small n and/or non-concrete. I think I can add some threshold tuning to improve some of that.

benchmarks with concrete eltypes

benchmarks with non-concrete eltypes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fold sum, maximum, reduce, foldl, etc. needs docs Documentation for this change is required needs nanosoldier run This PR should have benchmarks run on it needs news A NEWS entry is required for this change needs tests Unit tests are required for this change
Projects
None yet