Skip to content

Help LLVM better vectorize reduction over skipmissing #43859

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
27 changes: 22 additions & 5 deletions base/missing.jl
Original file line number Diff line number Diff line change
Expand Up @@ -310,6 +310,15 @@ _mapreduce(f, op, ::IndexCartesian, itr::SkipMissing) = mapfoldl(f, op, itr)
mapreduce_impl(f, op, A::SkipMissing, ifirst::Integer, ilast::Integer) =
mapreduce_impl(f, op, A, ifirst, ilast, pairwise_blocksize(f, op))

# Some help function to make LLVM better vectorizing these reduction.
# It returns a `noop`, which make sure `op(v, noop) === v`
_fast_noop(::Union{typeof(add_sum),typeof(+)}, ::Type{T}, ::T) where {T<:Union{HWReal,Complex{<:HWReal}}} = zero(T)
_fast_noop(::Union{typeof(mul_prod),typeof(*)}, ::Type{T}, ::T) where {T<:HWReal} = one(T)
# TODO: min/max for IEEEFloat need manually unroll to vectorize.
_fast_noop(::Union{typeof(min),typeof(max)}, ::Type{T}, v::T) where {T<:Integer} = T <: HWReal ? v : nothing
# General fallback
_fast_noop(f, T, v) = nothing

# Returns nothing when the input contains only missing values, and Some(x) otherwise
@noinline function mapreduce_impl(f, op, itr::SkipMissing{<:AbstractArray},
ifirst::Integer, ilast::Integer, blksize::Int)
Expand Down Expand Up @@ -345,10 +354,19 @@ mapreduce_impl(f, op, A::SkipMissing, ifirst::Integer, ilast::Integer) =
i == typemax(typeof(i)) && return Some(op(f(a1), f(a2)))
i += 1
v = op(f(a1), f(a2))
@simd for i = i:ilast
@inbounds ai = A[i]
if ai !== missing
v = op(v, f(ai))
Comment on lines -348 to -351
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On master vectorizer shows the following to sum(skipmissing(Vector{Union{Missing,Float64}}(randn(4096)))) :

LV: Checking a loop in "julia_mapreduce_impl_2411" from simdloop.jl:75 @[ missing.jl:348 ]
LV: Loop hints: force=? width=0 interleave=0
LV: Found a loop: L137
LV: Found an induction variable.
LV: Not vectorizing: Found an unidentified PHI   %value_phi30161 = phi double [ %117, %L137.lr.ph ], [ %value_phi33, %L137 ]
LV: Interleaving disabled by the pass manager
LV: Can't vectorize the instructions or CFG
LV: Not vectorizing: Cannot prove legality.

And the following for sum(skipmissing(Vector{Union{Missing,Int}}(rand(-1:1,4096)))):

LV: Checking a loop in "julia_mapreduce_impl_2505" from simdloop.jl:75 @[ missing.jl:348 ]
LV: Loop hints: force=? width=0 interleave=0
LV: Found a loop: L137
LV: Found an induction variable.
LV: We can vectorize this loop!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weird that LV: Not vectorizing: Found an unidentified PHI %value_phi30161 = phi double [ %117, %L137.lr.ph ], [ %value_phi33, %L137 ] is incomplete there is only one: Found an unidentified PHI message and that is longer.

Copy link
Member Author

@N5N3 N5N3 Jan 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried the following:

function f(a)
    r = zero(eltype(a))
    @inbounds @simd for i in eachindex(a)
       if a[i] > 0
           r += a[i]
       end
   end
   r
end
function g(a)
   r = zero(eltype(a))
   @inbounds @simd for i in eachindex(a)
       r += a[i] > 0 ? a[i] : 0
   end
   r
end

g(randn(4096)) is vectorlized by LLVM while f(randn(4096)) not. I tried rewriting f(a) in c, the output of Clang shows some simd IR (Although the simd width is only 2). Maybe related with #31862?

# We need to make sure `fskip` is stable.
noop = _fast_noop(op, _return_type(f, Tuple{nonmissingtype(eltype(A))}), v)
if isnothing(noop)
Comment on lines +358 to +359
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do both branches provide bit-identical results? Otherwise, relying on the inference result like this is not an optimization and it introduces unpredictable behavior.

Copy link
Member Author

@N5N3 N5N3 Jan 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For bit-identity, I test it locally and there seems no problem. (Edit: I mean a single operation, not the entire reduction, the result of sum should be different after we vectorlizing this loop.)
_return_type(f, Tuple{nonmissingtype(eltype(A))}) was used to make sure fskip(x) = ismissing(x) ? noop : f(x) is stable. (In this sence, I agree that we'd better fix this at LLVM level)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following seems also work for the original purpose:

noop = _fast_noop(op, typeof(v), v) # 
@simd for i = i:ilast
     @inbounds ai = A[i]
     v = op(v, ismissing(ai) ? oftype(v, noop) : f(ai))
end

But for inputs like Union{Int,Int32,Missing} there're about 5%~10% speed regression. Not sure is it Ok.

@simd for i = i:ilast
@inbounds ai = A[i]
if !ismissing(ai)
v = op(v, f(ai))
end
end
else
@inline fskip(x) = ismissing(x) ? noop : f(x)
@inbounds @simd for i = i:ilast
v = op(v, fskip(A[i]))
end
end
return Some(v)
Expand Down Expand Up @@ -463,4 +481,3 @@ macro coalesce(args...)
end
return esc(:(let val; $expr; end))
end