Description
I have some comments regarding correctness issues with minimum and maximum. One is about the handling of signed zero for floating points and the other about short circuiting for NaN and missing. I believe the current implementation favours performance over correctness, something I think one should be extremely careful with.
Before discussing the issues I want to give some context about the minimum and maximum functions and why I think it is extra important that these methods are correctly implemented. The call minimum(A::AbstractArray) eventually calls mapreduce(identity, min, A), and similarly minimum(f, A) eventually calls mapreduce(f, min, A). After some more indirection this calls Base._mapreduce(f, min, IndexLinear(), A). This method has some special cases if the length of A is small; more precisely it handles length(A) < 16 directly, and otherwise it calls Base.mapreduce_impl(f, min, A, firstindex(A), lastindex(A)). So far everything uses the default implementations of the reduction functions and there is no special handling of the min argument, but Base.mapreduce_impl does have a specialised method for min and max, found here.
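For concreteness, here is a small illustration of the two code paths (this calls Base internals directly and assumes the dispatch chain described above, which may of course change between Julia versions):

A = rand(20)

# length(A) >= 16: minimum(A) ends up in the specialised min/max method
Base.mapreduce_impl(identity, min, A, firstindex(A), lastindex(A))

# length(A) < 16: handled directly by Base._mapreduce with the generic reduction
Base._mapreduce(identity, min, IndexLinear(), rand(10))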
The above indirection leads to three potential issues:
- The implementation used for the reduction depends on the length of the vector. This makes it easy to miss bugs or correctness issues since it is common to only write tests for short vectors.
- It is impossible to dispatch on the return type of f. This means that if Base.mapreduce_impl(f, min, A, firstindex(A), lastindex(A)) makes some optimizations that are not valid for your custom type T, there is no way to handle this. For most Julia functions you can create a custom method by dispatching on your type, but here this is not possible. This calls for being extra careful that the implementation is correct in all possible cases.
- It might be possible to document any possible correctness issues for minimum and maximum. However, since they call mapreduce(f, min, A) you would get the same correctness issues if you for example call reduce(min, A), and it is not really possible to add documentation to reduce and mapreduce about special handling of min and max. Again this calls for being extra careful with correctness since it is not possible to document it fully.
Now to the actual issues. I give all examples using minimum but maximum has exactly the same issues. Whenever I refer to timings they are for an AMD Ryzen 9 5900X.
Handling of signed zeros
The main optimization done in the implementation of Base.mapreduce_impl(f, min, A, firstindex(A), lastindex(A)) is in the way it handles signed zeros. It calls _fast(min, x, y), which is defined by
_fast(::typeof(min),x,y) = min(x,y)
function _fast(::typeof(min),x::AbstractFloat, y::AbstractFloat)
    ifelse(isnan(x),
        x,
        ifelse(x < y, x, y))
end
so in general it calls min directly and there is nothing to discuss. But if x and y are floats it uses a simplified version of min which doesn't take into account the ordering of -0.0 and 0.0. To still enforce the ordering of the signed zeros it ends with
# enforce correct order of 0.0 and -0.0
# e.g. maximum([0.0, -0.0]) === 0.0
# should hold
if isbadzero(op, v)
    for i in first:last
        x = @inbounds A[i]
        isgoodzero(op,x) && return x
    end
end
However, the issue with this code is that it doesn't apply the function f to x before the comparison, so it only works correctly for f = identity (or as long as f preserves the sign, I guess). For example we have
julia> minimum(abs, fill(-0.0, 16))
-0.0
julia> minimum(abs.(fill(-0.0, 16)))
0.0
Note that if you give fewer than 16 elements it uses another implementation and you get a correct result (so the example maximum([0.0, -0.0]) === 0.0 in the comment doesn't actually use the code below the comment), which makes it harder to track down this issue.
julia> minimum(abs, fill(-0.0, 15))
0.0
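The root cause is visible already at the level of _fast (again calling the Base internal directly and assuming the definition quoted above): min orders -0.0 before 0.0, while the simplified comparison does not.

julia> min(-0.0, 0.0)
-0.0

julia> Base._fast(min, -0.0, 0.0)
0.0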
From what I can tell there is no simple fix for this which doesn't sacrifice performance. One possibility is to add the application of f to the check for -0.0. It would give correct results but would mean applying f twice on each element, which I don't think is an option. The most "correct" solution would be to avoid using the _fast method and just use min directly. However, this comes at a fairly large performance penalty compared to the current implementation, about a 100% increase in runtime for Float64 in some benchmarks I did. Another, slightly more involved, solution is to special case on f = identity and apply this optimization for that case but not otherwise.
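To make the first of these options concrete, the fix-up loop from the Base code quoted above could apply f before the check, along the lines of this sketch:

# sketch: apply f before checking for a good zero, so the fix-up is also
# correct for f != identity (at the cost of evaluating f a second time)
if isbadzero(op, v)
    for i in first:last
        x = @inbounds A[i]
        fx = f(x)
        isgoodzero(op, fx) && return fx
    end
end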
Personally I would lean towards the correct but slower solution: avoid using _fast and call min directly. But I don't know what kind of performance regressions people are willing to accept.
Lastly, I can also mention that the current version gives a quite noticeable performance penalty if the minimum happens to be 0.0, since the array is traversed twice.
julia> A1 = rand(10^6);
julia> A2 = [0.0; rand(10^6 - 1)];
julia> @btime minimum($A1)
174.130 μs (0 allocations: 0 bytes)
1.9306038734345776e-7
julia> @btime minimum($A2)
412.420 μs (0 allocations: 0 bytes)
0.0
Short circuiting for NaN and missing
The implementation has a short circuit meant for when it encounters NaN or missing. The relevant code is
while simdstop <= last - 3
    # short circuit in case of NaN or missing
    (v1 == v1) === true || return v1
    (v2 == v2) === true || return v2
    (v3 == v3) === true || return v3
    (v4 == v4) === true || return v4
    @inbounds for i in start:4:simdstop
        v1 = _fast(op, v1, f(A[i+0]))
        v2 = _fast(op, v2, f(A[i+1]))
        v3 = _fast(op, v3, f(A[i+2]))
        v4 = _fast(op, v4, f(A[i+3]))
    end
    checkbounds(A, simdstop+3)
    start += chunk_len
    simdstop += chunk_len
end
The benefit of this is that we can avoid looking through the whole vector in some cases. However, it has some issues. The main issue is that it short circuits on BOTH NaN and missing and hence can return NaN in cases where it would be more correct to return missing. Exactly which one is returned is also slightly involved, which makes this case very hard to test for. Some examples:
julia> # Result depends on order of NaN and missing
julia> A1 = vcat([NaN, missing], fill(0.0, 255));
julia> minimum(A1)
NaN
julia> A2 = vcat([missing, NaN], fill(0.0, 255));
julia> minimum(A2)
missing
julia> # Result depends on length of vector
julia> A3 = vcat([0.0, 0.0, missing, 0.0, 0.0, NaN], fill(0.0, 506));
julia> minimum(A3)
missing
julia> A4 = vcat([0.0, 0.0, missing, 0.0, 0.0, NaN], fill(0.0, 507));
julia> minimum(A4)
NaN
Here I think the obvious solution is to just remove this short circuiting. As mentioned in #35939 it seems unlikely that this is in the hot path in many cases. For the vast majority of cases this should not lead to any performance reductions; it might even lead to very slight performance improvements. If one really wants to keep this behaviour then one option could be to only short circuit on missing, in which case the behaviour would be correct (a sketch of this is given below). If one really wants to, I guess it might be possible to check whether the return type of f contains missing and short circuit on NaN if it does not; in that case the check should be updated to use isnan. But as I said, I would opt towards just removing the short circuiting altogether.
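If one did want to keep a short circuit, a missing-only version could simply replace the (v == v) === true checks in the loop quoted above with something like the following sketch (returning early on missing is always correct, since min and max propagate missing):

# short circuit only in case of missing
ismissing(v1) && return missing
ismissing(v2) && return missing
ismissing(v3) && return missing
ismissing(v4) && return missing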
Behaviour for other types
Finally, I wanted to give some observations on the behaviour for other types. The reason I took a look at this function in the first place was due to issues I encountered when developing Arblib.jl.
For AbstractFloat it uses ifelse(isnan(x), x, ifelse(x < y, x, y)) instead of min directly. Except for not handling signed zeros this is equivalent to min for Float64 (and most other AbstractFloat types in Base). For BigFloat it has the benefit of not allocating (which min does). In fact I don't know why min(x::BigFloat, y::BigFloat) is written to allocate a new variable for the result; in contrast, the implementation of min(x::BigInt, y::BigInt) does not allocate a new variable.
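The difference is easy to check with a quick benchmark sketch (assuming BenchmarkTools is available):

using BenchmarkTools

x, y = big"1.0", big"2.0"
@btime min($x, $y)   # allocates a new BigFloat for the result

a, b = big"1", big"2"
@btime min($a, $b)   # returns one of its arguments, no allocation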
My issue with using this approach instead of min directly is that it is not valid for the Arb type that occurs in Arblib.jl. The Arb type can, slightly simplified, be described as a floating point with some extra information keeping track of rounding errors; when taking the minimum these rounding errors have to be taken into account, and this is not done with the above check. Now this might very well be my own fault for subtyping AbstractFloat even though the type doesn't satisfy the assumptions made by Julia, but in many cases it is very natural to treat an Arb as a floating point and see the tracking of the rounding errors as a bonus. In most cases you can then just dispatch on the Arb type when you need some different implementation, but as mentioned in the beginning this is not possible for minimum.
The second observation is related to the current implementation of the short circuiting. It currently assumes that (x == x) === true is false only for NaN and missing. While this is true for all types in base Julia, I don't think it should be assumed for custom types. For example, the Arb type mentioned above doesn't satisfy this assumption (x == x is false if the rounding error is non-zero). It would be better to be more explicit and instead use something along the lines of (x isa Number && isnan(x)) || ismissing(x) as mentioned in #35989; this should also be slightly faster in many cases. Though my stance is still that the short circuiting should be removed entirely.
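As a small illustration of that suggestion (the helper name is just for this example, not proposed API):

# stop the reduction only for values that are genuinely NaN or missing,
# not for every value where x == x happens to be false
shortcircuit(x) = (x isa Number && isnan(x)) || ismissing(x)

shortcircuit(NaN)      # true
shortcircuit(missing)  # true
shortcircuit(1.0)      # false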
What to do now?
I would be happy to prepare a PR to handle some of these issues. But exactly which approach to take depends on how much performance one is willing to sacrifice and how much code complexity one is willing to add. Below are some possible solutions; I can add more benchmark information if it is of interest.
The simplest possible implementation which handles all of the above mentioned issues would be
function mapreduce_impl(
    f,
    op::Union{typeof(max),typeof(min)},
    A::AbstractArrayOrBroadcasted,
    first::Int,
    last::Int,
)
    @inbounds v1 = mapreduce_first(f, op, A[first])
    v2 = v3 = v4 = v1
    @inbounds indices = (first+1):4:(last-3)
    for i = indices
        v1 = op(v1, f(A[i+0]))
        v2 = op(v2, f(A[i+1]))
        v3 = op(v3, f(A[i+2]))
        v4 = op(v4, f(A[i+3]))
    end
    v = op(op(v1, v2), op(v3, v4))
    @inbounds for i = (indices[end]+3):last
        v = op(v, f(A[i]))
    end
    return v
end
The drawbacks of this approach are no short circuiting and worse performance for floating points. For Float64 it is about 100% slower. For BigFloat it is about 300% slower due to min allocating, whereas the previous version didn't allocate. The performance regression for BigFloat I think is acceptable; if you want to make it faster you can use MutableArithmetics.jl. For Float64 I'm not sure whether the performance penalty is okay or not.
A solution which solves some of the issues but doesn't pay much in terms of performance would be to replace the op in the above implementation with the _fast version. This would mean that it doesn't handle signed zeros correctly, but at least it handles NaN and missing correctly. The only performance regression would be the removal of the short circuiting. In this case a note should probably be added to minimum about not handling signed zeros.
Finally, the most complicated solution would be to special case on f = identity, and in that case use _fast and explicitly check for the result being zero at the end, falling back to min if f is not identity.
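A skeleton of what that special casing could look like (the bodies are elided since they are just the two implementations discussed above):

# fast path: f is identity, so the _fast comparisons plus the final
# signed-zero check from the current implementation are valid
function mapreduce_impl(f::typeof(identity), op::Union{typeof(max),typeof(min)},
                        A::AbstractArrayOrBroadcasted, first::Int, last::Int)
    # current implementation using _fast and the isbadzero/isgoodzero fix-up
end

# general f: fall back to calling min/max directly
function mapreduce_impl(f, op::Union{typeof(max),typeof(min)},
                        A::AbstractArrayOrBroadcasted, first::Int, last::Int)
    # the simple implementation shown at the top of this section
end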
None of the solutions keep the short circuiting; this I think is a good trade-off in terms of correctness versus performance. But others might think differently.