Add in-place multiply-add #738


Merged · 10 commits · Apr 17, 2020

Conversation

bjack205
Contributor

This adds support for the new 5-argument in-place matrix multiplication in Julia v1.3 (#357 ).
We can probably consolidate the multiply-add methods with the matrix-multiply methods, but I've left them separate for now.

I've also added a quick "hack" to avoid an invalid pointer error (#734) when doing in-place multiplication with SizedArrays large enough that they get passed to BLAS.

@coveralls

Coverage Status

Coverage increased (+1.0%) to 82.969% when pulling c42c08c on bjack205:master into 7a78efd on JuliaArrays:master.

@andyferris
Member

Certainly seems like a pretty good idea - did you do any benchmarking comparing performance to naive multiply-then-add?

@bjack205
Contributor Author

Yep, here are a few benchmarks.

Test script:

julia> α,β = 2.0, 1.0
julia> N = 3 
julia> a = @MMatrix rand(N,N)
julia> b = @MMatrix rand(N,N)
julia> c = @MMatrix rand(N,N)
julia> A,B,C = Array(a), Array(b), Array(c)
julia> @btime mul!($C, $A, $B, 1.0, 1.0);
julia> @btime mul!($c, $a, $b, 1.0, 1.0);

And the results:

N = 3

| Version | @btime result |
| --- | --- |
| Base Array | 16.470 ns (0 allocations: 0 bytes) |
| Master branch | 11.839 ns (0 allocations: 0 bytes) |
| This version | 8.847 ns (0 allocations: 0 bytes) |

N = 5

| Version | @btime result |
| --- | --- |
| Base Array | 121.191 ns (0 allocations: 0 bytes) |
| Master branch | 182.495 ns (4 allocations: 224 bytes) |
| This version | 22.558 ns (0 allocations: 0 bytes) |

N = 20

| Version | @btime result |
| --- | --- |
| Base Array | 581.967 ns (0 allocations: 0 bytes) |
| Master branch | 4.618 μs (4 allocations: 224 bytes) |
| This version | 575.928 ns (0 allocations: 0 bytes) |

Member

@c42f c42f left a comment

Thanks for writing this!

To be quite honest, I'm a bit worried about the amount of extra code this adds. I do wonder whether we can simply avoid a lot of this complexity by making use of the code we already have for unrolled matrix ops. Keep in mind that for tiny arrays the performance considerations are very different from large ones. For instance, copying often doesn't matter.

With that in mind, I tried the following one-liner:

function mymul!(C, A, B, α, β)
    C .= C.*β .+ α.*A*B
end

On my machine this seems reasonable for N=5 (master branch):

julia> @btime mul!($C, $A, $B, 1.0, 1.0);
  160.097 ns (0 allocations: 0 bytes)

julia> @btime mul!($c, $a, $b, 1.0, 1.0);
  267.894 ns (4 allocations: 224 bytes)

julia> @btime mymul!($c, $a, $b, 1.0, 1.0);
  42.355 ns (0 allocations: 0 bytes)

Please do try out a few simple things like this and see whether you can get competitive results. Adding large swaths of @generated code like this really does make it hard for us to maintain this library in the long term, so we are strongly biased toward simple solutions ;-)

Regardless of that, thanks a lot for what you've done so far! I hope we can iterate toward a solution which is both relatively short and high performance.

Sa::Size{sa}, Sb::Size{sb},
a::Transpose{<:Any, <:StaticMatrix}, b::StaticMatrix{<:Any, <:Any, T},
α::Real=one(T), β::Real=zero(T)) where {s,sa,sb, T <: BlasFloat} =
BLAS.gemm!('T','N', α, a.parent, b, β , c)
Member

This is rather a lot of duplication, which will be hard to maintain.

I think all you need is a utility function to "get the data pointer and transposed-ness" out of a, b, c? Then you can collapse all these down again into one function.
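A sketch of the kind of helper this suggests (the names `blas_arg` and `gemm_mul!` are hypothetical, not from the PR, and the sketch assumes real BLAS eltypes, where adjoint and transpose coincide):

```julia
using LinearAlgebra: Transpose, Adjoint, BLAS

# Hypothetical helper: recover the BLAS transpose character and the parent
# array from a possibly-wrapped matrix. Assumes real eltypes, so the adjoint
# maps to 'T' rather than 'C'.
blas_arg(a::AbstractMatrix) = ('N', a)
blas_arg(a::Transpose{<:Real, <:AbstractMatrix}) = ('T', a.parent)
blas_arg(a::Adjoint{<:Real, <:AbstractMatrix}) = ('T', a.parent)

# One gemm! call then covers the plain/transposed/adjoint combinations.
function gemm_mul!(c, a, b, α, β)
    ta, pa = blas_arg(a)
    tb, pb = blas_arg(b)
    BLAS.gemm!(ta, tb, α, pa, pb, β, c)
end
```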

@inline mul!(dest::StaticVector, A::Transpose{<:Any, <:StaticMatrix}, B::StaticVector, α::Real, β::Real) =
_tmul!(Size(dest), dest, Size(A.parent), Size(B), A.parent, B, α, β)
@inline mul!(dest::StaticMatrix, A::Transpose{<:Any, <:StaticMatrix}, B::StaticMatrix, α::Real, β::Real) =
_mul!(Size(dest), dest, Size(A), Size(B), A, B, α, β)
Member

Is it possible to arrange this dispatch a bit more cleanly rather than having quite so much code duplication here?

For example, you can define a type alias similar to StaticMatrixLike which captures both StaticMatrix, Transpose (and possibly Adjoint).

Then a single overload of mul! may be sufficient.
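For illustration, such an alias could look like the following sketch (`MyStaticMatrixLike` is a hypothetical name, not the definition that ended up in StaticArrays):

```julia
using StaticArrays, LinearAlgebra

# Hypothetical union covering a static matrix and its lazy wrappers, so a
# single mul! method signature can match all three forms.
const MyStaticMatrixLike{T} = Union{
    StaticMatrix{<:Any, <:Any, T},
    Transpose{T, <:StaticMatrix{<:Any, <:Any, T}},
    Adjoint{T, <:StaticMatrix{<:Any, <:Any, T}},
}
```

A single `mul!(dest, A::MyStaticMatrixLike, B::MyStaticMatrixLike, α, β)` overload could then dispatch internally on the wrapper type.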

@bjack205
Contributor Author

bjack205 commented Mar 6, 2020

Thanks for the feedback! This is my first PR so I certainly have a lot to learn!

I've reworked my solution and made it shorter and hopefully more along the lines of what you're looking for. I've merged the 5-argument mul! with the 3-argument mul! and unified the way transposes are handled.

The transposes are particularly tricky since they tend to allocate at the slightest provocation. My solution avoids this allocation with a function barrier and instead caches a transpose flag in the type (see TSize in matrix_multiply_add.jl). This ended up being a fairly straightforward solution, IMHO.

The benchmarks are pretty much the same as they were before.
The three-argument multiplication doesn't incur any performance hit for small matrices, but for the medium-sized "chunked" version, my implementation runs a little slower.
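As an illustration of the idea (a simplified stand-in, not the actual `TSize` from the PR), the transpose flag can be lifted into a type parameter so it is known at compile time once the wrapper is unwrapped:

```julia
using StaticArrays, LinearAlgebra

# Simplified stand-in for TSize: the size and a transpose flag live in the
# type parameters. After unwrapping the Transpose behind a function barrier,
# only the parent array and this zero-size flag type are passed around.
struct FlagSize{S, Trans} end

flagsize(::StaticMatrix{M, N}) where {M, N} = FlagSize{(M, N), false}()
flagsize(::Transpose{<:Any, <:StaticMatrix{M, N}}) where {M, N} = FlagSize{(N, M), true}()

istranspose(::FlagSize{S, Trans}) where {S, Trans} = Trans
```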

@bjack205
Contributor Author

bjack205 commented Mar 6, 2020

I missed a test case that happens to be breaking pretty much everything. All my tests passed in magic numbers for alpha and beta, but when they're variables the MulAddMul allocates (due to type instability). I've tried a few things but can't figure out how to recover from the type instability. I'm not familiar enough with Julia to really understand why my implementation doesn't work while the one in LinearAlgebra/matmul.jl works, since they seem fairly similar to me. I'm open to any ideas on how to fix this, because otherwise this attempt isn't worth much.
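For context, my reading of the instability (based on LinearAlgebra's internal MulAddMul): it stores `isone(α)` and `iszero(β)` as type parameters, so its concrete type depends on runtime values:

```julia
using LinearAlgebra: MulAddMul

# MulAddMul{ais1, bis0, TA, TB} encodes isone(α) and iszero(β) in the type,
# so different runtime values give different concrete types:
T1 = typeof(MulAddMul(1.0, 0.0))  # ais1 = true,  bis0 = true
T2 = typeof(MulAddMul(2.0, 1.0))  # ais1 = false, bis0 = false
@assert T1 != T2

# Therefore f(α, β) = MulAddMul(α, β) is not inferrable for runtime Float64
# inputs. With literal arguments (e.g. mul!(C, A, B, 1.0, 1.0)) constant
# propagation recovers a concrete type, which would explain why tests with
# "magic numbers" pass while variable α, β allocate.
```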

@bjack205
Contributor Author

bjack205 commented Mar 7, 2020

Okay, I think it should be pretty close with the most recent commit. I hard-coded the MulAddMul for both the 5-argument and 3-argument versions to avoid type instability and resultant allocations.

NOTE THAT USING MulAddMul IS A BREAKING CHANGE, since this is a feature in Julia v1.3. I don't see this as a problem, since the 5-argument mul! was only made available in v1.3 anyway.

I also fixed the performance decline for the chunked unrolling, so the new version has improved or equivalent performance on all sizes.

I guess I should also give some motivation for why I think these methods are important. In my package TrajectoryOptimization.jl I've been battling getting the core algorithm to not allocate any memory. Normal SArrays work great, but we found that compilation times went through the roof when we tried larger problems. To my knowledge (which, admittedly, isn't all that much when it comes to Julia), SizedArray with mul! is the only solution that works nearly as fast as SArrays at small sizes (thanks to loop unrolling!), scales well to large arrays (thanks to BLAS), and doesn't incur any memory allocations.

Newest benchmarks:
test script:

using StaticArrays, LinearAlgebra, BenchmarkTools

function benchmark_matmul(N1,N2,ArrayType=MArray)
    if ArrayType <: MArray
        Mat = MMatrix
        Vec = MVector
    elseif ArrayType <: SizedArray
        Mat = SizedMatrix
        Vec = SizedVector
    end
    α,β = 1.0, 1.0
    A = rand(Mat{N1,N2})
    B = rand(Mat{N2,N2})
    C = rand(Mat{N1,N2})
    println("C = A*B")
    @btime mul!($C,$A,$B)
    println("C += A*B")
    @btime mul!($C,$A,$B,$α,$β)
    println("B = A'C")
    @btime mul!($B,Transpose($A),$C)
    println("B += A'C")
    @btime mul!($B,Transpose($A),$C,$α,$β)
end

MArray

| Size | branch | C = A*B | C += A*B | B = A'C | B += A'C |
| --- | --- | --- | --- | --- | --- |
| 3x3 | master | 6.6 ns (0 allocs) | 193.4 ns (3 allocs) | 9.2 ns (0 allocs) | 193.8 ns (3 allocs) |
| 3x3 | new | 6.7 ns (0 allocs) | 8.1 ns (0 allocs) | 6.6 ns (0 allocs) | 8.1 ns (0 allocs) |
| 10x10 | master | 123.3 ns (0 allocs) | 889.5 ns (7 allocs) | 489.6 ns (0 allocs) | 683.9 ns (3 allocs) |
| 10x10 | new | 124.1 ns (0 allocs) | 134.1 ns (0 allocs) | 196.9 ns (0 allocs) | 207.2 ns (3 allocs) |
| 20x20 | master | 613.7 ns (0 allocs) | 4.83 μs (7 allocs) | 3.76 μs (0 allocs) | 4.21 μs (3 allocs) |
| 20x20 | new | 613.7 ns (0 allocs) | 580.9 ns (0 allocs) | 613.4 ns (0 allocs) | 591 ns (3 allocs) |

SizedArray

| Size | branch | C = A*B | C += A*B | B = A'C | B += A'C |
| --- | --- | --- | --- | --- | --- |
| 3x3 | master | 10 ns (0 allocs) | 195 ns (3 allocs) | 10 ns (0 allocs) | 194 ns (0 allocs) |
| 3x3 | new | 10 ns (0 allocs) | 11 ns (0 allocs) | 11 ns (0 allocs) | 11 ns (0 allocs) |
| 10x10 | master | 119 ns (0 allocs) | 905 ns (7 allocs) | 406 ns (0 allocs) | 680 ns (3 allocs) |
| 10x10 | new | 125 ns (0 allocs) | 129 ns (0 allocs) | 195 ns (0 allocs) | 199 ns (0 allocs) |
| 20x20 | master | error | error | error | error |
| 20x20 | new | 617 ns (0 allocs) | 581 ns (0 allocs) | 612 ns (0 allocs) | 576 ns (0 allocs) |

@c42f c42f added feature features and feature requests linear-algebra labels Mar 17, 2020
@bjack205
Contributor Author

I fixed the backwards-compatibility problem. Are there further changes that need to be made?

a::StaticMatrix, b::StaticMatrix, _add::MulAddMul) =
mul_blas!(transpose(Sc), c, transpose(Sb), transpose(Sa), b, a, _add)

# TODO: Get this version of mul_blas! working so there's backward-compatibility with older Julia versions
Member

Is this TODO still needed?

Contributor Author

Nope. Fixed, and deleted the old code.

@test b ≈ 5A'c

bmark = @benchmark mul!($c,$A,$b) samples=10 evals=10
@test minimum(bmark).allocs == 0
Member

Is it possible to use @allocated or does that not work properly? (I know we've had some problems with it lately related to changes in Base.)

Contributor Author

I couldn't get @allocated to report zero allocations. I'm guessing @benchmark will be more accurate? I've added commented-out lines using @test_noalloc below each test, as well as the allocations recorded.

@c42f
Member

c42f commented Apr 14, 2020

Oops it looks like I caused a conflict by merging some unrelated test tools. I merged into your branch which will hopefully fix that up.

Generally this is looking quite good now (a lot less scary in terms of amount of code added!)

Great job with the table of benchmarks; there are some really impressive improvements there. Could you add the benchmarks to the benchmark directory so that we retain a copy of them and can run them in the future? You can see the results of comparing benchmarks across versions in the GitHub Actions "Benchmark" job. There are other examples there of how to hook them up.

# println("B = A'C + B")
# @btime mul!($B,Transpose($A),$C,$α,$β)
# end
# benchmark_matmul(20,20,SizedArray)
Member

Let's move this into the benchmarks directory so it'll be kept up to date.

Contributor Author

Done.

@c42f c42f added this to the v0.12.2 milestone Apr 14, 2020
@c42f c42f added the performance runtime performance label Apr 14, 2020
$ex
@test(@allocated($ex) == 0)
end)
end
Member

If you've found @benchmark reliable for this, let's just make test_noalloc use that instead? Something like:

macro test_noalloc(ex)
    quote
        bmark = @benchmark $(esc(ex)) samples=10 evals=10
        @test minimum(bmark).allocs == 0
    end
end

You could put this in test_util as well and we can use it everywhere instead of @allocated? Couple of questions:

  • We don't actually want to do a benchmark here, so it seems a bit heavyweight. We can put up with that, though, unless it slows down the tests a lot?
  • Can you reduce the samples and evals?
  • Is the minimum there required, or will anything other than the maximum do? (Are we measuring the right thing?)

Member

Ugh, there's something really broken about the way Julia or BenchmarkTools processes the correct-seeming use of esc in the nested macro above.

Also I didn't realize quite what minimum does; how about this instead:

macro test_noalloc(ex)
    ex = :(@test minimum(@benchmark($ex, samples=10, evals=10)).allocs == 0)
    ex.args[2] = __source__  # trick: make `@test` aware of the source location of `@test_noalloc`
    esc(ex)
end

Member

@test_noalloc macro sounds like a good idea. We can swap @benchmark with something else if these tests slow down the whole test suite a lot. My guess is that it'd be OK as long as you have something like samples=10 evals=10.

Is the minimum there required

bmark is the Trial whose .allocs field is already an Int:

https://github.com/JuliaCI/BenchmarkTools.jl/blob/cf7cada0b3f41a00c70ef03891ffae29c31f8573/src/trials.jl#L5-L11

It seems that Base.push!(t::Trial, time, gctime, memory, allocs) (which is used to accumulate the benchmark trial results) already takes maximum over all the trials:

https://github.com/JuliaCI/BenchmarkTools.jl/blob/cf7cada0b3f41a00c70ef03891ffae29c31f8573/src/trials.jl#L29

(Also, minimum is the minimum over the wall times. So I don't think we need it anyway.)

Contributor Author

Your suggested test_noalloc(ex) macro didn't work for me? When I use it within a function (not in global scope) it can't find the local arguments.

Member

Try putting a $ in front of the local arguments (it's a feature of @benchmark to make sure that it captures them properly).
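That is, something along these lines (a sketch, assuming BenchmarkTools and StaticArrays are loaded; `allocs_in_mul` is an illustrative name):

```julia
using BenchmarkTools, LinearAlgebra, StaticArrays

function allocs_in_mul(c, a, b)
    # Without `$`, c/a/b would be looked up as untyped globals inside the
    # benchmark body, polluting the timing and allocation counts; `$`
    # interpolates the local values in as constants.
    bmark = @benchmark mul!($c, $a, $b) samples=10 evals=10
    minimum(bmark).allocs
end
```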

Member

Your suggested test_noalloc(ex) macro didn't work for me?

Andy's suggestion seems to work. But oh wow, I had a look at BenchmarkTools.generate_benchmark_definition, and it's kind of a horror show 😬 It calls Core.eval in the macro and really seems to go out of its way to subvert the usual macro-expansion rules in a weird way.

Maybe it has to be that way for some reason, but it sure is doing some weird stuff.

Member

If it's too hard to wrap @benchmark reliably I'm completely ok with just having it written out by hand in a few places.

I tried looking more into why @allocated is failing but it quickly led to a very deep rabbit hole which I'm still trying to climb out of.

@c42f c42f merged commit 10c001d into JuliaArrays:master Apr 17, 2020
@c42f
Member

c42f commented Apr 17, 2020

I went ahead and merged this, we can sort the @allocated mysteries out another time.

Thanks for a very nice contribution!
