Add in-place multiply-add #738


Merged · 10 commits · Apr 17, 2020

Conversation

bjack205
Contributor

This adds support for the new 5-argument in-place matrix multiplication in Julia v1.3 (#357 ).
We can probably consolidate the multiply-add methods with the matrix-multiply methods, but I've left them separate for now.

I've also added a quick "hack" to avoid an invalid pointer error (#734) when doing in-place multiplication with SizedArrays large enough that they get passed to BLAS.

@coveralls

Coverage Status

Coverage increased (+1.0%) to 82.969% when pulling c42c08c on bjack205:master into 7a78efd on JuliaArrays:master.

@andyferris
Member

Certainly seems like a pretty good idea - did you do any benchmarking comparing performance to naive multiply-then-add?

@bjack205
Contributor Author

Yep, here are a few benchmarks.

Test script:

julia> α,β = 2.0, 1.0
julia> N = 3 
julia> a = @MMatrix rand(N,N)
julia> b = @MMatrix rand(N,N)
julia> c = @MMatrix rand(N,N)
julia> A,B,C = Array(a), Array(b), Array(c)
julia> @btime mul!($C, $A, $B, 1.0, 1.0);
julia> @btime mul!($c, $a, $b, 1.0, 1.0);

And the results:

N = 3

| Version | @btime result |
| --- | --- |
| Base Array | 16.470 ns (0 allocations: 0 bytes) |
| Master branch | 11.839 ns (0 allocations: 0 bytes) |
| This version | 8.847 ns (0 allocations: 0 bytes) |

N = 5

| Version | @btime result |
| --- | --- |
| Base Array | 121.191 ns (0 allocations: 0 bytes) |
| Master branch | 182.495 ns (4 allocations: 224 bytes) |
| This version | 22.558 ns (0 allocations: 0 bytes) |

N = 20

| Version | @btime result |
| --- | --- |
| Base Array | 581.967 ns (0 allocations: 0 bytes) |
| Master branch | 4.618 μs (4 allocations: 224 bytes) |
| This version | 575.928 ns (0 allocations: 0 bytes) |

Member

@c42f c42f left a comment

Thanks for writing this!

To be quite honest, I'm a bit worried about the amount of extra code this adds. I do wonder whether we can simply avoid a lot of this complexity by making use of the code we already have for unrolled matrix ops. Keep in mind that for tiny arrays the performance considerations are very different from large ones. For instance, copying often doesn't matter.

With that in mind, I tried the following one-liner:

function mymul!(C, A, B, α, β)
    C .= C.*β .+ α.*A*B
end

On my machine this seems reasonable for N=5 (master branch):

julia> @btime mul!($C, $A, $B, 1.0, 1.0);
  160.097 ns (0 allocations: 0 bytes)

julia> @btime mul!($c, $a, $b, 1.0, 1.0);
  267.894 ns (4 allocations: 224 bytes)

julia> @btime mymul!($c, $a, $b, 1.0, 1.0);
  42.355 ns (0 allocations: 0 bytes)

Please do try out a few simple things like this and see whether you can get competitive results. Adding large swaths of @generated code like this really does make it hard for us to maintain this library in the long term, so we are strongly biased toward simple solutions ;-)

Regardless of that, thanks a lot for what you've done so far! I hope we can iterate toward a solution which is both relatively short and high performance.

Sa::Size{sa}, Sb::Size{sb},
a::Transpose{<:Any, <:StaticMatrix}, b::StaticMatrix{<:Any, <:Any, T},
α::Real=one(T), β::Real=zero(T)) where {s,sa,sb, T <: BlasFloat} =
BLAS.gemm!('T','N', α, a.parent, b, β , c)
Member

This is rather a lot of duplication, which will be hard to maintain.

I think all you need is a utility function to "get the data pointer and transposed-ness" out of a, b, c? Then you can collapse all these down again into one function.
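A sketch of the kind of helper this suggests (the names `blas_arg` and `gemm_mul!` are hypothetical, not from the PR, and the sketch assumes real BLAS eltypes, where adjoint and transpose coincide):

```julia
using LinearAlgebra: Transpose, Adjoint, BLAS

# Hypothetical helper: recover the BLAS transpose character and the parent
# array from a possibly-wrapped matrix. Assumes real eltypes, so the adjoint
# maps to 'T' rather than 'C'.
blas_arg(a::AbstractMatrix) = ('N', a)
blas_arg(a::Transpose{<:Real, <:AbstractMatrix}) = ('T', a.parent)
blas_arg(a::Adjoint{<:Real, <:AbstractMatrix}) = ('T', a.parent)

# One gemm! call then covers the plain/transposed/adjoint combinations.
function gemm_mul!(c, a, b, α, β)
    ta, pa = blas_arg(a)
    tb, pb = blas_arg(b)
    BLAS.gemm!(ta, tb, α, pa, pb, β, c)
end
```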

@inline mul!(dest::StaticVector, A::Transpose{<:Any, <:StaticMatrix}, B::StaticVector, α::Real, β::Real) =
_tmul!(Size(dest), dest, Size(A.parent), Size(B), A.parent, B, α, β)
@inline mul!(dest::StaticMatrix, A::Transpose{<:Any, <:StaticMatrix}, B::StaticMatrix, α::Real, β::Real) =
_mul!(Size(dest), dest, Size(A), Size(B), A, B, α, β)
Member

Is it possible to arrange this dispatch a bit more cleanly rather than having quite so much code duplication here?

For example, you can define a type alias similar to StaticMatrixLike which captures both StaticMatrix, Transpose (and possibly Adjoint).

Then a single overload of mul! may be sufficient.
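For illustration, such an alias could look like the following sketch (`MyStaticMatrixLike` is a hypothetical name, not the definition that ended up in StaticArrays):

```julia
using StaticArrays, LinearAlgebra

# Hypothetical union covering a static matrix and its lazy wrappers, so a
# single mul! method signature can match all three forms.
const MyStaticMatrixLike{T} = Union{
    StaticMatrix{<:Any, <:Any, T},
    Transpose{T, <:StaticMatrix{<:Any, <:Any, T}},
    Adjoint{T, <:StaticMatrix{<:Any, <:Any, T}},
}
```

A single `mul!(dest, A::MyStaticMatrixLike, B::MyStaticMatrixLike, α, β)` overload could then dispatch internally on the wrapper type.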

@bjack205
Contributor Author

bjack205 commented Mar 6, 2020

Thanks for the feedback! This is my first PR so I certainly have a lot to learn!

I've reworked my solution and made it shorter and hopefully more along the lines of what you're looking for. I've merged the 5-argument mul! with the 3-argument mul! and unified the way transposes are handled.

The transposes are particularly tricky since they tend to allocate at the slightest provocation. My solution avoids this allocation with a function barrier and instead caches a transpose flag in the type (see TSize in matrix_multiply_add.jl). This ended up being a fairly straightforward solution, IMHO.

The benchmarks are pretty much the same as they were before.
The three-argument multiplication doesn't incur any performance hit for small matrices, but for the medium-sized "chunked" version, my implementation runs a little slower.
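As an illustration of the idea (a simplified stand-in, not the actual `TSize` from the PR), the transpose flag can be lifted into a type parameter so it is known at compile time once the wrapper is unwrapped:

```julia
using StaticArrays, LinearAlgebra

# Simplified stand-in for TSize: the size and a transpose flag live in the
# type parameters. After unwrapping the Transpose behind a function barrier,
# only the parent array and this zero-size flag type are passed around.
struct FlagSize{S, Trans} end

flagsize(::StaticMatrix{M, N}) where {M, N} = FlagSize{(M, N), false}()
flagsize(::Transpose{<:Any, <:StaticMatrix{M, N}}) where {M, N} = FlagSize{(N, M), true}()

istranspose(::FlagSize{S, Trans}) where {S, Trans} = Trans
```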

@bjack205
Contributor Author

bjack205 commented Mar 6, 2020

I missed a test case that happens to be breaking pretty much everything. All my tests passed in magic numbers for alpha and beta, but when they're variables the MulAddMul allocates (due to type instability). I've tried a few things but can't figure out how to recover from the type instability. I'm not familiar enough with Julia to really understand why my implementation doesn't work while the one in LinearAlgebra/matmul.jl works, since they seem fairly similar to me. I'm open to any ideas on how to fix this, because otherwise this attempt isn't worth much.
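For context, my reading of the instability (based on LinearAlgebra's internal MulAddMul): it stores `isone(α)` and `iszero(β)` as type parameters, so its concrete type depends on runtime values:

```julia
using LinearAlgebra: MulAddMul

# MulAddMul{ais1, bis0, TA, TB} encodes isone(α) and iszero(β) in the type,
# so different runtime values give different concrete types:
T1 = typeof(MulAddMul(1.0, 0.0))  # ais1 = true,  bis0 = true
T2 = typeof(MulAddMul(2.0, 1.0))  # ais1 = false, bis0 = false
@assert T1 != T2

# Therefore f(α, β) = MulAddMul(α, β) is not inferrable for runtime Float64
# inputs. With literal arguments (e.g. mul!(C, A, B, 1.0, 1.0)) constant
# propagation recovers a concrete type, which would explain why tests with
# "magic numbers" pass while variable α, β allocate.
```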

@bjack205
Contributor Author

bjack205 commented Mar 7, 2020

Okay, I think it should be pretty close with the most recent commit. I hard-coded the MulAddMul for both the 5-argument and 3-argument versions to avoid type instability and resultant allocations.

NOTE THAT USING MulAddMul IS A BREAKING CHANGE, since this is a feature in Julia v1.3. I don't see this as a problem, since the 5-argument mul! was only made available in v1.3 anyway.

I also fixed the performance decline for the chunked unrolling, so the new version has improved or equivalent performance on all sizes.

I guess I should also give some motivation for why I think these methods are important. In my package TrajectoryOptimization.jl I've been battling getting the core algorithm to not allocate any memory. Normal SArrays work great, but we found that compilation times went through the roof when we tried larger problems. To my knowledge (which, admittedly, isn't all that much when it comes to Julia), SizedArray with mul! is the only solution that works nearly as fast as SArrays at small sizes (thanks to loop unrolling!), scales well to large arrays (thanks to BLAS), and doesn't incur any memory allocations.

Newest benchmarks:
test script:

using StaticArrays, LinearAlgebra, BenchmarkTools

function benchmark_matmul(N1,N2,ArrayType=MArray)
    if ArrayType <: MArray
        Mat = MMatrix
        Vec = MVector
    elseif ArrayType <: SizedArray
        Mat = SizedMatrix
        Vec = SizedVector
    end
    α,β = 1.0, 1.0
    A = rand(Mat{N1,N2})
    B = rand(Mat{N2,N2})
    C = rand(Mat{N1,N2})
    println("C = A*B")
    @btime mul!($C,$A,$B)
    println("C += A*B")
    @btime mul!($C,$A,$B,$α,$β)
    println("B = A'C")
    @btime mul!($B,Transpose($A),$C)
    println("B += A'C")
    @btime mul!($B,Transpose($A),$C,$α,$β)
end

MArray

| Size | branch | C = A*B | C += A*B | B = A'C | B += A'C |
| --- | --- | --- | --- | --- | --- |
| 3x3 | master | 6.6 ns (0 allocs) | 193.4 ns (3 allocs) | 9.2 ns (0 allocs) | 193.8 ns (3 allocs) |
| 3x3 | new | 6.7 ns (0 allocs) | 8.1 ns (0 allocs) | 6.6 ns (0 allocs) | 8.1 ns (0 allocs) |
| 10x10 | master | 123.3 ns (0 allocs) | 889.5 ns (7 allocs) | 489.6 ns (0 allocs) | 683.9 ns (3 allocs) |
| 10x10 | new | 124.1 ns (0 allocs) | 134.1 ns (0 allocs) | 196.9 ns (0 allocs) | 207.2 ns (3 allocs) |
| 20x20 | master | 613.7 ns (0 allocs) | 4.83 μs (7 allocs) | 3.76 μs (0 allocs) | 4.21 μs (3 allocs) |
| 20x20 | new | 613.7 ns (0 allocs) | 580.9 ns (0 allocs) | 613.4 ns (0 allocs) | 591 ns (3 allocs) |

SizedArray

| Size | branch | C = A*B | C += A*B | B = A'C | B += A'C |
| --- | --- | --- | --- | --- | --- |
| 3x3 | master | 10 ns (0 allocs) | 195 ns (3 allocs) | 10 ns (0 allocs) | 194 ns (0 allocs) |
| 3x3 | new | 10 ns (0 allocs) | 11 ns (0 allocs) | 11 ns (0 allocs) | 11 ns (0 allocs) |
| 10x10 | master | 119 ns (0 allocs) | 905 ns (7 allocs) | 406 ns (0 allocs) | 680 ns (3 allocs) |
| 10x10 | new | 125 ns (0 allocs) | 129 ns (0 allocs) | 195 ns (0 allocs) | 199 ns (0 allocs) |
| 20x20 | master | error | error | error | error |
| 20x20 | new | 617 ns (0 allocs) | 581 ns (0 allocs) | 612 ns (0 allocs) | 576 ns (0 allocs) |

@c42f c42f added feature features and feature requests linear-algebra labels Mar 17, 2020
@bjack205
Contributor Author

I fixed the backwards-compatibility problem. Are there further changes that need to be made?

a::StaticMatrix, b::StaticMatrix, _add::MulAddMul) =
mul_blas!(transpose(Sc), c, transpose(Sb), transpose(Sa), b, a, _add)

# TODO: Get this version of mul_blas! working so there's backward-compatibility with older Julia versions
Member

Is this TODO still needed?

Contributor Author

Nope. Fixed, and deleted the old code.

@test b ≈ 5A'c

bmark = @benchmark mul!($c,$A,$b) samples=10 evals=10
@test minimum(bmark).allocs == 0
Member

Is it possible to use @allocated or does that not work properly? (I know we've had some problems with it lately related to changes in Base.)

Contributor Author

I couldn't get @allocated to report zero allocations. I'm guessing @benchmark will be more accurate? I've added commented-out lines using @test_noalloc below each test, as well as the allocations recorded.

@c42f
Member

c42f commented Apr 14, 2020

Oops it looks like I caused a conflict by merging some unrelated test tools. I merged into your branch which will hopefully fix that up.

Generally this is looking quite good now (a lot less scary in terms of amount of code added!)

Great job with the table of benchmarks; there are some really impressive improvements there. Could you add the benchmarks to the benchmark directory so that we retain a copy of them and can run them in the future? You can see the results of comparing benchmarks across versions in the GitHub Actions "Benchmark" job. There are other examples there of how to hook them up.

# println("B = A'C + B")
# @btime mul!($B,Transpose($A),$C,$α,$β)
# end
# benchmark_matmul(20,20,SizedArray)
Member

Let's move this into the benchmarks directory so it'll be kept up to date.

Contributor Author

Done.

@c42f c42f added this to the v0.12.2 milestone Apr 14, 2020
@c42f c42f added the performance runtime performance label Apr 14, 2020
$ex
@test(@allocated($ex) == 0)
end)
end
Member

If you've found @benchmark reliable for this, let's just make test_noalloc use that instead? Something like:

macro test_noalloc(ex)
    quote
        bmark = @benchmark $(esc(ex)) samples=10 evals=10
        @test minimum(bmark).allocs == 0
    end
end

You could put this in test_util as well and we can use it everywhere instead of @allocated? Couple of questions:

  • We don't actually want to do a benchmark here, so it seems a bit heavyweight. We can put up with that, though, unless it slows down the tests a lot?
  • Can you reduce the samples and evals?
  • Is the minimum there required, or will anything other than the maximum do? (Are we measuring the right thing?)

Member

Ugh, there's something really broken about the way Julia or BenchmarkTools processes the correct-seeming use of esc in the nested macro above.

Also I didn't realize quite what minimum does; how about this instead:

macro test_noalloc(ex)
    ex = :(@test minimum(@benchmark($ex, samples=10, evals=10)).allocs == 0)
    ex.args[2] = __source__  # trick: make `@test` aware of the source location of `@test_noalloc`
    esc(ex)
end

Member

@test_noalloc macro sounds like a good idea. We can swap @benchmark with something else if these tests slow down the whole test suite a lot. My guess is that it'd be OK as long as you have something like samples=10 evals=10.

Is the minimum there required

bmark is the Trial whose .allocs field is already an Int:

https://github.com/JuliaCI/BenchmarkTools.jl/blob/cf7cada0b3f41a00c70ef03891ffae29c31f8573/src/trials.jl#L5-L11

It seems that Base.push!(t::Trial, time, gctime, memory, allocs) (which is used to accumulate the benchmark trial results) already takes maximum over all the trials:

https://github.com/JuliaCI/BenchmarkTools.jl/blob/cf7cada0b3f41a00c70ef03891ffae29c31f8573/src/trials.jl#L29

(Also, minimum is the minimum over the wall times. So I don't think we need it anyway.)

Contributor Author

Your suggested test_noalloc(ex) macro didn't work for me? When I use it within a function (not in global scope) it can't find the local arguments.

Member

Try putting a $ in front of the local arguments (it's a feature of @benchmark to make sure that it captures them properly).
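That is, something along these lines (a sketch, assuming BenchmarkTools and StaticArrays are loaded; `allocs_in_mul` is an illustrative name):

```julia
using BenchmarkTools, LinearAlgebra, StaticArrays

function allocs_in_mul(c, a, b)
    # Without `$`, c/a/b would be looked up as untyped globals inside the
    # benchmark body, polluting the timing and allocation counts; `$`
    # interpolates the local values in as constants.
    bmark = @benchmark mul!($c, $a, $b) samples=10 evals=10
    minimum(bmark).allocs
end
```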

Member

Your suggested test_noalloc(ex) macro didn't work for me?

Andy's suggestion seems to work. But oh wow, I had a look at BenchmarkTools.generate_benchmark_definition, and it's kind of a horror show 😬 It calls Core.eval in the macro and really seems to go out of its way to subvert the usual macro-expansion rules in a weird way.

Maybe it has to be that way for some reason, but it sure is doing some weird stuff.

Member

If it's too hard to wrap @benchmark reliably I'm completely ok with just having it written out by hand in a few places.

I tried looking more into why @allocated is failing but it quickly led to a very deep rabbit hole which I'm still trying to climb out of.

@c42f c42f merged commit 10c001d into JuliaArrays:master Apr 17, 2020
@c42f
Member

c42f commented Apr 17, 2020

I went ahead and merged this, we can sort the @allocated mysteries out another time.

Thanks for a very nice contribution!
