Add in-place multiply-add #738
Certainly seems like a pretty good idea - did you do any benchmarking comparing performance to naive multiply-then-add?
Yep, here's a few benchmarks.

Test script:

```julia
julia> α, β = 2.0, 1.0

julia> N = 3

julia> a = @MMatrix rand(N,N)

julia> b = @MMatrix rand(N,N)

julia> c = @MMatrix rand(N,N)

julia> A, B, C = Array(a), Array(b), Array(c)

julia> @btime mul!($C, $A, $B, 1.0, 1.0);

julia> @btime mul!($c, $a, $b, 1.0, 1.0);
```

And the results:

N = 3

N = 5

N = 20
Thanks for writing this!
To be quite honest, I'm a bit worried about the amount of extra code this adds. I do wonder whether we can simply avoid a lot of this complexity by making use of the code we already have for unrolled matrix ops. Keep in mind that for tiny arrays the performance considerations are very different from large ones. For instance, copying often doesn't matter.
With that in mind, I tried the following one-liner:

```julia
function mymul!(C, A, B, α, β)
    C .= C.*β .+ α.*A*B
end
```

On my machine this seems reasonable for N=5 (master branch):

```julia
julia> @btime mul!($C, $A, $B, 1.0, 1.0);
  160.097 ns (0 allocations: 0 bytes)

julia> @btime mul!($c, $a, $b, 1.0, 1.0);
  267.894 ns (4 allocations: 224 bytes)

julia> @btime mymul!($c, $a, $b, 1.0, 1.0);
  42.355 ns (0 allocations: 0 bytes)
```

Please do try out a few simple things like this and see whether you can get competitive results. Adding large swathes of `@generated` code like this really does make it hard for us to maintain this library in the long term, so we are strongly biased toward simple solutions ;-)
Regardless of that, thanks a lot for what you've done so far! I hope we can iterate toward a solution which is both relatively short and high performance.
src/matrix_multiply.jl (Outdated)

```julia
        Sa::Size{sa}, Sb::Size{sb},
        a::Transpose{<:Any, <:StaticMatrix}, b::StaticMatrix{<:Any, <:Any, T},
        α::Real=one(T), β::Real=zero(T)) where {s,sa,sb, T <: BlasFloat} =
    BLAS.gemm!('T','N', α, a.parent, b, β, c)
```
This is rather a lot of duplication which will be hard to maintain. I think all you need is a utility function to "get the data pointer and transposed-ness" out of `a`, `b`, and `c`? Then you can collapse all these down again into one function.
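The suggested utility might look something like the following sketch. The name `blas_arg` is purely illustrative (it is not part of the PR), and the sketch is written against Base's `AbstractMatrix` so it runs standalone; the real version would target `StaticMatrix`:

```julia
using LinearAlgebra

# Hypothetical helper: collapse a matrix or its lazy Transpose/Adjoint wrapper
# into (underlying storage, BLAS transpose character), so a single gemm! call
# site can serve every wrapper combination.
blas_arg(a::AbstractMatrix) = (a, 'N')
blas_arg(a::Transpose{<:Any, <:AbstractMatrix}) = (parent(a), 'T')
blas_arg(a::Adjoint{<:Real, <:AbstractMatrix}) = (parent(a), 'T')  # real adjoint == transpose
```

With this, the many near-identical `gemm!`-dispatching methods could collapse into one that calls `blas_arg` on each operand.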
src/matrix_multiply_add.jl (Outdated)

```julia
@inline mul!(dest::StaticVector, A::Transpose{<:Any, <:StaticMatrix}, B::StaticVector, α::Real, β::Real) =
    _tmul!(Size(dest), dest, Size(A.parent), Size(B), A.parent, B, α, β)
@inline mul!(dest::StaticMatrix, A::Transpose{<:Any, <:StaticMatrix}, B::StaticMatrix, α::Real, β::Real) =
    _mul!(Size(dest), dest, Size(A), Size(B), A, B, α, β)
```
Is it possible to arrange this dispatch a bit more cleanly rather than having quite so much code duplication here? For example, you can define a type alias similar to `StaticMatrixLike` which captures both `StaticMatrix` and `Transpose` (and possibly `Adjoint`). Then a single overload of `mul!` may be sufficient.
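A minimal sketch of that pattern, hedged: the names `MatLike` and `trans_char` are illustrative, and it is written against Base's `Matrix` so it runs standalone, whereas the real alias would capture `StaticMatrix` the same way:

```julia
using LinearAlgebra

# One alias covering a plain matrix and both lazy wrappers, so a single
# method signature can accept all of them.
const MatLike{T} = Union{Matrix{T},
                         Transpose{T, Matrix{T}},
                         Adjoint{T, Matrix{T}}}

# A single overload now dispatches on all three forms; the wrapper is
# unpeeled inside the function body rather than in separate methods.
trans_char(A::MatLike) = A isa Transpose ? 'T' : (A isa Adjoint ? 'C' : 'N')
```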
Thanks for the feedback! This is my first PR, so I certainly have a lot to learn! I've rehashed my solution and made it shorter and hopefully more along the lines of what you're looking for. I've merged the 5-argument `mul!` methods.

The transposes are particularly tricky since they tend to allocate at the slightest provocation. My solution avoids this allocation with a function barrier and instead caches a transpose flag in with the type.

The benchmarks are pretty much the same as they were before.
I missed a test case that happens to be breaking pretty much everything. All my tests passed in magic numbers for alpha and beta, but when they're variables it breaks.
Okay, I think it should be pretty close with the most recent commit. I hard-coded the

NOTE THAT USING

I also fixed the performance decline for the chunked unrolling, so the new version has improved or equivalent performance on all sizes.

I guess I should also give some motivation for why I think these methods are important. In my package

Newest benchmarks:

```julia
function benchmark_matmul(N1, N2, ArrayType=MArray)
    if ArrayType <: MArray
        Mat = MMatrix
        Vec = MVector
    elseif ArrayType <: SizedArray
        Mat = SizedMatrix
        Vec = SizedVector
    end
    α, β = 1.0, 1.0
    A = rand(Mat{N1,N2})
    B = rand(Mat{N2,N2})
    C = rand(Mat{N1,N2})
    println("C = A*B")
    @btime mul!($C, $A, $B)
    println("C += A*B")
    @btime mul!($C, $A, $B, $α, $β)
    println("B = A'C")
    @btime mul!($B, Transpose($A), $C)
    println("B += A'C")
    @btime mul!($B, Transpose($A), $C, $α, $β)
end
```
`MArray`

Size | branch | C = A*B | C += A*B | B = A'B | B += A'B
---|---|---|---|---|---
3x3 | master | 6.6 ns (0 allocs) | 193.4 ns (3 allocs) | 9.2 ns (0 allocs) | 193.8 ns (3 allocs)
3x3 | new | 6.7 ns (0 allocs) | 8.1 ns (0 allocs) | 6.6 ns (0 allocs) | 8.1 ns (0 allocs)
10x10 | master | 123.3 ns (0 allocs) | 889.5 ns (7 allocs) | 489.6 ns (0 allocs) | 683.9 ns (3 allocs)
10x10 | new | 124.1 ns (0 allocs) | 134.1 ns (0 allocs) | 196.9 ns (0 allocs) | 207.2 ns (3 allocs)
20x20 | master | 613.7 ns (0 allocs) | 4.83 μs (7 allocs) | 3.76 μs (0 allocs) | 4.21 μs (3 allocs)
20x20 | new | 613.7 ns (0 allocs) | 580.9 ns (0 allocs) | 613.4 ns (0 allocs) | 591 ns (3 allocs)
`SizedArray`

Size | branch | C = A*B | C += A*B | B = A'B | B += A'B
---|---|---|---|---|---
3x3 | master | 10 ns (0 allocs) | 195 ns (3 allocs) | 10 ns (0 allocs) | 194 ns (0 allocs)
3x3 | new | 10 ns (0 allocs) | 11 ns (0 allocs) | 11 ns (0 allocs) | 11 ns (0 allocs)
10x10 | master | 119 ns (0 allocs) | 905 ns (7 allocs) | 406 ns (0 allocs) | 680 ns (3 allocs)
10x10 | new | 125 ns (0 allocs) | 129 ns (0 allocs) | 195 ns (0 allocs) | 199 ns (0 allocs)
20x20 | master | error | error | error | error
20x20 | new | 617 ns (0 allocs) | 581 ns (0 allocs) | 612 ns (0 allocs) | 576 ns (0 allocs)
I fixed the backwards-compatibility problem. Are there further changes that need to be made?
src/matrix_multiply_add.jl (Outdated)

```julia
        a::StaticMatrix, b::StaticMatrix, _add::MulAddMul) =
    mul_blas!(transpose(Sc), c, transpose(Sb), transpose(Sa), b, a, _add)

# TODO: Get this version of mul_blas! working so there's backward-compatibility with older Julia versions
```
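The swapped-argument trick in this snippet relies on the identity (A·B)ᵀ = Bᵀ·Aᵀ: a multiply into a "transposed" destination can be rewritten as a multiply of the swapped, transposed operands. A quick standalone check of the identity:

```julia
using LinearAlgebra

A, B = rand(3, 4), rand(4, 2)
# (A*B)ᵀ equals Bᵀ * Aᵀ, so swapping and transposing the operands
# lets one kernel handle the transposed-destination case.
@assert transpose(A * B) ≈ transpose(B) * transpose(A)
```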
Is this TODO still needed?
Nope. Fixed, and deleted the old code.
test/matrix_multiply_add.jl (Outdated)

```julia
@test b ≈ 5A'c

bmark = @benchmark mul!($c,$A,$b) samples=10 evals=10
@test minimum(bmark).allocs == 0
```
Is it possible to use `@allocated`, or does that not work properly? (I know we've had some problems with it lately related to changes in Base.)
I couldn't get it to not record allocations with `@allocated`. I'm guessing `@benchmark` will be more accurate? I've added commented-out lines using `@test_noalloc` below each test, as well as the allocations recorded.
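One common reason `@allocated` over-reports is that the first call to a function includes compilation's allocations. A small illustrative wrapper (not the PR's code; the name `allocs_of` is made up here) shows the usual warm-up workaround:

```julia
# Illustrative workaround: call once to force compilation, then measure
# the second, already-compiled call with @allocated.
function allocs_of(f)
    f()                 # warm-up call triggers compilation
    @allocated f()      # measures only the compiled call
end
```

Even with the warm-up, dynamic dispatch at the closure call can still report spurious allocations, which is one reason `@benchmark` (with its `evals`/`samples` machinery) tends to be more reliable here.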
Oops, it looks like I caused a conflict by merging some unrelated test tools. I merged into your branch, which will hopefully fix that up.

Generally this is looking quite good now (a lot less scary in terms of amount of code added!). Great job with the table of benchmarks; there are some really impressive improvements there. Could you add the benchmarks to the `benchmarks` directory?
test/matrix_multiply_add.jl (Outdated)

```julia
    # println("B = A'C + B")
    # @btime mul!($B,Transpose($A),$C,$α,$β)
# end
# benchmark_matmul(20,20,SizedArray)
```
Let's move this into the `benchmarks` directory so it'll be kept up to date.
Done.
```julia
        $ex
        @test(@allocated($ex) == 0)
    end)
end
```
If you've found `@benchmark` reliable for this, let's just make `test_noalloc` use that instead? Something like:

```julia
macro test_noalloc(ex)
    quote
        bmark = @benchmark $(esc(ex)) samples=10 evals=10
        @test minimum(bmark).allocs == 0
    end
end
```

You could put this in `test_util` as well, and we can use it everywhere instead of `@allocated`? Couple of questions:

- We don't actually want to do a benchmark here, so it seems a bit heavyweight. We can put up with that though, unless it slows down the tests a lot?
- Can you reduce the samples and evals?
- Is the `minimum` there required, or will anything other than the maximum do? (Are we measuring the right thing?)
Ugh, there's something really broken about the way Julia or BenchmarkTools processes the correct-seeming use of `esc` in the nested macro above. Also, I didn't realize quite what `minimum` does; how about this instead:

```julia
macro test_noalloc(ex)
    ex = :(@test minimum(@benchmark($ex, samples=10, evals=10)).allocs == 0)
    ex.args[2] = __source__ # trick: make `@test` aware of the source location of `@test_noalloc`
    esc(ex)
end
```
The `@test_noalloc` macro sounds like a good idea. We can swap `@benchmark` with something else if these tests slow down the whole test suite a lot. My guess is that it'd be OK as long as you have something like `samples=10 evals=10`.

> Is the `minimum` there required

`bmark` is the `Trial` whose `.allocs` field is already an `Int`. It seems that `Base.push!(t::Trial, time, gctime, memory, allocs)` (which is used to accumulate the benchmark trial results) already takes the `maximum` over all the trials. (Also, `minimum` is the minimum over the wall times, so I don't think we need it anyway.)
Your suggested `test_noalloc(ex)` macro didn't work for me? When I use it within a function (not in global scope) it can't find the local arguments.
Try putting a `$` in front of the local arguments (it's a feature of `@benchmark` to make sure that it captures them properly).
> Your suggested test_noalloc(ex) macro didn't work for me?

Andy's suggestion seems to work. But oh wow, I had a look at `BenchmarkTools.generate_benchmark_definition`; it's kind of a horror show 😬 it calls `Core.eval` in the macro and really seems to go out of its way to subvert the usual macro expansion rules in a weird way. Maybe it has to be that way for some reason, but it sure is doing some weird stuff.
If it's too hard to wrap `@benchmark` reliably, I'm completely OK with just having it written out by hand in a few places. I tried looking more into why `@allocated` is failing, but it quickly led to a very deep rabbit hole which I'm still trying to climb out of.
I went ahead and merged this; we can sort out the rest later. Thanks for a very nice contribution!
This adds support for the new 5-argument in-place matrix multiplication in Julia v1.3 (#357).

We can probably consolidate the multiply-add methods with the matrix multiply methods, but I've left them separate for now.

I've also added a quick "hack" to avoid an invalid pointer error (#734) when doing in-place multiplication with SizedArrays large enough that they get passed to BLAS.
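For reference, the 5-argument form computes `C = A*B*α + C*β` in place. A small standalone example using Base arrays (the `mul!` from the `LinearAlgebra` stdlib, Julia ≥ 1.3):

```julia
using LinearAlgebra

A = [1.0 2.0; 3.0 4.0]
B = [5.0 6.0; 7.0 8.0]
C = ones(2, 2)
# Overwrite C with A*B*α + C*β (here α = 2.0, β = 1.0) without temporaries.
mul!(C, A, B, 2.0, 1.0)
# A*B == [19 22; 43 50], so C is now 2*(A*B) + ones(2,2) == [39 45; 87 101]
```

This PR gives `StaticMatrix`/`MMatrix`/`SizedMatrix` arguments the same 5-argument method, without falling back to the allocating generic path.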