add bilinear upsampling #262
Conversation
@maxfreu this also works on GPU! Can you benchmark it against your implementation?
Hey, please give me another few days until I'm back home mid next week, then I'll check. Have a good start into the new year! :)
Happy holidays!
There are a number of review comments in the Flux PR which would also be good to go over before merging this.
I think I have already addressed them.
I think the function should be renamed. When I checked the performance, the hand-written kernel was 3x faster in the forward pass; I didn't test the backward. But that was measured with only one size, so I will report again when I've had the time to benchmark properly. Furthermore, I don't think the pytorch kernel corrects the edges of the gradient. But maybe what's good enough for them also works for us.
I can rename it. I can remove the edge correction, but that will cause the gradient test to fail. I would like to see the discussion that led them to that decision before going down the same path.
Doesn't "bilinear" imply 2D? Then it could be just |
I just benchmarked on GPU using two different sizes which I took from the original U-Net, so I suspect they are representative: 196x196x128x1, time in ms
32x32x1024x1, time in ms
GPU: Nvidia 1080 Ti, CUDA.jl 2.3, CUDA 11.0. Note that the upsampling and gradient results are not the same; they differ at sharp edges in the image. The pytorch version indeed doesn't perform corrections at the image border, but I couldn't trace any discussion in the pytorch GitHub.
Thinking further about the API, I suggest the following:

```julia
upsample_bilinear(x, scale::NTuple)
∇upsample_bilinear(Δ, x)
```

That way we can easily recover the original image size, even if we later allow real scaling factors; we don't have to hope that the rounding to integer image size goes right.
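For concreteness, here is a minimal CPU sketch of what a function with that signature could compute. This is not the PR's implementation; the name `upsample_bilinear_naive`, the WHCN layout, and the align_corners-style index mapping are assumptions for illustration:

```julia
# Naive bilinear upsampling sketch for a WHCN array and integer scales.
# Assumes the align_corners convention: output index i maps to source
# coordinate (i - 1) * (in - 1) / (out - 1) + 1, so corners map to corners.
function upsample_bilinear_naive(x::AbstractArray{T,4}, scale::NTuple{2,Int}) where T
    w, h, c, n = size(x)
    W, H = w * scale[1], h * scale[2]
    y = similar(x, W, H, c, n)
    for l in 1:n, k in 1:c, j in 1:H, i in 1:W
        fx = W == 1 ? 1.0 : (i - 1) * (w - 1) / (W - 1) + 1  # source x coordinate
        fy = H == 1 ? 1.0 : (j - 1) * (h - 1) / (H - 1) + 1  # source y coordinate
        x0, y0 = floor(Int, fx), floor(Int, fy)              # top-left neighbor
        x1, y1 = min(x0 + 1, w), min(y0 + 1, h)              # bottom-right, clamped
        dx, dy = fx - x0, fy - y0                            # interpolation weights
        y[i, j, k, l] = (1 - dx) * (1 - dy) * x[x0, y0, k, l] +
                        dx * (1 - dy) * x[x1, y0, k, l] +
                        (1 - dx) * dy * x[x0, y1, k, l] +
                        dx * dy * x[x1, y1, k, l]
    end
    return y
end
```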
Maybe accept a target size instead? I guess using the size would be more flexible.
Yes, that's even better.
Yes, that would imply that arbitrary, non-integer scales are accepted, which is not the case yet. However, in DL you most often want to upsample by a factor of 2 to compensate pooling, so I would keep the integer scale for now. To subsume:

```julia
upsample_bilinear(x, scale::NTuple{2, Int})
# later maybe sth like:
upsample_bilinear(x, scale=(1,1); sz::Union{Nothing, NTuple}=nothing)
∇upsample_bilinear(Δ, sz::NTuple{4, Int})  # sz = size(x)
```

After these changes I'd say it's almost ready to merge. Maybe the tests should be extended to c, n != 1, but then it's good to go. Performance improvements and specialized GPU stuff can come later. By the way: would it be acceptable for the CPU and GPU implementations to produce slightly different results?
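In shapes, the point of the proposal is roughly the following (hypothetical usage of the proposed API; these functions did not yet exist in NNlib at the time of this discussion):

```julia
# Assumed signatures from the proposal above, shown on the benchmark size.
x = rand(Float32, 32, 32, 1024, 1)   # WHCN input
y = upsample_bilinear(x, (2, 2))     # -> 64×64×1024×1
Δ = ones(Float32, size(y))           # incoming gradient
# Passing size(x) explicitly lets the pullback recover 32×32×1024×1 exactly,
# with no rounding even if non-integer scales are allowed later:
dx = ∇upsample_bilinear(Δ, size(x))
```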
Let's keep the edge correction; it seems like a more complete implementation.
@maxfreu could you suggest the performance improvements here before we move forward?
```julia
weight = similar(Δ, eltype(Δ), (size(kern)..., n_chan, n_chan))
weight .= 0
for i in 1:n_chan
```
Use `cat` and `repeat` here?
Does `repeat` support GPU arrays?
I'm not sure that `weight` can be created anew by using `cat` and `repeat`.
Maybe also use `zero(weight)`? Or is `.= 0` type stable?
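For reference, a quick comparison of the options (an assumption about the intent here, not code from the PR): all three zero the buffer, but only one allocates:

```julia
weight = rand(Float32, 3, 3, 2, 2)
weight .= 0          # in-place broadcast: type stable, no allocation
fill!(weight, 0)     # equivalent in-place fill
w2 = zero(weight)    # allocates a fresh zeroed array of the same type and size
```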
Something like this?

```julia
julia> t = zeros(3,3,3,3);

julia> w = rand(3,3)
3×3 Array{Float64,2}:
 0.966086  0.0129205  0.315534
 0.823242  0.245357   0.179242
 0.123778  0.372735   0.869323

julia> for i in 1:3
           t[:,:,i,i] = w
       end

julia> cat(Iterators.repeated(w, 3)..., dims = (3,4)) == t
true
```
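A tuple-based alternative that is friendlier to inference when splatting is `ntuple`; both variants build the same block-diagonal weight (a hedged suggestion, not code from the PR):

```julia
w = rand(3, 3)
t = cat(ntuple(_ -> w, 3)..., dims = (3, 4))   # 3×3×3×3, w on the (i, i) channel slices
@assert all(t[:, :, i, i] == w for i in 1:3)   # off-diagonal channel slices are zero
```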
I don't have much to say about the current implementation, as I haven't looked into it deeply; @ltjkoomen would be of help here. Note however that the forward pass is single-threaded on CPU, whereas the backward is multi-threaded through conv + NNPACK, so multi-threading would help.
@maxfreu your benchmarks show that your implementation is much faster on GPU, both in the forward pass (where there is no edge correction) and in the backward pass. Having discrepancies between GPU and CPU versions is not ideal though, unless they are very small. Any chance your port could be made to also run on CPU?
I prefer to stick to the Julia implementation.
Well, there is always this, which we could slam in...

Ok, convinced, keep it. Edit: the CPU kernel implementations for upsampling can be found here: almost 600 lines of code, covering different methods and dimension orderings. I wonder what this would look like in Julia, but I think that's overkill for a start.
Yeah, I think for now it is important to introduce the ton of functionality we're missing; decent performance is good enough, and we can optimize further later. Moreover, I haven't ported C++ code in the last couple of years; I'm very happy about that and don't want to ruin it 😄
I would like the implementation to be correct with respect to its academic paper, with a good performance benchmark or known issues documented. Optimizing for performance later becomes much easier that way; otherwise, in my experience, it's a much slower process.
This is good to go; we can merge when tests complete.
This is doing a bunch of allocations, and since NNlib is supposed to be where performance is tuned, we should try to understand where and how we can reduce that.
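As an illustration of the kind of allocation in question (hypothetical helper names, not the PR's code): building the conv weight fresh on every call allocates each time, while writing into a caller-provided buffer does not:

```julia
function make_weight(kern, n_chan)
    # allocates a new array on every call
    weight = similar(kern, size(kern)..., n_chan, n_chan)
    fill!(weight, 0)
    for i in 1:n_chan
        weight[:, :, i, i] .= kern
    end
    return weight
end

function make_weight!(weight, kern)
    # reuses a preallocated buffer instead
    fill!(weight, 0)
    for i in axes(weight, 3)
        weight[:, :, i, i] .= kern
    end
    return weight
end
```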
I don't have the time (and honestly also the will) to do this. My intention here is to add some functionality and not let the work in FluxML/Flux.jl#1180 go to waste. Anyone is very welcome to improve on it later, but this is not going to be me.
I would second that. Remember to change the name as discussed above.
Sorry to annoy you, but I see a 7x performance decrease (15 ms -> 110 ms for 32x32x1024x1).
Hey, I wanted to thank everybody for the effort, especially @CarloLucibello! I learned a lot :-) When I find the time, I can try to port the C++ stuff and bring the CPU and GPU implementations in line.
Thank you for the review! It would be nice to leverage the Julia ecosystem for this, e.g. ImageTransformations.jl or Interpolations.jl, but I fear it won't happen anytime soon, as we need GPU-compatible and differentiable operators.
Co-authored-by: @ltjkoomen
Follow-up to FluxML/Flux.jl#1180.
I tried to preserve authorship with the commit message, but it looks like I wasn't successful. If someone knows how to do it, please tell me; it would be good to give @ltjkoomen the credit he deserves for such good work (I did only a few minor changes).
Only 2D bilinear upsampling is implemented, with the (hopefully future-proof) API.
TODO:
- `NNlib.conv` instead of `Flux.Conv`
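For that TODO item, the layer-free call would look roughly like this (a sketch with illustrative shapes and keywords, not the PR's actual kernel or padding):

```julia
using NNlib
x = rand(Float32, 8, 8, 2, 1)   # WHCN input
w = rand(Float32, 3, 3, 2, 2)   # (kw, kh, cin, cout) weight, e.g. the bilinear kernel
y = conv(x, w; pad = 1)         # plain NNlib.conv call, no Flux.Conv layer needed
```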