
add bilinear upsampling #262


Merged
merged 12 commits into from Jan 8, 2021

Conversation

CarloLucibello
Member

@CarloLucibello CarloLucibello commented Dec 30, 2020

Co-authored-by: @ltjkoomen

Follow-up to FluxML/Flux.jl#1180.
I tried to preserve authorship with the commit message, but it looks like I wasn't successful.
If someone knows how to do it, please tell me; it would be good to give @ltjkoomen
the credit he deserves for such good work (I only made a few minor changes).

Only 2D bilinear upsampling is implemented, with the (hopefully future-proof) API:

bilinear_upsampling(x, k::NTuple)
∇bilinear_upsampling(Δ, k::NTuple) 
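
A hypothetical usage sketch of that API (the array layout and return shapes below are my assumptions from the discussion, not guaranteed by the PR):

x = rand(Float32, 4, 4, 3, 1)          # assumed W×H×C×N input
y = bilinear_upsampling(x, (2, 2))     # upsample both spatial dims 2x -> 8×8×3×1
Δ = ones(Float32, size(y))             # incoming gradient, same shape as y
dx = ∇bilinear_upsampling(Δ, (2, 2))   # pullback w.r.t. x -> 4×4×3×1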

TODO:

  • use NNlib.conv instead of Flux.Conv
  • add tests
  • works on CPU
  • works on GPU
  • working AD and chainrules integration
  • more tests and cleanup

@CarloLucibello CarloLucibello marked this pull request as draft December 30, 2020 14:32
@CarloLucibello CarloLucibello marked this pull request as ready for review December 30, 2020 18:50
@CarloLucibello
Member Author

@maxfreu this also works on GPU! Can you benchmark it against your implementation?
It would be nice to have just one generic implementation (provided it is fast).

@maxfreu
Contributor

maxfreu commented Dec 30, 2020

Hey, please give me another few days until I'm back home mid next week, then I'll check. Have a good start into the new year! :)

@CarloLucibello
Member Author

happy holidays!

@DhairyaLGandhi
Member

There are a number of review comments in the Flux PR which would also be good to go over before merging this.

@CarloLucibello
Member Author

I think I have already addressed them

@maxfreu
Contributor

maxfreu commented Jan 2, 2021

I think the function should be renamed to upsample_bilinear_2d, so that later upsample_nearest and upsample_xyz_Nd can be added and nicely group together. I also don't like the lowercase 2d, but lowercase is more consistent, so better to keep it this way.

When I checked the performance, the hand-written kernel was 3x faster in the forward pass; I didn't test the backward. But that was measured with only one size; I will report again when I've had the time to benchmark properly. Furthermore, I don't think the PyTorch kernel corrects the edges of the gradient. But maybe what's good enough for them also works for us.

@CarloLucibello
Member Author

I think the function should be renamed to upsample_bilinear_2d, so that later upsample_nearest and upsample_xyz_Nd can be added and nicely group together. I also don't like the lowercase 2d, but lowercase is more consistent, so better to keep it this way.

I can rename it to upsample_bilinear; no need to add 2d, since we can use dispatch to cover the different cases.
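
A sketch of what that dispatch could look like (hypothetical stubs, not the merged code):

# The array rank and tuple length select the case, so no _2d/_3d name suffix is needed.
upsample_bilinear(x::AbstractArray{T,4}, k::NTuple{2,Int}) where {T} = error("2D bilinear case, W×H×C×N")
upsample_bilinear(x::AbstractArray{T,5}, k::NTuple{3,Int}) where {T} = error("3D trilinear case, W×H×D×C×N")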

Furthermore, I don't think the PyTorch kernel corrects the edges of the gradient. But maybe what's good enough for them also works for us.

I can remove the edge correction, but that will cause the gradient test to fail. I would like to see the discussion that led them to that decision before going down the same path.

@mcabbott
Member

mcabbott commented Jan 3, 2021

I think the function should be renamed to upsample_bilinear_2d, so that later upsample_nearest

Doesn't "bilinear" imply 2D? Then it could be just upsample_bilinear, and (as you say) upsample<tab> may one day list other options.

@maxfreu
Contributor

maxfreu commented Jan 4, 2021

I just benchmarked on GPU using two different sizes which I took from the original U-Net, so I suspect they are representative:

196×196×128×1, time in ms:

|          | this PR | PyTorch translation |
|----------|---------|---------------------|
| forward  | 8.1     | 0.6                 |
| backward | 3       | 3.3                 |

32×32×1024×1, time in ms:

|          | this PR | PyTorch translation |
|----------|---------|---------------------|
| forward  | 2.3     | 0.5                 |
| backward | 16.2    | 5.6                 |

GPU: Nvidia 1080 Ti, CUDA.jl 2.3, CUDA 11.0

Note that the upsampling and gradient results are not the same; they differ at sharp edges in the image. The PyTorch version indeed doesn't perform corrections at the image border, but I couldn't trace any discussion about it in the PyTorch GitHub.

Thinking further about the API, I suggest the following:

upsample_bilinear(x, scale::NTuple)
∇upsample_bilinear(Δ, x)

That way we can easily recover the original image size, even if we later allow real scaling factors; we don't have to hope that the rounding to an integer image size goes right.
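
With integer scales, for instance, that recovery is a one-liner (hypothetical helper, assuming W×H×C×N layout):

# Each spatial dim of Δ is an exact integer multiple of the corresponding input dim.
input_size(Δ, k::NTuple{2,Int}) = (size(Δ, 1) ÷ k[1], size(Δ, 2) ÷ k[2], size(Δ, 3), size(Δ, 4))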

@mcabbott
Member

mcabbott commented Jan 4, 2021

Maybe ∇upsample_bilinear(Δ, size::Tuple) would be even better, in case x can be freed?

I guess using upsample_bilinear(x, size::Tuple) for the forward function would imply that arbitrary scales are accepted. I haven't worked out whether that would be hard or easy here.

@maxfreu
Contributor

maxfreu commented Jan 5, 2021

Maybe ∇upsample_bilinear(Δ, size::Tuple) would be even better, in case x can be freed?

Yes, that's even better.

I guess using upsample_bilinear(x, size::Tuple) for the forward function would imply that arbitrary scales are accepted. I haven't worked out whether that would be hard or easy here.

Yes, that would imply that arbitrary, non-integer scales are accepted, which is not the case yet. However, in DL you most often want to upsample by a factor of 2 to compensate for pooling. So I would keep the scale tuple in and restrict it to Int for a start. Later we can maybe add a size keyword to allow specifying an exact output size.

To summarize:

upsample_bilinear(x, scale::NTuple{2, Int})
# later maybe something like:
upsample_bilinear(x, scale=(1, 1); sz::Union{Nothing, NTuple}=nothing)

∇upsample_bilinear(Δ, sz::NTuple{4, Int}) # sz = size(x)

After these changes, I'd say it's almost ready to merge. Maybe the tests should be extended to c, n != 1 (see the sketch below), but then it's good to go. Performance improvements and specialized GPU stuff can come later.
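
A minimal sketch of such a test (hypothetical sizes, assuming the upsample_bilinear name settled on above):

using Test

x = rand(Float32, 8, 8, 3, 2)        # C = 3, N = 2, i.e. c, n != 1
y = upsample_bilinear(x, (2, 2))
@test size(y) == (16, 16, 3, 2)
@test eltype(y) == Float32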

By the way: would it be acceptable for the CPU and GPU implementations to produce slightly different results?

@DhairyaLGandhi
Member

Let's keep the edge correction; it seems like a more complete implementation.

@DhairyaLGandhi
Member

@maxfreu could you suggest the performance improvements here before we move forward?

weight = similar(Δ, eltype(Δ), (size(kern)..., n_chan, n_chan))
weight .= 0

for i in 1:n_chan
Member

Use cat and repeat here?

Contributor

Does repeat support GPU Arrays?

Member Author

I'm not sure that weight can be created anew by using cat and repeat

Contributor

Maybe also use zero(weight)? Or is .= 0 type-stable?

Member

Something like this?

julia> t = zeros(3,3,3,3);

julia> w = rand(3,3)
3×3 Array{Float64,2}:
 0.966086  0.0129205  0.315534
 0.823242  0.245357   0.179242
 0.123778  0.372735   0.869323

julia> for i in 1:3
         t[:,:,i,i] = w
       end

julia> cat(Iterators.repeated(w, 3)..., dims = (3,4)) == t
true

@maxfreu
Contributor

maxfreu commented Jan 5, 2021

@maxfreu could you suggest the performance improvements here before we move forward

I don't have much to say about the current implementation, as I haven't looked into it deeply; @ltjkoomen would be of more help here. Note, however, that the forward pass is single-threaded on CPU, whereas the backward is multi-threaded through conv + NNPACK. So multi-threading would help.
Regarding the GPU, I can only offer to PR my PyTorch port to CUDA. But my skills are probably not sufficient to make the results bit-wise equal to this code.
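
A minimal sketch of threading the forward pass over the channel and batch dimensions (illustrative only; upsample_plane! is a hypothetical kernel that fills one W×H output slice):

using Base.Threads

function upsample_forward!(y, x, upsample_plane!)
    C, N = size(x, 3), size(x, 4)
    @threads for j in 1:(C * N)                  # one task per (channel, batch) pair
        c, n = (j - 1) % C + 1, (j - 1) ÷ C + 1
        upsample_plane!(view(y, :, :, c, n), view(x, :, :, c, n))
    end
    return y
end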

@CarloLucibello
Member Author

@maxfreu your benchmarks show that your implementation is much faster on GPU, both in the forward pass (where there is no edge correction) and in the backward. Having discrepancies between GPU and CPU versions is not ideal though, unless they are very small. Any chance your port could be made to run on CPU as well?

@CarloLucibello
Member Author

I prefer to stick to ∇upsample_bilinear(Δ, k) instead of ∇upsample_bilinear(Δ, size(x)), because internally it would use k in any case, and passing a factor in the forward pass but the input size in the backward is a bit confusing.
I don't see any problem in passing the scale k also in the non-integer case:
the equation y = floor(x * k) has a unique integer solution x for k >= 1 and any y.
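
A quick check of that claim (illustrative; x is pinned to the interval [y/k, (y+1)/k), which has length 1/k <= 1 and therefore contains exactly one integer):

recover(y, k) = ceil(Int, y / k)     # the unique integer in [y/k, (y+1)/k)

# exact with Rational k, so no floating-point edge cases
@assert all(recover(floor(Int, x * k), k) == x
            for x in 1:100, k in (1//1, 3//2, 2//1, 37//10))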

@maxfreu
Contributor

maxfreu commented Jan 5, 2021

Any chance your port could be made to run also on cpu?

Well, there is always this which we can slam in...

the equation y = floor(x * k) has a unique integer solution x for k >= 1 and any y.

Convinced. Keep k.

Edit: The CPU kernel implementations for upsampling can be found here - almost 600 lines of code, covering different methods and dimension orderings. I wonder what this would look like in Julia, but I think that's overkill for a start.

@FluxML FluxML deleted a comment from CarloLucibello Jan 5, 2021
@CarloLucibello
Member Author

Edit: The CPU kernel implementations for upsampling can be found here - almost 600 lines of code, covering different methods and dimension orderings. I wonder what this would look like in Julia, but I think that's overkill for a start.

Yeah, I think for now the important thing is to introduce the ton of functionality we're missing; decent performance is good enough, and we can optimize further later. Moreover, I haven't ported C++ code in the last couple of years; I'm very happy about that and don't want to ruin it 😄

@DhairyaLGandhi
Member

I would like the implementation to be correct with respect to its academic paper, with a good performance benchmark or known issues. Optimising for performance later becomes much easier that way; otherwise, in my experience, it's a much slower process.

@CarloLucibello
Member Author

This is good to go; we can merge when the tests complete.

@DhairyaLGandhi
Member

This is doing a bunch of allocation, and since NNlib is supposed to be where performance is tuned, we should try to understand where and how we can reduce that.

@CarloLucibello
Member Author

This is doing a bunch of allocation, and since NNlib is supposed to be where performance is tuned, we should try to understand where and how we can reduce that.

I don't have the time (and honestly also the will) to do this. My intention here is to add some functionality and not let the work in FluxML/Flux.jl#1180 go to waste. Anyone is very welcome to improve on it later, but it's not going to be me.
We can file an issue as a reminder to check performance, although some benchmarking has already been done by @maxfreu above. Again, as I said, we are missing a lot of functionality in Flux, and that is a showstopper for many, much more so than performance issues.

@maxfreu
Contributor

maxfreu commented Jan 7, 2021

My intention here is to add some functionality and don't let the work in FluxML/Flux.jl#1180 go wasted.

I would second that.

Remember to change the name to upsample_bilinear.

@maxfreu
Contributor

maxfreu commented Jan 7, 2021

Sorry to annoy you, but I see a 7x performance decrease (15ms -> 110ms for 32x32x1024x1) in ∇bilinear_upsample when cats are added 🚫 🐱

@maxfreu
Contributor

maxfreu commented Jan 8, 2021

Hey, I wanted to thank everybody for the effort, especially @CarloLucibello! I learned a lot :-) When I find the time, I can try to port the C++ stuff and bring the CPU and GPU implementations in line.

@CarloLucibello
Member Author

Thank you for the review! It would be nice to leverage the Julia ecosystem for this, e.g. ImageTransformations.jl or Interpolations.jl, but I fear it won't happen anytime soon, as we need GPU-compatible and differentiable operators.

@maxfreu maxfreu mentioned this pull request Jan 25, 2021
@CarloLucibello CarloLucibello deleted the cl/upsample branch June 15, 2023 17:06