unbreak CI for now #1822


Closed. Wants to merge 3 commits.

Changes from 1 commit
test/cuda/layers.jl (3 changes: 2 additions & 1 deletion)

@@ -46,7 +46,8 @@ function gpu_gradtest(name::String, layers::Vector, x_cpu = nothing, args...; te
  # test
  if test_cpu
    if VERSION >= v"1.7" && layer === GroupedConvTranspose && args[end] == selu
-     @test_broken y_gpu ≈ y_cpu rtol=1f-3 atol=1f-3
+     # FIXME revisit this after CUDA deps on CI are updated
+     @test y_gpu ≈ y_cpu rtol=2 atol=2
Member:
Doesn't really help here since the error bounds are pretty high and the broken test is already specific to ConvTranspose + selu. Can we specify a kind of failure we expect? Say we expect the test to fail but not error?

Member Author (@ToucheSir), Jan 4, 2022:
Nope, because the test doesn't always fail! It was a choice between this and skipping the test entirely.
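
For context (an aside, not from the thread): in Julia's Test stdlib, @test_broken reports an "Unexpected Pass" failure whenever its expression evaluates to true, so a comparison that only sometimes exceeds the tolerance would turn CI red intermittently. A minimal sketch of the two macros, with a hypothetical flaky() standing in for the GPU/CPU comparison:

    using Test

    flaky() = rand() < 0.5   # stand-in for the intermittently failing y_gpu ≈ y_cpu check

    @testset "broken vs. skip" begin
        # @test_broken records Broken when the expression is false (or throws),
        # but reports an "Unexpected Pass" failure when it evaluates to true,
        # so with a flaky expression this testset goes red roughly half the time.
        @test_broken flaky()

        # @test_skip never evaluates the expression and always records Broken,
        # which is why it comes up below as the alternative to wide tolerances.
        @test_skip flaky()
    end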

Member:
I'd rather avoid having high error tolerances, since that isn't very helpful in the real world, and it's unlikely an error would be raised in this code path (although I'd rather retain the test). Something that doesn't error and gives inaccurate answers would be hard to debug! Can we compare against a standard (say TF/PyTorch) implementation?
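
A hypothetical sketch of what such a reference check could look like; the constructor mirrors the grouped ConvTranspose + selu case under discussion, and reference_weights / reference_bias / reference_input / reference_output are placeholders that would have to be exported once from a PyTorch ConvTranspose2d run:

    using Flux, Test

    # Hypothetical reference test against values exported from PyTorch.
    layer = ConvTranspose((3, 3), 8 => 8, selu; groups = 4)
    # copyto!(layer.weight, reference_weights)   # placeholder: dumped from PyTorch
    # copyto!(layer.bias, reference_bias)        # placeholder: dumped from PyTorch
    # x = reference_input                        # placeholder: dumped from PyTorch
    # @test layer(x) ≈ reference_output rtol=1f-3 atol=1f-3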

Member Author:
We could, but that doesn't address the issue of very high variance in results. The main problem is that we (or at least I) can't figure out where that variance is coming from. It could be something deep within the bowels of cuDNN, and since the forward pass of ConvTranspose is literally 1 cuDNN call + broadcasted bias add + broadcasted activation, there'd be very little we could do about that.

All that said, I'm happy to change it to a @test_skip if you feel that's more appropriate.
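
For reference, a rough sketch of the forward pass being described, as a hypothetical convtranspose_forward helper (simplified, not Flux's exact code): one transposed-convolution kernel call, a broadcasted bias add, and a broadcasted activation. On CUDA arrays the NNlib.∇conv_data call lowers to a single cuDNN convolution-backward-data kernel, and cdims is the transposed-convolution geometry that Flux computes internally from the layer and the input:

    using NNlib

    # Simplified sketch of ConvTranspose's forward pass; bias is assumed to be
    # already reshaped for broadcasting over the channel dimension.
    function convtranspose_forward(x, weight, bias, σ, cdims)
        y = NNlib.∇conv_data(x, weight, cdims)   # the single cuDNN call on the GPU
        return σ.(y .+ bias)                     # broadcasted bias add + activation
    end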

Member:
Is the issue that we might see red CI spuriously if this test passes as-is on master? I don't think we've encountered that very frequently with the current setup, right?

I'm fairly certain that the underlying issue would be in CUDA / cuDNN, and that would be pretty much out of our hands at that point. To fix this we'd need Julia kernels, which might not be the worst idea, but seeing as the motivation is to fix one combination of conv and activation, it's fair to say that it would be low priority with little overall benefit.

Member:
If we are to leave this test here with a wide tolerance, it would be good to know which CUDA deps the comment refers to and how we'd be alerted when this combination works to a decent degree again.

Member Author:
That's the thing, I don't know because I couldn't repro anything. It may well be that the CUDA deps are a red herring and the problem lies elsewhere (say with GPUCompiler + the new LLVM + compiler changes on Julia 1.7). I've added a commit with the last CUDA.versioninfo() output I got out of Buildkite.
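
For anyone trying to reproduce this, the dependency report mentioned above comes from something like:

    using CUDA

    # Prints the toolkit and driver versions, the library versions in use
    # (including cuDNN), and the visible GPUs.
    CUDA.versioninfo()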

Member Author:
Thoughts? Concerns? Again, I'm happy to turn this into a @test_skip with the previous tolerance to get the PR merged.

Member:
Is there a functional difference here between @test_skip and the adjusted tolerances? They are so wide, I would expect a 200% relative tolerance to be tantamount to skipping any testing. So, let's just change to @test_skip and merge.
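
A quick check of that claim (an aside, not from the thread): for arrays, isapprox compares norm(x - y) against max(atol, rtol * max(norm(x), norm(y))), and by the triangle inequality norm(x - y) <= norm(x) + norm(y) <= 2 * max(norm(x), norm(y)), so rtol = 2 can only fail on non-finite values:

    using Test

    x = randn(Float32, 16)
    y = -1f4 .* x .+ 7f0   # nothing like x

    # Passes despite the values being completely unrelated, because
    # norm(x - y) <= norm(x) + norm(y) <= 2 * max(norm(x), norm(y)).
    @test isapprox(y, x; rtol = 2, atol = 2)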

Member Author:
Done.

    else
      @test y_gpu ≈ y_cpu rtol=1f-3 atol=1f-3
    end
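
Based on the resolution above ("Done."), the final form of this hunk presumably reads roughly as follows; this is a sketch inferred from the thread, not the merged diff:

    if VERSION >= v"1.7" && layer === GroupedConvTranspose && args[end] == selu
      @test_skip y_gpu ≈ y_cpu rtol=1f-3 atol=1f-3   # skip, keeping the previous tolerance
    else
      @test y_gpu ≈ y_cpu rtol=1f-3 atol=1f-3
    end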