-
Notifications
You must be signed in to change notification settings - Fork 5.8k
More data types supported in cv::cuda::transpose() #3371
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
chacha21
wants to merge
9
commits into
opencv:4.x
Choose a base branch
from
chacha21:cuda_transpose
base: 4.x
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from 1 commit
Commits
Show all changes
9 commits
Select commit
Hold shift + click to select a range
51180e9
More data types supported in cv::cuda::transpose()
chacha21 be2dd8d
added fast paths
chacha21 9e3eb3d
missing cudaDeviceSynchronize()
chacha21 00abef3
split (rows == 1) and (cols == 1) cases
chacha21 e5b152e
code clarity
chacha21 4f0e470
updated according to suggestions
chacha21 43a1691
removed useless and dead code
chacha21 29cb16a
support legacy npp stream handling
chacha21 3672694
Style : use tables of functions
chacha21 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be applied to
gridTranspose()
as well, I can't understand why it was previously missing, surely the result could have been in flight on returns from this function when the default stream is passed?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have the opposite question : why is it ever needed ? I let them here, but I don't understand the purpose of this extra synchronization when the default stream (sync by default) is used.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kernel launches are asynchronous with respect. The default stream syncs with other streams by default in legacy mode.
My interpretation is that the OpenCV API works on the assumption that if a stream isn't passed the user wants synchronization.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean that now that exist both
cudaStreamPerThread
andcudaStreamLegacy
, people using OpenCV always expect thecudaStreamLegacy
behaviour with the default stream, thus requiring thecudaDeviceSynchronize()
?If so, I agree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No users of the api will expect functions which they don't pass a stream to, to be synchronous with respect to the host when they return. This would not be the case with either
cudaStreamPerThread
orcudaStreamLegacy
as the kernel launch is asynchronous in both cases.In the example from the docs
k_2 waits on k_1 because k_2 is in the legacy default stream, then k_3 waits on the legacy stream. Because of the specific way this has been set up k_1 and k_2 have finished executing before the call to
k_3<<<1, 1, 0, s>>>()
however the result from k_3 may still be in flight after control has returned to the host when you reach line 4.Now I haven't used per thread default streams (I always use explicit streams) but my understanding is that if the CUDA_API_PER_THREAD_DEFAULT_STREAM macro was used to enable per thread default streams k_1 would run before k_3 but both would be asynchronous with respect to k_2. Either way when control returns to the host on line 4 they may still all be in flight.
On the other hand if we have the following which is our case if no stream is passed
in either case
cudaStreamPerThread
orcudaStreamLegacy
all three kernels may still be in flight when control returns to the host on line 4 if we don't explicitly syncronize.Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, anyway, I just checked and
gridTranspose()
does ultimately callcudaDeviceSynchronize()
for null stream (see https://github.com/opencv/opencv_contrib/blob/4.x/modules/cudev/include/opencv2/cudev/grid/detail/transpose.hpp)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cudawarped Do you have open issues with the PR. I want to merge, if you do not mind.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@asmorkalov I am not 100% convinced that
isNppiNativelySupported
andisElemSizeSupportedByNppi
flags are needed. I can understand that @chacha21 wants a fall back option in case something goes wrong and because of that wants to seperate out the two logic paths but I am not sure if the redundant calls underisElemSizeSupportedByNppi
make the function harder to maintain, what do you think?Additionally
elemSize1
is unused, theisElemSizeSupportedByNppi
logic path won't be fully tested and due to the names of the nppi functions it may make more sense to just use the bit size instead of examiningelemSize
, i.e.could be
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whatever the decision, I will certainly agree : I explained my initial code structure, but I am OK to adapt for a more "OpenCV style"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (!(elemSize%2) && ((elemSize/2)==2))
looks very cryptic. I understand the logic behind it, butif (elemSize==4)
definitely more readable.