Add Otsu's method to cv::cuda::threshold #3943
base: 4.x
Conversation
cc @cudawarped
@troelsy Your test currently doesn't use the Otsu method, and when it does, it fails the assert. I can't review the PR until this is fixed. How does the performance of this compare to the CPU version?
You're measuring the API calls there, not the execution time of the kernels, memory allocations, etc. As you are including synchronization this should be about the same, but in general it is not the same thing. If there weren't any synchronization, the time for the API calls would be much smaller than the execution time.
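As a rough illustration of the distinction, here is a hedged sketch of timing the enqueued device work with CUDA events instead of wall-clocking the host-side calls. The function name and setup are placeholders, and the THRESH_OTSU flag on cv::cuda::threshold assumes this PR's change.

#include <opencv2/cudaarithm.hpp>
#include <opencv2/imgproc.hpp>

// Hedged sketch: record events on the stream around the enqueued work and
// measure the elapsed time between them, rather than timing the API calls.
float time_otsu_gpu(const cv::cuda::GpuMat& d_src)
{
    cv::cuda::GpuMat d_dst;
    cv::cuda::Stream stream;
    cv::cuda::Event start, stop;

    start.record(stream);
    cv::cuda::threshold(d_src, d_dst, 0, 255,
                        cv::THRESH_BINARY | cv::THRESH_OTSU, stream); // THRESH_OTSU support assumes this PR
    stop.record(stream);
    stop.waitForCompletion();

    return cv::cuda::Event::elapsedTime(start, stop); // milliseconds
}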
You should be able to reduce the allocations/deallocations by using
{
    extern __shared__ unsigned long long shared_memory_u64[];

    uint bin_idx = blockIdx.x * blockDim.x + threadIdx.x;
This is confusing. As you have 256 threads per block and 256 blocks, I would suggest the following launch parameters:
dim3 block_all = dim3(n_bins);
dim3 grid_all = dim3(n_thresholds);
dim3 block_score = dim3(n_thresholds);
dim3 grid_score = dim3(1);
with
uint bin_idx = threadIdx.x;
uint threshold = blockIdx.x;
Also, int is usually more performant than uint.
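For illustration, a hedged sketch of how the corresponding host-side launches could then read; the kernel names, argument lists and the raw stream handle are placeholders, not code from the PR.

// Hypothetical launches for the suggested mapping, using the block/grid
// dims above: one block per threshold with one thread per bin, then a
// single block with one thread per threshold to pick the best score.
cudaStream_t raw_stream = StreamAccessor::getStream(stream);

compute_threshold_variances<<<grid_all, block_all, 0, raw_stream>>>(histogram, variances);
select_best_threshold<<<grid_score, block_score, 0, raw_stream>>>(variances, otsu_threshold);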
template <uint n_bins, uint n_thresholds>
void compute_otsu_async(uint *histogram, uint *otsu_threshold, Stream &stream)
{
    CV_Assert(n_bins == 256);
I'm not sure there is any reason to template n_bins and n_thresholds because your algorithm only supports 256 for both. I would remove the template and use statically allocated shared memory everywhere.
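A hedged sketch of what dropping the template and the dynamic shared memory could look like; the names and the wrapper signature are assumptions, and the kernel body is elided.

// With both sizes fixed at 256, the template parameters disappear and the
// scratch buffer can be declared statically inside the kernel.
constexpr int N_BINS = 256;
constexpr int N_THRESHOLDS = 256;

__global__ void compute_otsu_kernel(const uint *histogram, uint *otsu_threshold)
{
    __shared__ unsigned long long shared_memory_u64[N_BINS];
    // ... same reduction body as before, using N_BINS / N_THRESHOLDS ...
}

void compute_otsu(uint *histogram, uint *otsu_threshold, Stream &stream)
{
    // No template arguments and no launch-time shared-memory size needed.
    compute_otsu_kernel<<<N_THRESHOLDS, N_BINS, 0, StreamAccessor::getStream(stream)>>>(histogram, otsu_threshold);
}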
    GpuMat gpu_threshold_sums = GpuMat(1, n_bins, CV_32SC1);
    GpuMat gpu_sums = GpuMat(1, n_bins, CV_64FC1);
    GpuMat gpu_variances = GpuMat(1, n_bins, CV_32SC4);
Can this be CV_64FC2?
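If the per-threshold statistics are really a pair of doubles, a hedged sketch of the alternative allocation (the channel interpretation is an assumption):

// Hypothetical: a two-channel double matrix instead of four packed 32-bit ints;
// each element would then be read in the kernel as a double2.
GpuMat gpu_variances = GpuMat(1, n_bins, CV_64FC2);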
// TODO: Replace this with cv::cuda::calcHist
template <uint n_bins>
__global__ void histogram_kernel(
@asmorkalov Due to a circular dependency when adding opencv_cudaimgproc to opencv_cudaarithm in CMakeLists.txt, cv::cuda::calcHist cannot be called from here. To avoid refactoring, would it make sense, at a minimum, to keep a single implementation of calcHist even if it ends up duplicated (i.e. copy the implementation from cv::cuda::calcHist here), or do you have a better suggestion?
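For reference, a hedged sketch of the kind of minimal 256-bin histogram kernel that would end up duplicated here if cv::cuda::calcHist cannot be linked; the name, signature and launch configuration are placeholders.

__global__ void local_hist_kernel(const unsigned char *src, size_t step,
                                  int rows, int cols, uint *histogram)
{
    // Per-block shared-memory histogram to keep global atomics down to 256 per block.
    __shared__ uint smem[256];
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        smem[i] = 0;
    __syncthreads();

    // Grid-stride loop over the image, accumulating into shared memory.
    for (int y = blockIdx.y; y < rows; y += gridDim.y)
        for (int x = blockIdx.x * blockDim.x + threadIdx.x; x < cols; x += blockDim.x * gridDim.x)
            atomicAdd(&smem[src[y * step + x]], 1u);
    __syncthreads();

    // Fold the per-block histograms into the global one.
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        atomicAdd(&histogram[i], smem[i]);
}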
    calcHist(src, gpu_histogram, stream);

    GpuMat gpu_otsu_threshold(1, 1, CV_32SC1);
    compute_otsu_async<n_bins, n_thresholds>(gpu_histogram.ptr<uint>(), gpu_otsu_threshold.ptr<uint>(), stream);
All kernel launches are async, so I would rename this compute_otsu.
    signed long long threshold_variance_above_i64 = (signed long long)(threshold_variance_above_f32 * bin_count);
    signed long long threshold_variance_below_i64 = (signed long long)(threshold_variance_below_f32 * bin_count);
    blockReduce<n_bins>((signed long long *)shared_memory_i64, threshold_variance_above_i64, bin_idx, plus<signed long long>());
    __syncthreads();
Is there enough shared memory to have two arrays and avoid __syncthreads() without affecting the occupancy?
The cast ((signed long long *)) is also redundant.
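A hedged sketch of the two-array variant; the buffer names and the fixed 256 size are assumptions, while blockReduce and plus are the same device helpers the PR already uses.

// Separate statically sized scratch buffers for the two reductions, so the
// explicit __syncthreads() between them (and the pointer cast) go away.
__shared__ long long smem_above[256];
__shared__ long long smem_below[256];

long long variance_above_i64 = (long long)(threshold_variance_above_f32 * bin_count);
long long variance_below_i64 = (long long)(threshold_variance_below_f32 * bin_count);

blockReduce<256>(smem_above, variance_above_i64, bin_idx, plus<long long>());
blockReduce<256>(smem_below, variance_below_i64, bin_idx, plus<long long>());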
{
    CV_Assert(depth == CV_8U);
    CV_Assert(src.channels() == 1);
    CV_Assert(maxVal == 255.0);
Does maxVal have to be 255.0?
    cv::Mat dst_gold;
    double otsu_cpu = cv::threshold(src, dst_gold, 0, 255, threshOp);

    EXPECT_NEAR(otsu_gpu, otsu_cpu, 1e-5);
If the calculated threshold is always an integer which is cast to a double, shouldn't the abs_diff be zero? Alternatively, you could use ASSERT_DOUBLE_EQ instead.
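A hedged sketch of the stricter check, using the variable names from the test above:

// The threshold is an integral value stored in a double, so an exact
// comparison should hold and the 1e-5 tolerance can be dropped.
ASSERT_DOUBLE_EQ(otsu_gpu, otsu_cpu);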
I implemented Otsu's method in CUDA for a separate project and want to add it to cv::cuda::threshold. I have made an effort to use existing OpenCV functions in my code, but I had some trouble with ThresholdTypes and cv::cuda::calcHist. I couldn't figure out how to include precomp.hpp to get the definition of ThresholdTypes. For cv::cuda::calcHist I tried adding opencv_cudaimgproc, but it creates a circular dependency on cudaarithm. I have included a simple implementation of calcHist so the code runs, but I would like input on how to use cv::cuda::calcHist instead.
Pull Request Readiness Checklist
See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request
Patch to opencv_extra has the same branch name.