The focus of this post is to determine the optimal algorithm for searching an unsorted array for a particular element: either the index of a unique element, or, if the element appears more than once, an arbitrary index at which it occurs.
On the CPU, the typical algorithm for searching an unsorted array is to check each index sequentially until the value of interest is found. This operation has O(n) complexity and on average takes n/2 comparisons to locate a unique value. While linear time complexity is already efficient, we are still limited by the sequential nature of a search on the CPU.
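For reference, a minimal sketch of this sequential search in C might look like the following (the function name is illustrative):

// Sequential CPU search: return the index of the value, or -1 if it is absent.
int cpu_Sequential_Search(const int *array, int arraySize, int value){
    for (int i = 0; i < arraySize; i++){
        if (array[i] == value){
            return i;
        }
    }
    return -1;
}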
One option is to sort the array of elements so that searching can be performed by an algorithm with a more efficient complexity. Once sorted, searching for a unique value can be done with an algorithm such as binary search, which has a very efficient O(log n) time complexity. However, this may not be preferable for a host of reasons, including memory size limitations or data immutability.
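For comparison, a minimal sketch of binary search on an already-sorted array (again, the name is illustrative):

// Binary search on a sorted array: O(log n) comparisons.
int cpu_Binary_Search(const int *array, int arraySize, int value){
    int low = 0, high = arraySize - 1;
    while (low <= high){
        int mid = low + (high - low) / 2;  // written this way to avoid overflow of (low + high)
        if (array[mid] == value) return mid;
        if (array[mid] < value)  low = mid + 1;
        else                     high = mid - 1;
    }
    return -1;
}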
Another option is to leverage the massively parallel capability of CUDA-enabled GPUs to speed up our searches. In fact, this may be the only option in cases where the array of elements already resides in the GPU device's memory; in that case, we can search for our element without any data transfer between host and device, and without modifying any data. For arrays that live in host memory rather than on the GPU device, we must pay the cost of transferring the array to the device before searching, but this may still be faster than the sequential CPU search.
The first way we can design the search algorithm is to use a strided approach. By this, we mean that if we have N elements to search, we deploy m threads (where m < N) so that each thread searches k ≈ N/m elements before returning. However, there are multiple ways to do this.
The most obvious way is to have each thread look at its k elements using strided memory accesses. This means that the first thread begins its search at index 0 and ends at index k-1, the second thread begins at index k and ends at index 2k-1, and so on. We can visualize this type of memory access in the following diagram, where each thread searches a total of k=3 elements.
Iteration 1:
thread 0   thread 1    thread 2
  |           |           |
  |           |           |
  v           v           v
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | N-1 |
Iteration 2:
   thread 0    thread 1    thread 2
      |           |           |
      |           |           |
      v           v           v
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | N-1 |
Iteration 3:
       thread 0    thread 1    thread 2
          |           |           |
          |           |           |
          v           v           v
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | N-1 |
The next question to ask is: how many elements should each thread search through? Surely, there must be an optimal number. That question is not always an easy one to answer, but what we can do is try multiple values of k and see if there is a trend.
The CUDA kernel for this approach is given as follows:
__global__ void dev_Strided_Offset_N_Search(int *array, int uniqueValue, int offset, int arraySize, int *foundIndex){
    // Calculate thread id.
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    // Initialize currentValue and actualIndex in register memory.
    int currentValue, actualIndex;
    // Iterate through this thread's offset adjacent elements.
    for (int k = 0; k < offset; k++){
        // Calculate actual array index.
        actualIndex = tid * offset + k;
        // Ensure thread is not out of bounds.
        if ( actualIndex < arraySize ) {
            // Retrieve current value from global memory to be checked.
            currentValue = array[actualIndex];
            // Check if current value is the unique value.
            if ( currentValue == uniqueValue ) {
                // Unique value found, store its index in the foundIndex global memory variable.
                *foundIndex = actualIndex;
            }
        }
    }
}
This kernel takes a pointer to the array of elements to be searched (in this case an array of integers), the value being searched for, the number of elements each thread should search (also called the offset), the number of elements in the array, and finally a pointer to an integer that will hold the index of the located element. The first few lines are standard CUDA and C commands: determining the thread's id and initializing integers used later in the function. Next, the kernel iterates its search over offset elements. On each iteration, it computes the next array index to search (actualIndex), ensures the index is within the bounds of the array, and retrieves the value stored at that index. If this value matches the element being searched for, the thread writes the index of that element to the input parameter pointer foundIndex. However, this pointer is accessible to all threads, so it may be overwritten if another instance of the element is found; for our purposes (a unique value, or any one index of a repeated value) this race is acceptable.
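As a rough sketch of how this kernel might be launched from the host (the block size and variable names here are illustrative, and host_array is assumed to already hold the data):

// Allocate device memory, copy the data over, and launch the strided kernel.
int arraySize = 500000, offset = 8, uniqueValue = 42;   // illustrative values
int *dev_array, *dev_foundIndex;
cudaMalloc((void**)&dev_array, arraySize * sizeof(int));
cudaMalloc((void**)&dev_foundIndex, sizeof(int));
cudaMemcpy(dev_array, host_array, arraySize * sizeof(int), cudaMemcpyHostToDevice);

// Seed the result with a sentinel so "not found" is detectable afterwards.
int notFound = -1;
cudaMemcpy(dev_foundIndex, &notFound, sizeof(int), cudaMemcpyHostToDevice);

// Each thread covers 'offset' elements, so ceil(arraySize / offset) threads are needed.
int numOfThreads = (arraySize + offset - 1) / offset;
int blockSize = 256;
int numBlocks = (numOfThreads + blockSize - 1) / blockSize;
dev_Strided_Offset_N_Search<<<numBlocks, blockSize>>>(dev_array, uniqueValue, offset, arraySize, dev_foundIndex);

// Copy the result back to the host.
int foundIndex;
cudaMemcpy(&foundIndex, dev_foundIndex, sizeof(int), cudaMemcpyDeviceToHost);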
While this method works, we can still improve our algorithm. Strided memory accesses, as illustrated above, are not the most efficient way to access global memory. Instead, we can reorganize the kernel above so that its global memory accesses are coalesced.
A simplified definition of a coalesced memory access is one in which all threads in a warp access adjacent memory locations at once. In the strided N search, the threads in each warp (sets of 32 threads) were not accessing adjacent data; rather, they were accessing data separated by k elements. It can be shown that the effective memory bandwidth of the GPU drops significantly as the separation between accessed elements increases.
Therefore, in order to make our algorithm faster, we want all of our threads to access memory locations that are adjacent to each other in each iteration. In the first iteration, the first thread searches index 0, the second thread searches index 1, and so on. In the next iteration, the first thread searches index m (where m is the total number of threads), the second thread searches index m+1, and so on. As before, we illustrate the first two iterations visually, with a total of m=4 threads:
Iteration 1:
tid0 tid1 tid2 tid3
  |   |    |   |
  |   |    |   |
  v   v    v   v
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | N-1 |
Iteration 2:
                tid0 tid1 tid2 tid3
                  |   |    |   |
                  |   |    |   |
                  v   v    v   v
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | N-1 |
This improvement is implemented in the following CUDA kernel:
__global__ void dev_Coalesced_N_Search(int *array, int uniqueValue, int numToCheck, int numOfThreads, int arraySize, int *foundIndex){
    // Calculate thread id.
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    // Initialize currentValue and actualIndex in register memory.
    int currentValue, actualIndex;
    // Iterate through numToCheck elements, strided by the total thread count.
    for (int k = 0; k < numToCheck; k++){
        // Calculate actual array index.
        actualIndex = numOfThreads * k + tid;
        // Ensure thread is not out of bounds.
        if ( actualIndex < arraySize ) {
            // Retrieve current value from global memory to be checked.
            currentValue = array[actualIndex];
            // Check if current value is the unique value.
            if ( currentValue == uniqueValue ) {
                // Unique value found, store its index in the foundIndex global memory variable.
                *foundIndex = actualIndex;
            }
        }
    }
}
where the only major difference is the actualIndex variable, which controls the index each thread searches in the current iteration.
By having all of the threads in a warp access adjacent memory locations, the speed of our code will improve. But again, we are faced with the question of how many threads we should launch to tackle this problem (or, equivalently, how many elements each thread should search). Again, we can run this code with a set of varying k or m values and find out.
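As a rough sketch (block size illustrative, reusing the device pointers from the earlier host-side sketch), choosing k fixes the thread count m, which is rounded up to whole blocks and passed to the kernel as its stride:

// Pick k (elements per thread); derive the thread count m so that m*k >= N.
int numToCheck = 12;   // k, illustrative value
int blockSize = 256;   // illustrative block size
int numBlocks = ((arraySize + numToCheck - 1) / numToCheck + blockSize - 1) / blockSize;
int numOfThreads = numBlocks * blockSize;   // the m actually launched, used as the stride
dev_Coalesced_N_Search<<<numBlocks, blockSize>>>(dev_array, uniqueValue, numToCheck, numOfThreads, arraySize, dev_foundIndex);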
From here, we can make another small improvement. When we iterate through the for loop in the coalesced example above, we can actually unroll this loop. Instead of writing a small set of instructions and looping over them, we write out the instructions of several iterations at once, which removes some loop-control overhead and gives a small performance benefit. Although we could do this by hand, it is much easier to add the single line #pragma unroll, which asks the compiler to do it for us. (Since the loop bound numToCheck is only known at run time, the compiler can at best partially unroll the loop.)
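To make the idea concrete, here is a hand-written sketch, inside the kernel, of roughly what an unroll factor of 4 produces (assuming for simplicity that numToCheck is a multiple of 4; the compiler also handles any remainder iterations):

// Roughly what an unroll factor of 4 produces: four searches per loop-control step.
for (int k = 0; k < numToCheck; k += 4){
    // Four copies of the loop body, one per unrolled iteration.
    actualIndex = numOfThreads * (k + 0) + tid;
    if (actualIndex < arraySize && array[actualIndex] == uniqueValue) *foundIndex = actualIndex;
    actualIndex = numOfThreads * (k + 1) + tid;
    if (actualIndex < arraySize && array[actualIndex] == uniqueValue) *foundIndex = actualIndex;
    actualIndex = numOfThreads * (k + 2) + tid;
    if (actualIndex < arraySize && array[actualIndex] == uniqueValue) *foundIndex = actualIndex;
    actualIndex = numOfThreads * (k + 3) + tid;
    if (actualIndex < arraySize && array[actualIndex] == uniqueValue) *foundIndex = actualIndex;
}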
We add this #pragma command to the coalesced N search example above:
__global__ void dev_Unrolled_Coalesced_N_Search(int *array, int uniqueValue, int numToCheck, int numOfThreads, int arraySize, int *foundIndex){
    // Calculate thread id.
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    // Initialize currentValue and actualIndex in register memory.
    int currentValue, actualIndex;
    // Iterate through numToCheck elements, strided by the total thread count,
    // asking the compiler to unroll the loop.
    #pragma unroll
    for (int k = 0; k < numToCheck; k++){
        // Calculate actual array index.
        actualIndex = numOfThreads * k + tid;
        // Ensure thread is not out of bounds.
        if ( actualIndex < arraySize ) {
            // Retrieve current value from global memory to be checked.
            currentValue = array[actualIndex];
            // Check if current value is the unique value.
            if ( currentValue == uniqueValue ) {
                // Unique value found, store its index in the foundIndex global memory variable.
                *foundIndex = actualIndex;
            }
        }
    }
}
The last way we can implement the search routine branches off of the coalesced N search method. The iterating loop in that kernel allows each thread to search multiple indices of our array, but it also adds extra overhead. The final implementation strips away all of these frivolous extras, including the iteration loop itself.
Instead, we implement the search so that each launched thread searches only one element (the one whose index matches the thread's id) before returning. This strips away as many extra machine instructions as possible and, in doing so, gives us the simplest solution as well. Because the grid is rounded up to whole blocks, the kernel must still check that its thread id is within the bounds of the array.
__global__ void dev_Full_Coalesced_Search(int *array, int uniqueValue, int arraySize, int *foundIndex){
    // Calculate thread id.
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    // Ensure thread is not out of bounds.
    if ( tid < arraySize ) {
        // Retrieve current value from global memory to be checked.
        int currentValue = array[tid];
        // Check if current value is the unique value.
        if ( currentValue == uniqueValue ) {
            // Unique value found, store its index in the foundIndex global memory variable.
            *foundIndex = tid;
        }
    }
}
This type of solution is elegant in that it is about as simple as a CUDA kernel can be. Sometimes, however, we find that it is better to load kernels up with a slightly higher workload per thread, and other times it is not.
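As a sketch (block size again illustrative), launching this kernel only requires enough threads to cover the array; the bounds check inside the kernel keeps the threads in the final, partially-used block from reading past the end:

// One thread per element: round the grid up to cover all arraySize elements.
int blockSize = 256;
int numBlocks = (arraySize + blockSize - 1) / blockSize;
dev_Full_Coalesced_Search<<<numBlocks, blockSize>>>(dev_array, uniqueValue, arraySize, dev_foundIndex);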
We run a test of all four of these implementations, given in the file Test_Search_Unique_Value_CUDA.cu, which is compiled with the included Makefile. For the strided, coalesced, and unrolled-coalesced implementations, we vary the number of elements each thread searches from 1 to 64. In addition, we run the full-coalesced implementation 64 times (not varying any parameter) to get a solid baseline. All test cases search an array of 500,000 integers for a unique value.
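The actual harness lives in Test_Search_Unique_Value_CUDA.cu; a minimal sketch of how one such timing measurement can be taken with CUDA events (reusing the launch parameters from the sketch above) looks like:

// Time a single kernel launch with CUDA events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
dev_Full_Coalesced_Search<<<numBlocks, blockSize>>>(dev_array, uniqueValue, arraySize, dev_foundIndex);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float milliseconds = 0.0f;
cudaEventElapsedTime(&milliseconds, start, stop);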
We can see that the strided test runs were by far the worst (higher runtime is worse), especially as the number of elements each thread checks increases; this is due to the growing separation between memory accesses.
Next, we see that the coalesced and unrolled-coalesced implementations are the next fastest algorithms, with the #pragma unroll command providing a slight edge in speed.
Finally, we can see that the full-coalesced method (one thread per element) is consistently the fastest method (except perhaps in one small region).
While building code like this, the easiest choice may be the full-coalesced method, given the simplicity of its one-thread-one-element design. Its performance relative to the other methods, combined with that simplicity, could very well make it the best method for your code. However, to fully explore these algorithms, let us do one more test. We are interested in the regime where the full-coalesced method and the unrolled-coalesced method seem to overlap.
For the final test, we set the unrolled-coalesced method's number of elements to search to the value that produced the minimum runtime in the previous plot, k=12, and rerun the test 50 times.
Comparing the optimally-parameterized unrolled-coalesced method to the full-coalesced method, we see that the unrolled method is actually the fastest when the number of elements each thread searches is near k=12.