-
Hi @zfergus, I was recently learning a bit more about parallelizing contact simulation on the GPU, specifically focused on broad-phase collision candidate determination. Your related article Time of Impact Dataset for Continuous Collision Detection and a Scalable Conservative Algorithm, as well as this public repository, were both quite helpful to me. I wanted to ask about the reasoning behind STQ mapping better to the GPU than SAP, as I'm a bit confused about this. In both approaches, the preprocessing and sorting phases are the same, i.e., we build the axis-aligned bounding boxes and sort them along the sweep axis.
The difference between STQ and SAP is the sweep phase, and in both approaches we algorithmically associate one thread per bounding box. Do you mind elaborating on why STQ is a more favorable approach on the GPU versus SAP? I'm sure I'm missing something; there must be a reason that using a warp-wide shared queue is preferable.
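For concreteness, the shared setup I have in mind looks roughly like this (a minimal sketch with a placeholder Aabb struct and names of my own, not the actual types in this repository):

```cpp
// Illustrative only: a placeholder Aabb type and a Thrust sort along the sweep
// axis. Both SAP and STQ start from the same sorted list of boxes.
#include <thrust/device_vector.h>
#include <thrust/sort.h>

struct Aabb {
    float min[3], max[3];
    int id;  // index of the originating primitive
};

struct CompareMinX {
    __host__ __device__ bool operator()(const Aabb& a, const Aabb& b) const {
        return a.min[0] < b.min[0];  // order by lower corner on the sweep axis
    }
};

void sort_boxes(thrust::device_vector<Aabb>& boxes) {
    thrust::sort(boxes.begin(), boxes.end(), CompareMinX());
}
```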
-
Hello,

Thank you for your interest in our work. The different CUDA implementations can be found here. The difference between SAP and STQ on the GPU is how the work is divided between the threads.

In SAP, much like on the CPU, each thread is assigned a box, and it iterates until it finds the next box in the sorted list that does not intersect. This has the limitation that the workload can be unbalanced between threads: e.g., one thread may only find 1 intersection while another may find 100. On the CPU this works well, as the thread with little work can be freed and assigned a new box. However, on the GPU the thread with little work has to wait for all threads in its block to finish before being assigned a new box to process.

In contrast, STQ uses a shared queue of boxes to examine: each thread pulls a single box from this queue, processes it, and (possibly) pushes a single box onto the queue to be processed in the next iteration. By processing boxes in this way, each thread is assigned the same amount of work. The only downside is that threads need to sync up at times to make sure the queue's size is not subject to race conditions.

Hopefully this clears up the difference between the two algorithms. Feel free to follow up with any other questions; we hope to have an updated publication to share soon.

Best,
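To make the division of work concrete, here is a toy sketch of what the two sweep kernels could look like (illustrative only: the Aabb struct, names, and fixed queue capacity are simplifications, not the actual kernels in this repository):

```cpp
// Toy sketch only: hypothetical Aabb layout and a fixed-size shared queue,
// not the repository's actual data structures or kernels.
#include <cuda_runtime.h>

struct Aabb {
    float min[3], max[3];
    int id;
};

__device__ bool overlaps_yz(const Aabb& a, const Aabb& b) {
    return a.min[1] <= b.max[1] && b.min[1] <= a.max[1] &&
           a.min[2] <= b.max[2] && b.min[2] <= a.max[2];
}

// SAP: one thread per box. The loop length depends on how many boxes overlap
// box i on the sweep axis, so per-thread work can be very uneven.
__global__ void sap_sweep(const Aabb* boxes, int n,
                          int2* out, int* out_count, int out_capacity) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    const Aabb a = boxes[i];
    for (int j = i + 1; j < n && boxes[j].min[0] <= a.max[0]; ++j) {
        if (overlaps_yz(a, boxes[j])) {
            int k = atomicAdd(out_count, 1);
            if (k < out_capacity) out[k] = make_int2(a.id, boxes[j].id);
        }
    }
}

// STQ: the unit of work is "advance one (i, j) pair by one position", so every
// thread does one unit per pass regardless of how many overlaps its box has.
// Double-buffered shared queue; a real implementation must also handle queue
// overflow instead of silently dropping pairs as this sketch does.
#define QUEUE_CAPACITY 256

__global__ void stq_sweep(const Aabb* boxes, int n,
                          int2* out, int* out_count, int out_capacity) {
    __shared__ int2 queue[2][QUEUE_CAPACITY];
    __shared__ int queue_size[2];

    if (threadIdx.x == 0) { queue_size[0] = 0; queue_size[1] = 0; }
    __syncthreads();

    // Seed: each thread starts the chain of one box with its right neighbour.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int cur = 0;
    if (i < n - 1) {
        int slot = atomicAdd(&queue_size[cur], 1);
        if (slot < QUEUE_CAPACITY) queue[cur][slot] = make_int2(i, i + 1);
    }
    __syncthreads();

    while (queue_size[cur] > 0) {
        int next = 1 - cur;
        int size = min(queue_size[cur], QUEUE_CAPACITY);

        // Pop at most one pair per thread per pass, push at most one follow-up.
        for (int q = threadIdx.x; q < size; q += blockDim.x) {
            int2 p = queue[cur][q];
            const Aabb a = boxes[p.x];
            const Aabb b = boxes[p.y];
            if (b.min[0] <= a.max[0]) {          // chain still alive on sweep axis
                if (overlaps_yz(a, b)) {
                    int k = atomicAdd(out_count, 1);
                    if (k < out_capacity) out[k] = make_int2(a.id, b.id);
                }
                if (p.y + 1 < n) {               // enqueue the next step of the chain
                    int slot = atomicAdd(&queue_size[next], 1);
                    if (slot < QUEUE_CAPACITY)
                        queue[next][slot] = make_int2(p.x, p.y + 1);
                }
            }
        }
        __syncthreads();
        if (threadIdx.x == 0) queue_size[cur] = 0;  // recycle the old buffer
        cur = next;
        __syncthreads();
    }
}
```

Note how in the STQ sketch each thread advances at most one (i, j) pair per pass, while in the SAP sketch the per-thread loop length depends entirely on how many boxes overlap box i on the sweep axis.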
-
Thanks for the answer, and thanks for sharing your insights! In fact, I had already read this code, but my confusion comes from the fact that I don't see where the workload is being distributed. I understand that the advantage of STQ has to be work sharing; it's just that I'm not seeing where exactly that happens in the kernel function. I'm under the impression that if you have your 32 threads in a warp and only 1 thread has overlaps, say 32 of them, it will only add them to the shared queue one at a time, no? The advantage here has to be that these 32 overlaps somehow end up being distributed over the 32 threads, but I'm just not seeing where that distribution happens in the code. I think it's just a code-parsing problem that I'm having.
-
In this case, I think it might be a good idea to slightly modify the sweep kernel to first fill up the shared queue and then proceed with the sweep as usual. So instead of this, which can result in a very small initial queue that can never grow in size, I would keep filling the queue up to its maximum capacity (as much as possible) before diving into the next phase, as in the sketch below. What do you think?
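Roughly what I have in mind is something like this hypothetical pre-fill helper (same kind of placeholder Aabb and shared-queue layout as the sketches above, so just an illustration of the idea, not a drop-in change to the actual kernel):

```cpp
// Hypothetical pre-fill helper (Aabb as in the earlier sketches): each thread
// extends the chain of its own box until the shared queue is full or the chain
// ends, so the first processing pass starts with as much work as the queue can
// hold. Returns the first j that was NOT enqueued, so the caller can resume
// the chain from there and avoid pushing duplicate pairs later.
__device__ int prefill_chain(const Aabb* boxes, int n, int i,
                             int2* queue, int* queue_size, int capacity) {
    int j = i + 1;
    while (j < n && boxes[j].min[0] <= boxes[i].max[0]) {
        int slot = atomicAdd(queue_size, 1);
        if (slot >= capacity) {
            atomicSub(queue_size, 1);  // undo the overshoot: queue is full
            break;
        }
        queue[slot] = make_int2(i, j);
        ++j;
    }
    return j;
}
```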
-
Apologies for the flood of comments. I'm still confused. Specifically, I'm referring to this part of the code. A single thread can only push a single item onto the shared queue between warp synchronizations. This means that in our example of uneven thread workload, where 31 threads have 1 overlap each and 1 thread has 100 overlaps, the shared queue starts out with 32 overlap candidates. Then, when it comes time to process the next overlap candidates, the 31 threads will not push anything onto the shared queue, while the 1 thread that has 99 remaining overlaps to process will only push a single one of them. At the next iteration, the shared queue has a single item (overlap candidate) in it, which will be popped and processed by a single thread in the warp. That thread can then only push a single next item onto the queue, with 98 remaining overlaps to process, and the process repeats, decrementing 98 to 97 and so on sequentially between warp synchronizations. A small model of what I mean is below. Am I missing something?
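To sanity-check my reading, here's a tiny host-side model of that behaviour (purely hypothetical; it just mirrors the pop-one/push-at-most-one pattern I described above, not the actual kernel):

```cpp
// Host-side model of the scenario above: 32 chains, 31 of length 1 and one of
// length 100; each dequeued chain re-enqueues at most one follow-up candidate
// per iteration.
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> queue(31, 1);  // 31 chains with a single candidate each
    queue.push_back(100);           // the one heavy chain

    int iterations = 0;
    while (!queue.empty()) {
        std::vector<int> next;
        // The whole queue fits in one warp-wide pass here (it never exceeds
        // 32 items); a chain is re-enqueued only if it has candidates left.
        for (int remaining : queue)
            if (remaining - 1 > 0)
                next.push_back(remaining - 1);
        queue.swap(next);
        ++iterations;
    }
    // Prints 100: after the first pass the queue holds a single item, and the
    // heavy chain drains one candidate per iteration from then on.
    std::printf("iterations = %d\n", iterations);
    return 0;
}
```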