-
Hi @zfergus, I was recently learning a bit more about parallelizing contact simulation on the GPU, specifically focused on broad-phase collision candidate determination. Your related article Time of Impact Dataset for Continuous Collision Detection and a Scalable Conservative Algorithm, as well as this public repository, were both quite helpful to me. I wanted to ask about the reasoning behind STQ mapping better to the GPU than SAP, as I'm a bit confused about this. In both approaches, the preprocessing and sorting phases are the same, i.e., we build the axis-aligned bounding boxes and sort them along the sweep axis.
The difference between STQ and SAP is the sweep phase, and in both approaches we algorithmically associate one thread per bounding box. Do you mind elaborating on why STQ is a more favorable approach on the GPU versus SAP? I'm sure I'm missing something; there must be a reason that using a warp-wide shared queue is preferable.
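For concreteness, the shared setup I have in mind looks roughly like this (a minimal sketch with a placeholder Aabb struct and names of my own, not the actual types in this repository):

```cpp
// Illustrative only: a placeholder Aabb type and a Thrust sort along the sweep
// axis. Both SAP and STQ start from the same sorted list of boxes.
#include <thrust/device_vector.h>
#include <thrust/sort.h>

struct Aabb {
    float min[3], max[3];
    int id;  // index of the originating primitive
};

struct CompareMinX {
    __host__ __device__ bool operator()(const Aabb& a, const Aabb& b) const {
        return a.min[0] < b.min[0];  // order by lower corner on the sweep axis
    }
};

void sort_boxes(thrust::device_vector<Aabb>& boxes) {
    thrust::sort(boxes.begin(), boxes.end(), CompareMinX());
}
```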
-
Hello,

Thank you for your interest in our work. The different CUDA implementations can be found here. The difference between SAP and STQ on the GPU is how the work is divided between the threads.

In SAP, much like on the CPU, each thread is assigned a box, and it iterates until it finds the next box in the sorted list that does not intersect. This has the limitation that the workload can be unbalanced between threads: e.g., one thread may only find 1 intersection while another may find 100. On the CPU this works well, as the thread with little work can be freed and assigned a new box. However, on the GPU the thread with little work has to wait for all threads in its block to finish before being assigned a new box to process.

In contrast, STQ uses a shared queue of boxes to examine: each thread pulls a single box from this queue, processes it, and (possibly) pushes a single box onto the queue to be processed in the next iteration. By processing boxes in this way, each thread is assigned the same amount of work. The only downside is that threads need to sync up at times to make sure the queue's size is not subject to race conditions.

Hopefully this clears up the difference between the two algorithms. Feel free to follow up with any other questions; we hope to have an updated publication to share soon.

Best,
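To make the division of work concrete, here is a toy sketch of what the two sweep kernels could look like (illustrative only: the Aabb struct, names, and fixed queue capacity are simplifications, not the actual kernels in this repository):

```cpp
// Toy sketch only: hypothetical Aabb layout and a fixed-size shared queue,
// not the repository's actual data structures or kernels.
#include <cuda_runtime.h>

struct Aabb {
    float min[3], max[3];
    int id;
};

__device__ bool overlaps_yz(const Aabb& a, const Aabb& b) {
    return a.min[1] <= b.max[1] && b.min[1] <= a.max[1] &&
           a.min[2] <= b.max[2] && b.min[2] <= a.max[2];
}

// SAP: one thread per box. The loop length depends on how many boxes overlap
// box i on the sweep axis, so per-thread work can be very uneven.
__global__ void sap_sweep(const Aabb* boxes, int n,
                          int2* out, int* out_count, int out_capacity) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    const Aabb a = boxes[i];
    for (int j = i + 1; j < n && boxes[j].min[0] <= a.max[0]; ++j) {
        if (overlaps_yz(a, boxes[j])) {
            int k = atomicAdd(out_count, 1);
            if (k < out_capacity) out[k] = make_int2(a.id, boxes[j].id);
        }
    }
}

// STQ: the unit of work is "advance one (i, j) pair by one position", so every
// thread does one unit per pass regardless of how many overlaps its box has.
// Double-buffered shared queue; a real implementation must also handle queue
// overflow instead of silently dropping pairs as this sketch does.
#define QUEUE_CAPACITY 256

__global__ void stq_sweep(const Aabb* boxes, int n,
                          int2* out, int* out_count, int out_capacity) {
    __shared__ int2 queue[2][QUEUE_CAPACITY];
    __shared__ int queue_size[2];

    if (threadIdx.x == 0) { queue_size[0] = 0; queue_size[1] = 0; }
    __syncthreads();

    // Seed: each thread starts the chain of one box with its right neighbour.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int cur = 0;
    if (i < n - 1) {
        int slot = atomicAdd(&queue_size[cur], 1);
        if (slot < QUEUE_CAPACITY) queue[cur][slot] = make_int2(i, i + 1);
    }
    __syncthreads();

    while (queue_size[cur] > 0) {
        int next = 1 - cur;
        int size = min(queue_size[cur], QUEUE_CAPACITY);

        // Pop at most one pair per thread per pass, push at most one follow-up.
        for (int q = threadIdx.x; q < size; q += blockDim.x) {
            int2 p = queue[cur][q];
            const Aabb a = boxes[p.x];
            const Aabb b = boxes[p.y];
            if (b.min[0] <= a.max[0]) {          // chain still alive on sweep axis
                if (overlaps_yz(a, b)) {
                    int k = atomicAdd(out_count, 1);
                    if (k < out_capacity) out[k] = make_int2(a.id, b.id);
                }
                if (p.y + 1 < n) {               // enqueue the next step of the chain
                    int slot = atomicAdd(&queue_size[next], 1);
                    if (slot < QUEUE_CAPACITY)
                        queue[next][slot] = make_int2(p.x, p.y + 1);
                }
            }
        }
        __syncthreads();
        if (threadIdx.x == 0) queue_size[cur] = 0;  // recycle the old buffer
        cur = next;
        __syncthreads();
    }
}
```

Note how in the STQ sketch each thread advances at most one (i, j) pair per pass, while in the SAP sketch the per-thread loop length depends entirely on how many boxes overlap box i on the sweep axis.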
-
Thanks for the answer, and thanks for sharing your insights! In fact, I had already read this code, but my confusion comes from the fact that I don't see where the workload is being distributed. I understand that the advantage of STQ has to be work sharing; it's just that I'm not seeing where exactly that happens in the kernel function. I'm under the impression that if you have your 32 threads in a warp and only 1 thread has overlaps, say 32 of them, it will only add them to the shared queue one at a time, no? The advantage here has to be that these 32 overlaps somehow end up being distributed over the 32 threads, but I'm just not seeing where that distribution happens in the code. I think it's just a code-parsing problem that I'm having.
-
In this case, I think it might be a good idea to slightly modify the sweep kernel to first fill up the shared queue and then proceed with the sweep as usual. So instead of this, which can result in a very small initial queue that can never grow in size, I would keep filling the queue up to its maximum capacity (as much as possible) before diving into the next phase, as in the sketch below. What do you think?
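Roughly what I have in mind is something like this hypothetical pre-fill helper (same kind of placeholder Aabb and shared-queue layout as the sketches above, so just an illustration of the idea, not a drop-in change to the actual kernel):

```cpp
// Hypothetical pre-fill helper (Aabb as in the earlier sketches): each thread
// extends the chain of its own box until the shared queue is full or the chain
// ends, so the first processing pass starts with as much work as the queue can
// hold. Returns the first j that was NOT enqueued, so the caller can resume
// the chain from there and avoid pushing duplicate pairs later.
__device__ int prefill_chain(const Aabb* boxes, int n, int i,
                             int2* queue, int* queue_size, int capacity) {
    int j = i + 1;
    while (j < n && boxes[j].min[0] <= boxes[i].max[0]) {
        int slot = atomicAdd(queue_size, 1);
        if (slot >= capacity) {
            atomicSub(queue_size, 1);  // undo the overshoot: queue is full
            break;
        }
        queue[slot] = make_int2(i, j);
        ++j;
    }
    return j;
}
```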
-
Apologies for the flood of comments. I'm still confused. Specifically, I'm referring to this part of the code. A single thread can only push a single item onto the shared queue between warp synchronizations. This means that in our example of uneven thread workload, where 31 threads have 1 overlap each and 1 thread has 100 overlaps, the shared queue starts out with 32 overlap candidates. Then, when it comes time to process the next overlap candidates, the 31 threads will not push anything onto the shared queue, while the 1 thread that has 99 remaining overlaps to process will only push a single one of them. At the next iteration, the shared queue has a single item (overlap candidate) in it, which will be popped and processed by a single thread in the warp. That thread can then only push a single next item onto the queue, with 98 remaining overlaps to process, and the process repeats, decrementing 98 to 97 and so on sequentially between warp synchronizations. A small model of what I mean is below. Am I missing something?
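To sanity-check my reading, here's a tiny host-side model of that behaviour (purely hypothetical; it just mirrors the pop-one/push-at-most-one pattern I described above, not the actual kernel):

```cpp
// Host-side model of the scenario above: 32 chains, 31 of length 1 and one of
// length 100; each dequeued chain re-enqueues at most one follow-up candidate
// per iteration.
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> queue(31, 1);  // 31 chains with a single candidate each
    queue.push_back(100);           // the one heavy chain

    int iterations = 0;
    while (!queue.empty()) {
        std::vector<int> next;
        // The whole queue fits in one warp-wide pass here (it never exceeds
        // 32 items); a chain is re-enqueued only if it has candidates left.
        for (int remaining : queue)
            if (remaining - 1 > 0)
                next.push_back(remaining - 1);
        queue.swap(next);
        ++iterations;
    }
    // Prints 100: after the first pass the queue holds a single item, and the
    // heavy chain drains one candidate per iteration from then on.
    std::printf("iterations = %d\n", iterations);
    return 0;
}
```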