Thanks for providing this amazing tutorial! I have recently been working on reduce0, and I found that I can double the performance of the reduce_v0_baseline.cu kernel simply by changing blockDim.x to THREAD_PER_BLOCK in the for loop.
before
profile result:
after
profile result:
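I guess this is because of loop unrolling? THREAD_PER_BLOCK is a compile-time constant, so the compiler knows the loop's trip count, whereas blockDim.x is only known at launch time. It's quite interesting that such a simple change makes a big difference.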
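For anyone who wants to reproduce this, here is a minimal sketch of the two variants side by side. The kernel bodies are my assumption of what an interleaved-addressing baseline reduction typically looks like, not a copy of reduce_v0_baseline.cu. The value `THREAD_PER_BLOCK = 256`, the kernel names, and the host helper `block_sums_total` are all hypothetical and only there for illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define THREAD_PER_BLOCK 256  // assumed value; the tutorial may use another

// "Before" variant: the loop bound is blockDim.x, known only at launch time.
__global__ void reduce_runtime_bound(const float *d_in, float *d_out) {
    __shared__ float sdata[THREAD_PER_BLOCK];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = d_in[i];
    __syncthreads();
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) d_out[blockIdx.x] = sdata[0];
}

// "After" variant: the loop bound is a compile-time constant, so the
// compiler is free to unroll the loop completely.
__global__ void reduce_constant_bound(const float *d_in, float *d_out) {
    __shared__ float sdata[THREAD_PER_BLOCK];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = d_in[i];
    __syncthreads();
    for (unsigned int s = 1; s < THREAD_PER_BLOCK; s *= 2) {
        if (tid % (2 * s) == 0) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) d_out[blockIdx.x] = sdata[0];
}

// Hypothetical host helper: reduces num_blocks * THREAD_PER_BLOCK ones and
// returns the sum of the per-block results (error checking omitted).
float block_sums_total(bool use_constant_bound, int num_blocks) {
    const int n = num_blocks * THREAD_PER_BLOCK;
    std::vector<float> h_in(n, 1.0f);
    std::vector<float> h_out(num_blocks, 0.0f);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, num_blocks * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    if (use_constant_bound) {
        reduce_constant_bound<<<num_blocks, THREAD_PER_BLOCK>>>(d_in, d_out);
    } else {
        reduce_runtime_bound<<<num_blocks, THREAD_PER_BLOCK>>>(d_in, d_out);
    }
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(h_out.data(), d_out, num_blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    float total = 0.0f;
    for (float v : h_out) total += v;
    return total;
}
```

Both kernels should produce the same sums as long as they are launched with exactly THREAD_PER_BLOCK threads per block, so the only difference is what the compiler knows about the loop. Once the loop is unrolled, `s` is a known power of two in each step, so the compiler may also be able to turn `tid % (2 * s)` into a cheap bit mask. That is my guess rather than something I have confirmed; comparing the SASS of the two kernels (for example with `cuobjdump -sass`) would be one way to check.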