Commit 1734b2b

Use threadfence_block instead of sync_threads inside of chacha_blocks!
Use threadfence_block to ensure that memory writes performed by the parallelized ChaCha rounds are observed by all threads in the same block before switching from columnar to diagonal rounds, or from diagonal to columnar rounds. This function previously used sync_threads to accomplish the same goal, which appears to have added roughly 2x overhead.
1 parent a0c08a1 commit 1734b2b

1 file changed: +2 -2 lines changed

src/ChaCha.jl

Lines changed: 2 additions & 2 deletions
@@ -220,11 +220,11 @@ function _cuda_chacha_rounds!(state, doublerounds)
     for _ = 1:doublerounds
         # Columnar rounds
         _QR!(state_slice, i, i + 4, i + 8, i + 12)
-        CUDA.sync_threads()
+        CUDA.threadfence_block()
 
         # Diagonal rounds
         _QR!(state_slice, dgc1, dgc2, dgc3, dgc4)
-        CUDA.sync_threads()
+        CUDA.threadfence_block()
     end
 
     nothing
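
To illustrate why ordering matters between the two phases, here is a minimal CPU-only sketch of a ChaCha double round in plain Julia. It does not model GPU synchronization at all. The names `quarter_round!`, `column_indices`, `diagonal_indices`, and `double_round!` are hypothetical and are not the package's `_QR!` or its index variables; the diagonal index formula is the standard ChaCha layout translated to 1-based indexing, which is assumed (not confirmed by the diff) to correspond to `dgc1` through `dgc4`.

```julia
# Hypothetical CPU-only helper, not the package's `_QR!`.
# Standard ChaCha quarter round on four 1-based indices of a 16-word state.
# UInt32 addition wraps modulo 2^32; `bitrotate` with positive k rotates left.
function quarter_round!(s::Vector{UInt32}, a, b, c, d)
    s[a] += s[b]; s[d] = bitrotate(s[d] ⊻ s[a], 16)
    s[c] += s[d]; s[b] = bitrotate(s[b] ⊻ s[c], 12)
    s[a] += s[b]; s[d] = bitrotate(s[d] ⊻ s[a], 8)
    s[c] += s[d]; s[b] = bitrotate(s[b] ⊻ s[c], 7)
    return s
end

# Column i of the 4x4 state, matching `i, i + 4, i + 8, i + 12` in the diff.
column_indices(i) = (i, i + 4, i + 8, i + 12)

# Diagonal starting at column i (assumed equivalent of dgc1..dgc4).
diagonal_indices(i) =
    (i, mod1(i + 1, 4) + 4, mod1(i + 2, 4) + 8, mod1(i + 3, 4) + 12)

# One double round: four column quarter rounds, then four diagonal ones.
function double_round!(s::Vector{UInt32})
    for i in 1:4
        quarter_round!(s, column_indices(i)...)
    end
    for i in 1:4
        quarter_round!(s, diagonal_indices(i)...)
    end
    return s
end
```

Within a phase, the four quarter rounds touch disjoint words, so they can run in parallel. Across phases they do not: `diagonal_indices(1)` is `(1, 6, 11, 16)`, and words 6, 11, and 16 were last written by the column quarter rounds for `i = 2, 3, 4`. If each quarter round is handled by a different thread, as the per-thread indices in the diff suggest, every thread reads other threads' writes when the phase changes.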
