The high-level view of the design is:

(different operators require different arguments, and therefore different
types and amounts of shmem).
- Recursively fill the shmem for all `StencilBroadcasted`. This is done
  by reading the argument data from `getidx`. See the section below for more
  details, and the sketch just after this list.
- The destination field is filled with the result of `getidx` (as it is without
  shmem), except that we overload `getidx` (for supported `StencilBroadcasted`
  types) to retrieve the result of `getidx` via `fd_operator_evaluate`, which
  retrieves the result from the shmem, instead of global memory.
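
As a mental model for these two steps, here is a minimal, self-contained CPU
sketch. Everything in it is a hypothetical stand-in chosen for illustration
(not ClimaCore's actual types or functions): ordinary `Vector`s play the role
of shmem, `LeafNode`, `PointwiseNode`, and `StencilNode` play the roles of
fields, `Broadcasted`, and `StencilBroadcasted`, and `read_at`/`fill_shmem!`
mimic `getidx`/`fd_operator_fill_shmem!`.

```julia
# Toy expression tree: each "stencil" node caches its *argument* data in a
# buffer (the shmem stand-in), filled bottom-up before anything reads it.
abstract type Node end

struct LeafNode <: Node            # stands in for a plain field argument
    data::Vector{Float64}
end

struct PointwiseNode <: Node       # stands in for a `Broadcasted` node
    f::Function
    args::Tuple
end

struct StencilNode <: Node         # stands in for a `StencilBroadcasted` node
    op::Function                   # stencil: combines neighbouring cached values
    arg::Node
    shmem::Vector{Float64}         # stands in for this operator's shmem
end

# Reading a value at point `i` (the analogue of `getidx`):
read_at(n::LeafNode, i) = n.data[i]                            # "global memory" read
read_at(n::PointwiseNode, i) = n.f(map(a -> read_at(a, i), n.args)...)
# For a stencil node, apply the stencil to shmem-resident argument data
# (the analogue of the overloaded `getidx` calling `fd_operator_evaluate`):
read_at(n::StencilNode, i) = n.op(n.shmem, i)

# Fill shmem leaves-first: each stencil node reads its argument once per point
# and caches it, so the stencil can later read neighbouring values from the
# cache instead of re-reading them from global memory.
fill_shmem!(::LeafNode) = nothing
fill_shmem!(n::PointwiseNode) = foreach(fill_shmem!, n.args)
function fill_shmem!(n::StencilNode)
    fill_shmem!(n.arg)                                         # children first
    for i in eachindex(n.shmem)
        n.shmem[i] = read_at(n.arg, i)                         # one read per point
    end
end
```

In this picture, `fill_shmem!` plays the role of `fd_operator_fill_shmem!`, and
the `StencilNode` method of `read_at` plays the role of the shmem-aware
`getidx` method that calls `fd_operator_evaluate`.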

### Populating shared memory, and memory access safety

We use tail recursion when filling the shared memory of the broadcast
expressions. That is, we visit the leaves of the broadcast expression first,
then work our way up. It's important to note that `StencilBroadcasted` and
`Broadcasted` objects can be interleaved.

Let's take `DivergenceF2C()(f * GradientC2F()(a * b))` as an example (depicted
in the image below).

Recursion must go through the entire expression in order to ensure that we've
reached all of the leaves of the `StencilBroadcasted` objects (otherwise, we
could introduce race conditions with memory access). The leaves of the
`StencilBroadcasted` will call `getidx`, below which there are (by definition)
no more `StencilBroadcasted`, and those `getidx` calls will read from global
memory. All subsequent reads will be from shmem (as they will be caught by the
`getidx(parent_space, bc::StencilBroadcasted{CUDAWithShmemColumnStencilStyle}, idx, hidx)`
method defined in the `ClimaCoreCUDAExt` module).

In the diagram below, we traverse and fill the yellow-highlighted sections
(bottom first and top last). The algorithmic impact of using shared memory is
that the duplicate global memory reads (highlighted in red circles) become one
global memory read (performed in `fd_operator_fill_shmem!`).
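
To make the traversal concrete, the following is a hand-unrolled, CPU-only
walk-through of the example above. It is only a sketch under simplifying
assumptions: plain `Vector`s stand in for shmem, naive differences on a uniform
grid (spacing `dz`) stand in for the real `GradientC2F`/`DivergenceF2C`
stencils, boundaries are simply skipped, and the sizes/staggering are
illustrative.

```julia
n  = 8
a  = rand(n); b = rand(n)      # center-valued arguments of the innermost product
f  = rand(n + 1)               # face-valued argument of the outer product
dz = 1.0                       # uniform grid spacing for the toy differences

# (1) Innermost `StencilBroadcasted` (the gradient): its argument `a * b` is
#     read from "global memory" exactly once per point and cached.
grad_arg_shmem = [a[i] * b[i] for i in 1:n]

# (2) Outer `StencilBroadcasted` (the divergence): its argument
#     `f * GradientC2F()(a * b)` is cached at faces; the gradient part reads
#     *neighbouring* values from the cache above, not from global memory.
#     (Interior faces only; the real operators apply boundary conditions.)
div_arg_shmem = zeros(n + 1)
for i in 2:n
    grad_at_face = (grad_arg_shmem[i] - grad_arg_shmem[i - 1]) / dz
    div_arg_shmem[i] = f[i] * grad_at_face
end

# (3) Destination: the divergence is evaluated from its cached argument,
#     again reading only neighbouring cached values (interior points only).
result = zeros(n)
for i in 2:(n - 1)
    result[i] = (div_arg_shmem[i + 1] - div_arg_shmem[i]) / dz
end
```

Note how `a`, `b`, and `f` are each read once per point, while all of the
repeated neighbour accesses hit the cached (shmem) arrays.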

Finally, it's important to note that threads must be synchronized after each
node in the tree is filled, to avoid race conditions for subsequent
`getidx(parent_space, bc::StencilBroadcasted{CUDAWithShmemColumnStencilStyle}, idx, hidx)`
calls (whose results are retrieved from shmem).

![](shmem_diagram_example.png)
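
As a concrete illustration of the synchronization requirement noted above,
here is a small, self-contained CUDA.jl toy kernel (not ClimaCore code) in the
same spirit: every thread writes one value into shared memory, and
`sync_threads()` must run before any thread reads its neighbours' entries;
without it, the centered difference could read slots that have not been filled
yet.

```julia
using CUDA

function centered_diff_kernel!(out, f, dx)
    i = threadIdx().x
    n = blockDim().x
    shmem = CuDynamicSharedArray(Float64, n)
    @inbounds shmem[i] = f[i]       # one global-memory read per point
    sync_threads()                  # all of shmem is filled before anyone reuses it
    @inbounds if 1 < i < n
        out[i] = (shmem[i + 1] - shmem[i - 1]) / (2 * dx)
    end
    return nothing
end

n   = 64
f   = CUDA.rand(Float64, n)
out = CUDA.zeros(Float64, n)
@cuda threads=n shmem=n*sizeof(Float64) centered_diff_kernel!(out, f, 1.0)
```

This is the same kind of barrier that, in the tree traversal above, separates
filling one node's shmem from the `getidx` calls of the node above it.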