The high-level view of the design is:

(different operators require different arguments, and therefore different
types and amounts of shmem).
- Recursively fill the shmem for all `StencilBroadcasted`. This is done
  by reading the argument data from `getidx`. See the section below for more
  details, and the sketch just after this list.
- The destination field is filled with the result of `getidx` (as it is without
  shmem), except that we overload `getidx` (for supported `StencilBroadcasted`
  types) to retrieve the result of `getidx` via `fd_operator_evaluate`, which
  retrieves the result from the shmem, instead of global memory.
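
As a mental model for these two steps, here is a minimal, self-contained CPU
sketch. Everything in it is a hypothetical stand-in chosen for illustration
(not ClimaCore's actual types or functions): ordinary `Vector`s play the role
of shmem, `LeafNode`, `PointwiseNode`, and `StencilNode` play the roles of
fields, `Broadcasted`, and `StencilBroadcasted`, and `read_at`/`fill_shmem!`
mimic `getidx`/`fd_operator_fill_shmem!`.

```julia
# Toy expression tree: each "stencil" node caches its *argument* data in a
# buffer (the shmem stand-in), filled bottom-up before anything reads it.
abstract type Node end

struct LeafNode <: Node            # stands in for a plain field argument
    data::Vector{Float64}
end

struct PointwiseNode <: Node       # stands in for a `Broadcasted` node
    f::Function
    args::Tuple
end

struct StencilNode <: Node         # stands in for a `StencilBroadcasted` node
    op::Function                   # stencil: combines neighbouring cached values
    arg::Node
    shmem::Vector{Float64}         # stands in for this operator's shmem
end

# Reading a value at point `i` (the analogue of `getidx`):
read_at(n::LeafNode, i) = n.data[i]                            # "global memory" read
read_at(n::PointwiseNode, i) = n.f(map(a -> read_at(a, i), n.args)...)
# For a stencil node, apply the stencil to shmem-resident argument data
# (the analogue of the overloaded `getidx` calling `fd_operator_evaluate`):
read_at(n::StencilNode, i) = n.op(n.shmem, i)

# Fill shmem leaves-first: each stencil node reads its argument once per point
# and caches it, so the stencil can later read neighbouring values from the
# cache instead of re-reading them from global memory.
fill_shmem!(::LeafNode) = nothing
fill_shmem!(n::PointwiseNode) = foreach(fill_shmem!, n.args)
function fill_shmem!(n::StencilNode)
    fill_shmem!(n.arg)                                         # children first
    for i in eachindex(n.shmem)
        n.shmem[i] = read_at(n.arg, i)                         # one read per point
    end
end
```

In this picture, `fill_shmem!` plays the role of `fd_operator_fill_shmem!`, and
the `StencilNode` method of `read_at` plays the role of the shmem-aware
`getidx` method that calls `fd_operator_evaluate`.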

### Populating shared memory, and memory access safety

We use tail recursion when filling the shared memory of the broadcast
expressions. That is, we visit the leaves of the broadcast expression first,
then work our way up. It's important to note that `StencilBroadcasted` and
`Broadcasted` objects can be interleaved.

Let's take `DivergenceF2C()(f * GradientC2F()(a * b))` as an example (depicted
in the image below).

Recursion must go through the entire expression in order to ensure that we've
reached all of the leaves of the `StencilBroadcasted` objects (otherwise, we
could introduce race conditions with memory access). The leaves of the
`StencilBroadcasted` will call `getidx`, below which there are (by definition)
no more `StencilBroadcasted`, and those `getidx` calls will read from global
memory. All subsequent reads will be from shmem (as they will be caught by the
`getidx(parent_space, bc::StencilBroadcasted{CUDAWithShmemColumnStencilStyle}, idx, hidx)`
method defined in the `ClimaCoreCUDAExt` module).

In the diagram below, we traverse and fill the yellow-highlighted sections
(bottom first and top last). The algorithmic impact of using shared memory is
that the duplicate global memory reads (highlighted in red circles) become one
global memory read (performed in `fd_operator_fill_shmem!`).
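
To make the traversal concrete, the following is a hand-unrolled, CPU-only
walk-through of the example above. It is only a sketch under simplifying
assumptions: plain `Vector`s stand in for shmem, naive differences on a uniform
grid (spacing `dz`) stand in for the real `GradientC2F`/`DivergenceF2C`
stencils, boundaries are simply skipped, and the sizes/staggering are
illustrative.

```julia
n  = 8
a  = rand(n); b = rand(n)      # center-valued arguments of the innermost product
f  = rand(n + 1)               # face-valued argument of the outer product
dz = 1.0                       # uniform grid spacing for the toy differences

# (1) Innermost `StencilBroadcasted` (the gradient): its argument `a * b` is
#     read from "global memory" exactly once per point and cached.
grad_arg_shmem = [a[i] * b[i] for i in 1:n]

# (2) Outer `StencilBroadcasted` (the divergence): its argument
#     `f * GradientC2F()(a * b)` is cached at faces; the gradient part reads
#     *neighbouring* values from the cache above, not from global memory.
#     (Interior faces only; the real operators apply boundary conditions.)
div_arg_shmem = zeros(n + 1)
for i in 2:n
    grad_at_face = (grad_arg_shmem[i] - grad_arg_shmem[i - 1]) / dz
    div_arg_shmem[i] = f[i] * grad_at_face
end

# (3) Destination: the divergence is evaluated from its cached argument,
#     again reading only neighbouring cached values (interior points only).
result = zeros(n)
for i in 2:(n - 1)
    result[i] = (div_arg_shmem[i + 1] - div_arg_shmem[i]) / dz
end
```

Note how `a`, `b`, and `f` are each read once per point, while all of the
repeated neighbour accesses hit the cached (shmem) arrays.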

Finally, it's important to note that threads must be synchronized after each
node in the tree is filled, to avoid race conditions for subsequent
`getidx(parent_space, bc::StencilBroadcasted{CUDAWithShmemColumnStencilStyle}, idx, hidx)`
calls (whose results are retrieved from shmem).

![](shmem_diagram_example.png)
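
As a concrete illustration of the synchronization requirement noted above,
here is a small, self-contained CUDA.jl toy kernel (not ClimaCore code) in the
same spirit: every thread writes one value into shared memory, and
`sync_threads()` must run before any thread reads its neighbours' entries;
without it, the centered difference could read slots that have not been filled
yet.

```julia
using CUDA

function centered_diff_kernel!(out, f, dx)
    i = threadIdx().x
    n = blockDim().x
    shmem = CuDynamicSharedArray(Float64, n)
    @inbounds shmem[i] = f[i]       # one global-memory read per point
    sync_threads()                  # all of shmem is filled before anyone reuses it
    @inbounds if 1 < i < n
        out[i] = (shmem[i + 1] - shmem[i - 1]) / (2 * dx)
    end
    return nothing
end

n   = 64
f   = CUDA.rand(Float64, n)
out = CUDA.zeros(Float64, n)
@cuda threads=n shmem=n*sizeof(Float64) centered_diff_kernel!(out, f, 1.0)
```

This is the same kind of barrier that, in the tree traversal above, separates
filling one node's shmem from the `getidx` calls of the node above it.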