|
20 | 20 | // This model means the GPU runtime can specify the amount of memory allocated.
|
21 | 21 | // If this is more than the kernel assumed, the excess can be made available
|
22 | 22 | // using a language specific feature, which IR represents as a variable with
|
23 |
| -// no initializer. This feature is not yet implemented for non-kernel functions. |
24 |
| -// This lowering could be extended to handle that use case, but would probably |
25 |
| -// require closer integration with promoteAllocaToLDS. |
| 23 | +// no initializer. This feature is referred to here as "Dynamic LDS" and is |
| 24 | +// lowered slightly differently to the normal case. |
26 | 25 | //
|
27 | 26 | // Consequences of this GPU feature:
|
28 | 27 | // - memory is limited and exceeding it halts compilation
|
|
65 | 64 | // Kernel | Yes | Yes | No |
|
66 | 65 | // Hybrid | Yes | Partial | Yes |
|
67 | 66 | //
|
68 |
| -// Module spends LDS memory to save cycles. Table spends cycles and global |
69 |
| -// memory to save LDS. Kernel is as fast as kernel allocation but only works |
70 |
| -// for variables that are known reachable from a single kernel. Hybrid picks |
71 |
| -// between all three. When forced to choose between LDS and cycles it minimises |
| 67 | +// "Module" spends LDS memory to save cycles. "Table" spends cycles and global |
| 68 | +// memory to save LDS. "Kernel" is as fast as kernel allocation but only works |
| 69 | +// for variables that are known reachable from a single kernel. "Hybrid" picks |
| 70 | +// between all three. When forced to choose between LDS and cycles we minimise |
72 | 71 | // LDS use.
|
73 | 72 |
|
74 | 73 | // The "module" lowering implemented here finds LDS variables which are used by
|
|
115 | 114 | // use LDS are expected to hit the "Kernel" lowering strategy
|
116 | 115 | // - The runtime properties impose a cost in compiler implementation complexity
|
117 | 116 | //
|
| 117 | +// Dynamic LDS implementation |
| 118 | +// Dynamic LDS is lowered similarly to the "table" strategy above and uses the |
| 119 | +// same intrinsic to identify which kernel is at the root of the dynamic call |
| 120 | +// graph. This relies on the specified behaviour that all dynamic LDS variables |
| 121 | +// alias one another, i.e. are at the same address, with respect to a given |
| 122 | +// kernel. Therefore this pass creates new dynamic LDS variables for each kernel |
| 123 | +// that allocates any dynamic LDS and builds a table of addresses out of those. |
| 124 | +// The AMDGPUPromoteAlloca pass skips kernels that use dynamic LDS. |
| 125 | +// The corresponding optimisation for "kernel" lowering where the table lookup |
| 126 | +// is elided is not implemented. |
| 127 | +// |
| 128 | +// |
| 129 | +// Implementation notes / limitations |
| 130 | +// A single LDS global variable represents an instance per kernel that can reach |
| 131 | +// said variable. This pass essentially specialises such variables per kernel. |
| 132 | +// Handling ConstantExpr during the pass complicated this significantly so now |
| 133 | +// all ConstantExpr uses of LDS variables are expanded to instructions. This |
| 134 | +// may need amending when implementing non-undef initialisers. |
| 135 | +// |
| 136 | +// Lowering is split between this IR pass and the back end. This pass chooses |
| 137 | +// where given variables should be allocated and marks them with metadata, |
| 138 | +// MD_absolute_symbol. The backend coincidentally places the variables in the |
| 139 | +// same location and raises a fatal error if something has gone awry. This works |
| 140 | +// in practice because the only pass between this one and the backend that |
| 141 | +// changes LDS is PromoteAlloca and the changes it makes do not conflict. |
| 142 | +// |
| 143 | +// Addresses are written to constant global arrays based on the same metadata. |
| 144 | +// |
| 145 | +// The backend lowers LDS variables in the order of traversal of the function. |
| 146 | +// This is at odds with the deterministic layout required. The workaround is to |
| 147 | +// allocate the fixed-address variables immediately upon starting the function |
| 148 | +// where they can be placed as intended. This requires a means of mapping from |
| 149 | +// the function to the variables that it allocates. For the module scope LDS, |
| 150 | +// this is via metadata indicating whether the variable is required. If a |
| 151 | +// pass deletes that metadata, a fatal error on disagreement with the absolute |
| 152 | +// symbol metadata will occur. For kernel scope and dynamic, this is by _name_ |
| 153 | +// correspondence between the function and the variable. It requires the |
| 154 | +// kernel to have a name (which is only a limitation for tests in practice) and |
| 155 | +// for nothing to rename the corresponding symbols. This is a hazard if the pass |
| 156 | +// is run multiple times during debugging. Alternative schemes considered all |
| 157 | +// involve bespoke metadata. |
| 158 | +// |
| 159 | +// If the name correspondence can be replaced, multiple distinct kernels that |
| 160 | +// have the same memory layout can map to the same kernel id (as the address |
| 161 | +// itself is handled by the absolute symbol metadata) and that will allow more |
| 162 | +// uses of the "kernel" style faster lowering and reduce the size of the lookup |
| 163 | +// tables. |
| 164 | +// |
| 165 | +// There is a test that checks this does not fire for a graphics shader. This |
| 166 | +// lowering is expected to work for graphics if the isKernel test is changed. |
| 167 | +// |
| 168 | +// The current markUsedByKernel is sufficient for PromoteAlloca but is elided |
| 169 | +// before codegen. Replacing this with an equivalent intrinsic which lasts until |
| 170 | +// shortly after the machine function lowering of LDS would help break the name |
| 171 | +// mapping. The other part needed is probably to amend PromoteAlloca to embed |
| 172 | +// the LDS variables it creates in the same struct created here. That avoids the |
| 173 | +// current hazard where a PromoteAlloca LDS variable might be allocated before |
| 174 | +// the kernel scope (and thus error on the address check). Given a new invariant |
| 175 | +// that no LDS variables exist outside of the structs managed here, and an |
| 176 | +// intrinsic that lasts until after the LDS frame lowering, it should be |
| 177 | +// possible to drop the name mapping and fold equivalent memory layouts. |
| 178 | +// |
118 | 179 | //===----------------------------------------------------------------------===//
|
119 | 180 |
|
120 | 181 | #include "AMDGPU.h"
|
|
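The "Dynamic LDS implementation" comment added in this change describes a per-kernel table of addresses, indexed by the kernel at the root of the dynamic call graph. A minimal host-side model of that lookup is sketched below. It is not the code the pass emits: `NumKernels`, `DynamicLDSTable`, `dynamicLDSAddress`, and the address values are hypothetical names and numbers chosen for illustration, and the kernel id is passed as a plain argument where the real lowering obtains it from an intrinsic.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model for illustration only; none of these names come from
// the pass itself.
constexpr std::size_t NumKernels = 3;

// One entry per kernel: the LDS address at which that kernel's dynamic LDS
// begins. All dynamic LDS variables reachable from a given kernel alias one
// another, so a single address per kernel is enough. Values are invented.
constexpr std::array<std::uint32_t, NumKernels> DynamicLDSTable = {64, 0, 128};

// Stand-in for the lookup a non-kernel function performs. KernelId models the
// result of the intrinsic that identifies the kernel at the root of the
// dynamic call graph.
std::uint32_t dynamicLDSAddress(std::size_t KernelId) {
  assert(KernelId < NumKernels && "kernel id out of range");
  return DynamicLDSTable[KernelId];
}
```

In this model a function called from kernel 0 would see its dynamic LDS at address 64, while the same function called from kernel 2 would see it at 128. That is the reason a lookup is needed at all when the calling kernel is not known statically.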
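The implementation notes also describe a split where the IR pass records an intended address for each variable and the backend, allocating in traversal order, raises a fatal error on disagreement. The sketch below models only that check, under loud assumptions: `LDSVariable`, `allocateInOrder`, and the use of a C++ exception in place of a fatal error are all hypothetical, and alignment is ignored.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical model of an LDS variable as seen by the allocator.
struct LDSVariable {
  std::string Name;
  std::uint32_t Size;      // size in bytes; alignment is ignored in this sketch
  bool HasExpected;        // models the presence of absolute-symbol metadata
  std::uint32_t Expected;  // address chosen earlier, valid if HasExpected
};

// Allocates variables in the order given and returns the total bytes used.
// Throws (a stand-in for a fatal error) when a variable with a recorded
// address would land somewhere else.
std::uint32_t allocateInOrder(const std::vector<LDSVariable> &Vars) {
  std::uint32_t Offset = 0;
  for (const LDSVariable &V : Vars) {
    assert(V.Size > 0 && "zero-sized variables are not modelled");
    if (V.HasExpected && V.Expected != Offset)
      throw std::runtime_error("LDS address mismatch for " + V.Name);
    Offset += V.Size;
  }
  return Offset;
}
```

Allocating the fixed-address variable first succeeds in this model, while allocating an unrelated variable ahead of it triggers the mismatch. This mirrors the hazard the comment describes for a PromoteAlloca variable being allocated before the kernel scope struct.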