Commit e17c1bb

[amdgpu][nfc] Update comments on LDS lowering
1 parent b94f0a9 commit e17c1bb

1 file changed: +68 -7 lines changed

llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp

Lines changed: 68 additions & 7 deletions
@@ -20,9 +20,8 @@
 // This model means the GPU runtime can specify the amount of memory allocated.
 // If this is more than the kernel assumed, the excess can be made available
 // using a language specific feature, which IR represents as a variable with
-// no initializer. This feature is not yet implemented for non-kernel functions.
-// This lowering could be extended to handle that use case, but would probably
-// require closer integration with promoteAllocaToLDS.
+// no initializer. This feature is referred to here as "Dynamic LDS" and is
+// lowered slightly differently to the normal case.
 //
 // Consequences of this GPU feature:
 // - memory is limited and exceeding it halts compilation
@@ -65,10 +64,10 @@
 // Kernel | Yes | Yes | No |
 // Hybrid | Yes | Partial | Yes |
 //
-// Module spends LDS memory to save cycles. Table spends cycles and global
-// memory to save LDS. Kernel is as fast as kernel allocation but only works
-// for variables that are known reachable from a single kernel. Hybrid picks
-// between all three. When forced to choose between LDS and cycles it minimises
+// "Module" spends LDS memory to save cycles. "Table" spends cycles and global
+// memory to save LDS. "Kernel" is as fast as kernel allocation but only works
+// for variables that are known reachable from a single kernel. "Hybrid" picks
+// between all three. When forced to choose between LDS and cycles we minimise
 // LDS use.
 
 // The "module" lowering implemented here finds LDS variables which are used by
@@ -115,6 +114,68 @@
 // use LDS are expected to hit the "Kernel" lowering strategy
 // - The runtime properties impose a cost in compiler implementation complexity
 //
+// Dynamic LDS implementation
+// Dynamic LDS is lowered similarly to the "table" strategy above and uses the
+// same intrinsic to identify which kernel is at the root of the dynamic call
+// graph. This relies on the specified behaviour that all dynamic LDS variables
+// alias one another, i.e. are at the same address, with respect to a given
+// kernel. Therefore this pass creates new dynamic LDS variables for each kernel
+// that allocates any dynamic LDS and builds a table of addresses out of those.
+// The AMDGPUPromoteAlloca pass skips kernels that use dynamic LDS.
+// The corresponding optimisation for "kernel" lowering where the table lookup
+// is elided is not implemented.
+//
+//
+// Implementation notes / limitations
+// A single LDS global variable represents an instance per kernel that can reach
+// said variables. This pass essentially specialises said variables per kernel.
+// Handling ConstantExpr during the pass complicated this significantly so now
+// all ConstantExpr uses of LDS variables are expanded to instructions. This
+// may need amending when implementing non-undef initialisers.
+//
+// Lowering is split between this IR pass and the back end. This pass chooses
+// where given variables should be allocated and marks them with metadata,
+// MD_absolute_symbol. The backend places the variables in coincidentally the
+// same location and raises a fatal error if something has gone awry. This works
+// in practice because the only pass between this one and the backend that
+// changes LDS is PromoteAlloca and the changes it makes do not conflict.
+//
+// Addresses are written to constant global arrays based on the same metadata.
+//
+// The backend lowers LDS variables in the order of traversal of the function.
+// This is at odds with the deterministic layout required. The workaround is to
+// allocate the fixed-address variables immediately upon starting the function
+// where they can be placed as intended. This requires a means of mapping from
+// the function to the variables that it allocates. For the module scope lds,
+// this is via metadata indicating whether the variable is not required. If a
+// pass deletes that metadata, a fatal error on disagreement with the absolute
+// symbol metadata will occur. For kernel scope and dynamic, this is by _name_
+// correspondence between the function and the variable. It requires the
+// kernel to have a name (which is only a limitation for tests in practice) and
+// for nothing to rename the corresponding symbols. This is a hazard if the pass
+// is run multiple times during debugging. Alternative schemes considered all
+// involve bespoke metadata.
+//
+// If the name correspondence can be replaced, multiple distinct kernels that
+// have the same memory layout can map to the same kernel id (as the address
+// itself is handled by the absolute symbol metadata) and that will allow more
+// uses of the "kernel" style faster lowering and reduce the size of the lookup
+// tables.
+//
+// There is a test that checks this does not fire for a graphics shader. This
+// lowering is expected to work for graphics if the isKernel test is changed.
+//
+// The current markUsedByKernel is sufficient for PromoteAlloca but is elided
+// before codegen. Replacing this with an equivalent intrinsic which lasts until
+// shortly after the machine function lowering of LDS would help break the name
+// mapping. The other part needed is probably to amend PromoteAlloca to embed
+// the LDS variables it creates in the same struct created here. That avoids the
+// current hazard where a PromoteAlloca LDS variable might be allocated before
+// the kernel scope (and thus error on the address check). Given a new invariant
+// that no LDS variables exist outside of the structs managed here, and an
+// intrinsic that lasts until after the LDS frame lowering, it should be
+// possible to drop the name mapping and fold equivalent memory layouts.
+//
 //===----------------------------------------------------------------------===//
 
 #include "AMDGPU.h"
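
Note on the "Dynamic LDS implementation" comment: the table-based lookup it describes can be modelled outside LLVM as a plain array indexed by kernel id. The sketch below is only an illustration under stated assumptions and is not code from the pass. `KernelInfo`, `alignTo`, `buildDynamicLDSTable` and `lookupDynamicLDS` are hypothetical names, and placing the dynamic region at the aligned end of the kernel's static allocation is an assumption made for the example. It relies on the property stated in the comment that all dynamic LDS variables alias one another within a given kernel, so one address per kernel is enough.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical model of a kernel; its position in the vector is its kernel id.
struct KernelInfo {
  uint32_t StaticLDSBytes; // LDS placed at fixed addresses for this kernel
  bool UsesDynamicLDS;     // true if any dynamic LDS variable is reachable
};

// Round Value up to a multiple of Align. Align must be a power of two.
constexpr uint32_t alignTo(uint32_t Value, uint32_t Align) {
  return (Value + Align - 1) & ~(Align - 1);
}

// Build one table slot per kernel. Kernels that allocate dynamic LDS get a
// single address (assumed here to follow the static allocation); the other
// slots stay empty. Every dynamic variable in a kernel shares that address.
std::vector<std::optional<uint32_t>>
buildDynamicLDSTable(const std::vector<KernelInfo> &Kernels, uint32_t Align) {
  std::vector<std::optional<uint32_t>> Table;
  Table.reserve(Kernels.size());
  for (const KernelInfo &K : Kernels) {
    if (K.UsesDynamicLDS)
      Table.push_back(alignTo(K.StaticLDSBytes, Align));
    else
      Table.push_back(std::nullopt);
  }
  return Table;
}

// Models what a non-kernel function does: take the id of the kernel at the
// root of the call graph and read the address from the table.
std::optional<uint32_t>
lookupDynamicLDS(const std::vector<std::optional<uint32_t>> &Table,
                 uint32_t KernelId) {
  if (KernelId >= Table.size())
    return std::nullopt;
  return Table[KernelId];
}
```

In this model a kernel with 100 bytes of static LDS and 16-byte alignment gets the dynamic address 112, and a kernel that never reaches a dynamic LDS variable has no entry. Per the comment, the real pass obtains the kernel id from an intrinsic and writes the addresses to constant global arrays; the sketch only mirrors the indexing.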
