Commit f78a857

Add kernel compilation requirements to docs (#2416)
[only docs]
1 parent cdae2d3 commit f78a857

1 file changed: docs/src/development/kernel.md (+68 −0 lines)

@@ -97,6 +97,74 @@ As shown above, the `threadIdx` etc. values from CUDA C are available as functions returning
a `NamedTuple` with `x`, `y`, and `z` fields. The intrinsics return 1-based indices.
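
For instance, a minimal elementwise kernel using these intrinsics might look as follows (a sketch, not part of the committed docs; the kernel and array names are illustrative, and running it requires a CUDA-capable GPU):

```julia
using CUDA

function index_kernel!(out)
    # threadIdx(), blockIdx(), and blockDim() each return a NamedTuple
    # with x, y, and z fields; all indices are 1-based.
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(out)
        @inbounds out[i] = i
    end
    return nothing
end

out = CUDA.zeros(Int, 8)
@cuda threads=4 blocks=2 index_kernel!(out)
```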
## Kernel compilation requirements
For custom kernels to work, they need to meet certain requirements.
First, all memory used by the kernel must be accessible on the GPU. This can be enforced by
using the correct types, e.g. a `CuArray` whose elements are of a bits type. Custom structs
can be ported as described in the
[corresponding tutorial](https://cuda.juliagpu.org/dev/tutorials/custom_structs/).
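
To check this requirement up front, `isbitstype` can be queried on the host (a sketch, not part of the committed docs; the struct names are illustrative):

```julia
struct Point            # immutable, only bits-type fields
    x::Float64
    y::Float64
end

struct Boxed            # holds a heap reference, so not a bits type
    data::Vector{Float64}
end

isbitstype(Point)   # true:  Point values can be stored in a CuArray
isbitstype(Boxed)   # false: needs porting, e.g. via Adapt.jl
```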
Second, the kernel must not contain any runtime dispatch: all function calls need to be
resolved at compile time. Note that runtime dispatch can also be introduced by functions
that are not fully specialized. Consider this example:
```julia-repl
julia> function my_inner_kernel!(f, t) # does not specialize
           t .= f.(t)
       end
my_inner_kernel! (generic function with 1 method)

julia> function my_outer_kernel(f, a)
           i = threadIdx().x
           my_inner_kernel!(f, @view a[i, :])
           return nothing
       end
my_outer_kernel (generic function with 1 method)

julia> a = CUDA.rand(Int, (2,2))
2×2 CuArray{Int64, 2, CUDA.DeviceMemory}:
  5153094658246882343  -1636555237989902283
  2088126782868946458  -5701665962120018867

julia> id(x) = x
id (generic function with 1 method)

julia> @cuda threads=size(a, 1) my_outer_kernel(id, a)
ERROR: InvalidIRError: compiling MethodInstance for my_outer_kernel(::typeof(id), ::CuDeviceMatrix{Int64, 1}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to my_inner_kernel!(f, t) @ Main REPL[27]:1)
```
Here the function `my_inner_kernel!` is not specialized. We can force specialization
in this case as follows:

```julia-repl
julia> function my_inner_kernel2!(f::F, t::T) where {F,T} # forces specialization
           t .= f.(t)
       end
my_inner_kernel2! (generic function with 1 method)

julia> function my_outer_kernel2(f, a)
           i = threadIdx().x
           my_inner_kernel2!(f, @view a[i, :])
           return nothing
       end
my_outer_kernel2 (generic function with 1 method)

julia> a = CUDA.rand(Int, (2,2))
2×2 CuArray{Int64, 2, CUDA.DeviceMemory}:
  3193805011610800677  4871385510397812058
 -9060544314843886881  8829083170181145736

julia> id(x) = x
id (generic function with 1 method)

julia> @cuda threads=size(a, 1) my_outer_kernel2(id, a)
CUDA.HostKernel for my_outer_kernel2(typeof(id), CuDeviceMatrix{Int64, 1})
```
More cases and details on specialization can be found in [the Julia manual](https://docs.julialang.org/en/v1/manual/performance-tips/#Be-aware-of-when-Julia-avoids-specializing).

## Synchronization

To synchronize threads in a block, use the `sync_threads()` function. More advanced variants
