Commit 8edc7e2

Author: Hugh Delaney
Message: Add initial CUDA and HIP usage guides
1 parent a3120e7 commit 8edc7e2

File tree

3 files changed: +256 -0 lines changed

scripts/core/CUDA.rst

Lines changed: 159 additions & 0 deletions

<%
OneApi=tags['$OneApi']
x=tags['$x']
X=x.upper()
%>

==========================
CUDA UR Reference Document
==========================

This document gives general guidelines on how to use UR to load and build
programs, and to execute kernels on a CUDA device.

Device code
===========

A CUDA device image may be made of PTX and/or SASS, two different kinds of
device code for NVIDIA GPUs.

CUDA device images can be generated by a CUDA-capable compiler toolchain. Most
CUDA compiler toolchains can generate PTX, SASS, or bundles of PTX and SASS.

PTX
---

PTX is a high-level NVIDIA ISA that can be JIT compiled at runtime by the CUDA
driver. In UR, this JIT compilation happens at ${x}ProgramBuild, where PTX is
assembled into device-specific SASS, which can then run on the device.

PTX is forward compatible, so PTX generated for ``.target sm_52`` will be JIT
compiled without issue for devices with a compute capability greater than
``sm_52``. However, PTX generated for ``sm_80`` cannot be JIT compiled for an
``sm_60`` device.
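
The forward-compatibility rule above reduces to a comparison of compute
capabilities. A minimal sketch (the function name and the integer encoding of
``sm_XX`` values are illustrative, not part of the UR or CUDA APIs):

.. code-block:: c

    #include <stdio.h>

    /* Illustrative sketch: PTX built for `ptx_target` can be JIT compiled
     * for any device whose compute capability is `ptx_target` or greater. */
    static int ptx_can_jit(int ptx_target, int device_cc)
    {
        return device_cc >= ptx_target;
    }

    int main(void)
    {
        /* sm_52 PTX on an sm_80 device: JIT compiles without issue. */
        printf("sm_52 on sm_80: %s\n",
               ptx_can_jit(52, 80) ? "ok" : "CUDA_ERROR_NO_BINARY_FOR_GPU");
        /* sm_80 PTX on an sm_60 device: cannot be JIT compiled. */
        printf("sm_80 on sm_60: %s\n",
               ptx_can_jit(80, 60) ? "ok" : "CUDA_ERROR_NO_BINARY_FOR_GPU");
        return 0;
    }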

An advantage of using PTX over SASS is that the same code can run on multiple
devices. However, PTX generated for an older arch may not give access to newer
hardware instructions, such as new atomic operations or tensor core
instructions.

JIT compilation has some overhead at ${x}ProgramBuild, especially if the
program being loaded contains multiple kernels. The ``ptxjitcompiler`` keeps a
JIT cache, however, so this overhead is only paid the first time a program is
built. JIT caching may be turned off by setting the environment variable
``CUDA_CACHE_DISABLE=1``.

SASS
----

SASS is a device-specific binary that may be produced by ``ptxas`` or some
other tool. SASS is specific to an individual arch and is not portable across
arches.

A SASS file may be stored as a ``.cubin`` file by NVIDIA tools.

UR Programs
===========

A ${x}_program_handle_t has a one-to-one mapping with the CUDA driver object
`CUmodule <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MODULE.html#group__CUDA__MODULE>`_.

In UR for CUDA, a ${x}_program_handle_t can be created using
${x}ProgramCreateWithBinary with:

* A single PTX module, stored as a null-terminated ``uint8_t`` buffer.
* A single SASS module, stored as an opaque ``uint8_t`` buffer.
* A mixed PTX/SASS module, where the SASS module is the assembled PTX module.

A ${x}_program_handle_t is valid only for a single architecture. If a CUDA
compatible binary contains device code for multiple NVIDIA architectures, it is
the user's responsibility to split out the separate device images so that
${x}ProgramCreateWithBinary is only called with a device binary for a single
device arch.
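
The splitting requirement above amounts to a selection step: pick the one
image in a multi-arch binary that matches the device, and hand only that image
to the program-creation call. A minimal sketch, where the ``device_image``
struct and the arch strings are hypothetical, not UR types:

.. code-block:: c

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical description of one device image in a multi-arch binary. */
    struct device_image {
        const char *arch;          /* e.g. "sm_60", "sm_80" */
        const unsigned char *data; /* PTX or SASS bytes */
    };

    /* Pick the single image matching the device arch; only this image
     * should be passed on. Returns NULL if none matches. */
    static const struct device_image *
    select_image(const struct device_image *imgs, int n, const char *device_arch)
    {
        for (int i = 0; i < n; ++i)
            if (strcmp(imgs[i].arch, device_arch) == 0)
                return &imgs[i];
        return NULL;
    }

    int main(void)
    {
        struct device_image imgs[] = {
            { "sm_60", (const unsigned char *)"..." },
            { "sm_80", (const unsigned char *)"..." },
        };
        const struct device_image *img = select_image(imgs, 2, "sm_80");
        printf("selected arch: %s\n", img ? img->arch : "(none)");
        return 0;
    }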

If a program is large and contains many kernels, loading and/or JIT compiling
the program may have a high overhead. This can be mitigated by splitting the
program into multiple smaller programs (corresponding to PTX/SASS files). In
this way, an application will only pay the overhead of loading and compiling
the kernels that it is likely to use.

Using PTX Modules in UR
-----------------------

A PTX module will be loaded and JIT compiled for the necessary architecture at
${x}ProgramBuild. If the PTX module has been generated for a compute capability
greater than the compute capability of the device, then ${x}ProgramBuild will
fail with the error ``CUDA_ERROR_NO_BINARY_FOR_GPU``.

A PTX module passed to ${x}ProgramBuild must contain only one PTX file.
Separate PTX files are to be handled separately.

Arguments may be passed to the ``ptxjitcompiler`` via ${x}ProgramBuild.
Currently ``maxrregcount`` is the only supported argument.

.. parsed-literal::

    ${x}ProgramBuild(ctx, program, "maxrregcount=128");

Using SASS Modules in UR
------------------------

A SASS module will be loaded and checked for compatibility at ${x}ProgramBuild.
If the SASS module is incompatible with the device arch then ${x}ProgramBuild
will fail with the error ``CUDA_ERROR_NO_BINARY_FOR_GPU``.

Using Mixed PTX/SASS Bundles in UR
----------------------------------

Mixed PTX/SASS modules can be used to make a program with
${x}ProgramCreateWithBinary. At ${x}ProgramBuild the CUDA driver will check
whether the bundled SASS is compatible with the active device. If the SASS is
compatible then the ${x}_program_handle_t will be built from the SASS; if
not, then the PTX will be used as a fallback and JIT compiled by the CUDA
driver. If both the PTX and the SASS are incompatible with the active device
then ${x}ProgramBuild will fail with the error
``CUDA_ERROR_NO_BINARY_FOR_GPU``.
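
The fallback behaviour described above can be sketched as plain control flow.
This is an illustration only, using integer compute capabilities and the
simplification that SASS is compatible only with the exact arch it was built
for:

.. code-block:: c

    #include <stdio.h>

    enum build_result { USE_SASS, JIT_PTX, NO_BINARY_FOR_GPU };

    /* Sketch of the selection: prefer compatible SASS, fall back to JIT
     * compiling the PTX, otherwise fail. */
    static enum build_result
    build_mixed_bundle(int sass_arch, int ptx_target, int device_cc)
    {
        if (sass_arch == device_cc)
            return USE_SASS;       /* program built from the SASS */
        if (ptx_target <= device_cc)
            return JIT_PTX;        /* PTX JIT compiled as a fallback */
        return NO_BINARY_FOR_GPU;  /* CUDA_ERROR_NO_BINARY_FOR_GPU */
    }

    static const char *result_name(enum build_result r)
    {
        switch (r) {
        case USE_SASS: return "use SASS";
        case JIT_PTX:  return "JIT compile PTX";
        default:       return "CUDA_ERROR_NO_BINARY_FOR_GPU";
        }
    }

    int main(void)
    {
        printf("%s\n", result_name(build_mixed_bundle(80, 80, 80)));
        printf("%s\n", result_name(build_mixed_bundle(80, 52, 70)));
        printf("%s\n", result_name(build_mixed_bundle(90, 90, 70)));
        return 0;
    }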

UR Kernels
==========

Once ${x}ProgramCreateWithBinary and ${x}ProgramBuild have succeeded, kernels
can be fetched from programs with ${x}KernelCreate. ${x}KernelCreate must be
called with the exact name of the kernel in the PTX/SASS module. This name will
depend on the mangling used when compiling the kernel, so it is recommended to
examine the symbols in the PTX/SASS module before trying to extract kernels in
UR.

.. code-block:: console

    $ cuobjdump --dump-elf-symbols hello.cubin | grep mykernel
    _Z8mykernelv
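
Mangled kernel names follow the Itanium C++ ABI scheme: ``_Z``, the decimal
length of the function name, the name itself, then parameter type codes
(``v`` encodes an empty parameter list). A sketch covering only the
no-argument free-function case (real mangling also handles namespaces,
arguments, and templates):

.. code-block:: c

    #include <stdio.h>
    #include <string.h>

    /* Itanium-style mangling for a free function with no arguments:
     * "_Z" + decimal length of the name + name + "v". */
    static void mangle_noargs(const char *name, char *out, size_t outsz)
    {
        snprintf(out, outsz, "_Z%zu%sv", strlen(name), name);
    }

    int main(void)
    {
        char sym[64];
        /* void mykernel() mangles to _Z8mykernelv */
        mangle_noargs("mykernel", sym, sizeof sym);
        printf("%s\n", sym);
        return 0;
    }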

At present it is not possible to query the names of the kernels in a UR
program for CUDA, so it is necessary to know the (mangled or otherwise) names
of kernels in advance or by some other means.

UR kernels can be dispatched with ${x}EnqueueKernelLaunch. The argument
``pGlobalWorkOffset`` can only be used if the kernels have been instrumented to
take the extra global offset argument. Use of the global offset is not
recommended for non-SYCL compiler toolchains. This parameter can be ignored if
the user does not wish to use the global offset.

Other Notes
===========

- The environment variable ``SYCL_PI_CUDA_MAX_LOCAL_MEM_SIZE`` can be set in
  order to exceed the default max dynamic local memory size. More information
  can be found
  `here <https://intel.github.io/llvm-docs/EnvironmentVariables.html#controlling-dpc-cuda-plugin>`_.
- The size of primitive datatypes may differ between host and device code. For
  instance, NVCC treats ``long double`` as 8 bytes for device and 16 bytes for
  host.
- In-kernel ``printf`` for NVPTX targets does not support the ``%z`` modifier.

Contributors
------------

* Hugh Delaney `hugh.delaney@codeplay.com <hugh.delaney@codeplay.com>`_

scripts/core/HIP.rst

Lines changed: 95 additions & 0 deletions

<%
OneApi=tags['$OneApi']
x=tags['$x']
X=x.upper()
%>

=============================
AMD HIP UR Reference Document
=============================

This document gives general guidelines on how to use UR to execute kernels on
an AMD HIP device.

Device code
===========

Unlike the NVPTX platform, AMDGPU does not use a device IR that can be JIT
compiled at runtime. Therefore, all device binaries must be precompiled for a
particular arch.

The naming of AMDGPU device code files may vary across different generations
of devices. ``.hsa`` and ``.hsaco`` are common extensions as of 2023.

HIPCC can generate device code for a particular arch using the ``--genco``
flag:

.. code-block:: console

    $ hipcc --genco hello.cu --amdgpu-target=gfx906 -o hello.hsaco

UR Programs
===========

A ${x}_program_handle_t has a one-to-one mapping with the HIP runtime object
`hipModule_t <https://docs.amd.com/projects/HIP/en/latest/.doxygen/docBin/html/group___module.html>`__.

In UR for HIP, a ${x}_program_handle_t can be created using
${x}ProgramCreateWithBinary with:

* A single device code module

A ${x}_program_handle_t is valid only for a single architecture. If a HIP
compatible binary contains device code for multiple AMDGPU architectures, it is
the user's responsibility to split out the separate device images so that
${x}ProgramCreateWithBinary is only called with a device binary for a single
device arch.

If the AMDGPU module is incompatible with the device arch then ${x}ProgramBuild
will fail with the error ``hipErrorNoBinaryForGpu``.

If a program is large and contains many kernels, loading the program may have a
high overhead. This can be mitigated by splitting the program into multiple
smaller programs. In this way, an application will only pay the overhead of
loading the kernels that it is likely to use.

Kernels
=======

Once ${x}ProgramCreateWithBinary and ${x}ProgramBuild have succeeded, kernels
can be fetched from programs with ${x}KernelCreate. ${x}KernelCreate must be
called with the exact name of the kernel in the AMDGPU device code module. This
name will depend on the mangling used when compiling the kernel, so it is
recommended to examine the symbols in the AMDGPU device code module before
trying to extract kernels in UR code.

``llvm-objdump`` or ``readelf`` may not correctly view the symbols in an AMDGPU
device module. It may be necessary to call ``clang-offload-bundler`` first in
order to extract the ``ELF`` file that can be passed to ``readelf``.

.. code-block:: console

    $ clang-offload-bundler --unbundle --input=hello.hsaco --output=hello.o \
        --targets=hipv4-amdgcn-amd-amdhsa--gfx906 --type=o
    $ readelf hello.o -s | grep mykernel
    _Z8mykernelv
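
The ``--targets`` value above is a bundle entry ID of the form
``<offload kind>-<target triple>--<arch>``. A small sketch of splitting such
an ID into its parts (a hypothetical helper for illustration; it is not part
of ``clang-offload-bundler`` and does no bounds checking):

.. code-block:: c

    #include <stdio.h>
    #include <string.h>

    /* Split e.g. "hipv4-amdgcn-amd-amdhsa--gfx906" into an offload kind,
     * a target triple, and an arch. Returns 1 on success, 0 otherwise. */
    static int split_entry_id(const char *id,
                              char *kind, char *triple, char *arch)
    {
        const char *dash = strchr(id, '-');    /* end of the offload kind */
        const char *ddash = strstr(id, "--");  /* separator before the arch */
        if (!dash || !ddash || ddash <= dash)
            return 0;
        memcpy(kind, id, (size_t)(dash - id));
        kind[dash - id] = '\0';
        memcpy(triple, dash + 1, (size_t)(ddash - dash - 1));
        triple[ddash - dash - 1] = '\0';
        strcpy(arch, ddash + 2);
        return 1;
    }

    int main(void)
    {
        char kind[16], triple[32], arch[16];
        if (split_entry_id("hipv4-amdgcn-amd-amdhsa--gfx906",
                           kind, triple, arch))
            printf("kind=%s triple=%s arch=%s\n", kind, triple, arch);
        return 0;
    }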

At present it is not possible to query the names of the kernels in a UR
program for HIP, so it is necessary to know the (mangled or otherwise) names
of kernels in advance or by some other means.

UR kernels can be dispatched with ${x}EnqueueKernelLaunch. The argument
``pGlobalWorkOffset`` can only be used if the kernels have been instrumented to
take the extra global offset argument. Use of the global offset is not
recommended for non-SYCL compiler toolchains. This parameter can be ignored if
the user does not wish to use the global offset.

Other Notes
===========

- In-kernel ``printf`` may not work for certain ROCm versions.

Contributors
------------

* Hugh Delaney `hugh.delaney@codeplay.com <hugh.delaney@codeplay.com>`_

scripts/templates/index.rst.mako

Lines changed: 2 additions & 0 deletions

@@ -14,5 +14,7 @@
    core/INTRO.rst
    core/PROG.rst
    core/CONTRIB.rst
+   core/CUDA.rst
+   core/HIP.rst
    exp-features.rst
    api.rst
