Commit 8edc7e2

Author: Hugh Delaney
Message: Add initial CUDA and HIP usage guides
1 parent a3120e7 commit 8edc7e2

File tree

3 files changed: +256 -0 lines changed

scripts/core/CUDA.rst

Lines changed: 159 additions & 0 deletions

<%
OneApi=tags['$OneApi']
x=tags['$x']
X=x.upper()
%>

==========================
CUDA UR Reference Document
==========================

This document gives general guidelines on how to use UR to load and build
programs, and to execute kernels on a CUDA device.

Device code
===========

A CUDA device image may be made of PTX and/or SASS, two different kinds of
device code for NVIDIA GPUs.

CUDA device images can be generated by a CUDA-capable compiler toolchain. Most
CUDA compiler toolchains can generate PTX, SASS, or bundles of PTX and SASS.

PTX
---

PTX is a high-level NVIDIA ISA that can be JIT compiled at runtime by the CUDA
driver. In UR, this JIT compilation happens at ${x}ProgramBuild, where PTX is
assembled into device-specific SASS, which can then run on the device.

PTX is forward compatible, so PTX generated for ``.target sm_52`` will be JIT
compiled without issue for devices with a compute capability greater than
``sm_52``. However, PTX generated for ``sm_80`` cannot be JIT compiled for an
``sm_60`` device.
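
The forward-compatibility rule above reduces to a comparison of compute
capabilities. A minimal sketch (the function name and the integer encoding of
``sm_XX`` values are illustrative, not part of the UR or CUDA APIs):

.. code-block:: c

    #include <stdio.h>

    /* Illustrative sketch: PTX built for `ptx_target` can be JIT compiled
     * for any device whose compute capability is `ptx_target` or greater. */
    static int ptx_can_jit(int ptx_target, int device_cc)
    {
        return device_cc >= ptx_target;
    }

    int main(void)
    {
        /* sm_52 PTX on an sm_80 device: JIT compiles without issue. */
        printf("sm_52 on sm_80: %s\n",
               ptx_can_jit(52, 80) ? "ok" : "CUDA_ERROR_NO_BINARY_FOR_GPU");
        /* sm_80 PTX on an sm_60 device: cannot be JIT compiled. */
        printf("sm_80 on sm_60: %s\n",
               ptx_can_jit(80, 60) ? "ok" : "CUDA_ERROR_NO_BINARY_FOR_GPU");
        return 0;
    }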

An advantage of using PTX over SASS is that the same code can run on multiple
devices. However, PTX generated for an older arch may not give access to newer
hardware instructions, such as new atomic operations or tensor core
instructions.

JIT compilation has some overhead at ${x}ProgramBuild, especially if the
program being loaded contains multiple kernels. The ``ptxjitcompiler`` keeps a
JIT cache, however, so this overhead is only paid the first time a program is
built. JIT caching may be turned off by setting the environment variable
``CUDA_CACHE_DISABLE=1``.

SASS
----

SASS is a device-specific binary that may be produced by ``ptxas`` or some
other tool. SASS is specific to an individual arch and is not portable across
arches.

A SASS file may be stored as a ``.cubin`` file by NVIDIA tools.

UR Programs
===========

A ${x}_program_handle_t has a one-to-one mapping with the CUDA driver object
`CUmodule <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MODULE.html#group__CUDA__MODULE>`_.

In UR for CUDA, a ${x}_program_handle_t can be created using
${x}ProgramCreateWithBinary with:

* A single PTX module, stored as a null-terminated ``uint8_t`` buffer.
* A single SASS module, stored as an opaque ``uint8_t`` buffer.
* A mixed PTX/SASS module, where the SASS module is the assembled PTX module.

A ${x}_program_handle_t is valid only for a single architecture. If a CUDA
compatible binary contains device code for multiple NVIDIA architectures, it is
the user's responsibility to split out the separate device images so that
${x}ProgramCreateWithBinary is only called with a device binary for a single
device arch.
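
The splitting requirement above amounts to a selection step: pick the one
image in a multi-arch binary that matches the device, and hand only that image
to the program-creation call. A minimal sketch, where the ``device_image``
struct and the arch strings are hypothetical, not UR types:

.. code-block:: c

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical description of one device image in a multi-arch binary. */
    struct device_image {
        const char *arch;          /* e.g. "sm_60", "sm_80" */
        const unsigned char *data; /* PTX or SASS bytes */
    };

    /* Pick the single image matching the device arch; only this image
     * should be passed on. Returns NULL if none matches. */
    static const struct device_image *
    select_image(const struct device_image *imgs, int n, const char *device_arch)
    {
        for (int i = 0; i < n; ++i)
            if (strcmp(imgs[i].arch, device_arch) == 0)
                return &imgs[i];
        return NULL;
    }

    int main(void)
    {
        struct device_image imgs[] = {
            { "sm_60", (const unsigned char *)"..." },
            { "sm_80", (const unsigned char *)"..." },
        };
        const struct device_image *img = select_image(imgs, 2, "sm_80");
        printf("selected arch: %s\n", img ? img->arch : "(none)");
        return 0;
    }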

If a program is large and contains many kernels, loading and/or JIT compiling
the program may have a high overhead. This can be mitigated by splitting the
program into multiple smaller programs (corresponding to PTX/SASS files). In
this way, an application will only pay the overhead of loading and compiling
the kernels that it is likely to use.

Using PTX Modules in UR
-----------------------

A PTX module will be loaded and JIT compiled for the necessary architecture at
${x}ProgramBuild. If the PTX module has been generated for a compute capability
greater than the compute capability of the device, then ${x}ProgramBuild will
fail with the error ``CUDA_ERROR_NO_BINARY_FOR_GPU``.

A PTX module passed to ${x}ProgramBuild must contain only one PTX file.
Separate PTX files are to be handled separately.

Arguments may be passed to the ``ptxjitcompiler`` via ${x}ProgramBuild.
Currently ``maxrregcount`` is the only supported argument.

.. parsed-literal::

    ${x}ProgramBuild(ctx, program, "maxrregcount=128");

Using SASS Modules in UR
------------------------

A SASS module will be loaded and checked for compatibility at ${x}ProgramBuild.
If the SASS module is incompatible with the device arch then ${x}ProgramBuild
will fail with the error ``CUDA_ERROR_NO_BINARY_FOR_GPU``.

Using Mixed PTX/SASS Bundles in UR
----------------------------------

Mixed PTX/SASS modules can be used to make a program with
${x}ProgramCreateWithBinary. At ${x}ProgramBuild the CUDA driver will check
whether the bundled SASS is compatible with the active device. If the SASS is
compatible then the ${x}_program_handle_t will be built from the SASS; if
not, then the PTX will be used as a fallback and JIT compiled by the CUDA
driver. If both the PTX and the SASS are incompatible with the active device
then ${x}ProgramBuild will fail with the error
``CUDA_ERROR_NO_BINARY_FOR_GPU``.
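
The fallback behaviour described above can be sketched as plain control flow.
This is an illustration only, using integer compute capabilities and the
simplification that SASS is compatible only with the exact arch it was built
for:

.. code-block:: c

    #include <stdio.h>

    enum build_result { USE_SASS, JIT_PTX, NO_BINARY_FOR_GPU };

    /* Sketch of the selection: prefer compatible SASS, fall back to JIT
     * compiling the PTX, otherwise fail. */
    static enum build_result
    build_mixed_bundle(int sass_arch, int ptx_target, int device_cc)
    {
        if (sass_arch == device_cc)
            return USE_SASS;       /* program built from the SASS */
        if (ptx_target <= device_cc)
            return JIT_PTX;        /* PTX JIT compiled as a fallback */
        return NO_BINARY_FOR_GPU;  /* CUDA_ERROR_NO_BINARY_FOR_GPU */
    }

    static const char *result_name(enum build_result r)
    {
        switch (r) {
        case USE_SASS: return "use SASS";
        case JIT_PTX:  return "JIT compile PTX";
        default:       return "CUDA_ERROR_NO_BINARY_FOR_GPU";
        }
    }

    int main(void)
    {
        printf("%s\n", result_name(build_mixed_bundle(80, 80, 80)));
        printf("%s\n", result_name(build_mixed_bundle(80, 52, 70)));
        printf("%s\n", result_name(build_mixed_bundle(90, 90, 70)));
        return 0;
    }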

UR Kernels
==========

Once ${x}ProgramCreateWithBinary and ${x}ProgramBuild have succeeded, kernels
can be fetched from programs with ${x}KernelCreate. ${x}KernelCreate must be
called with the exact name of the kernel in the PTX/SASS module. This name will
depend on the mangling used when compiling the kernel, so it is recommended to
examine the symbols in the PTX/SASS module before trying to extract kernels in
UR.

.. code-block:: console

    $ cuobjdump --dump-elf-symbols hello.cubin | grep mykernel
    _Z8mykernelv
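
Mangled kernel names follow the Itanium C++ ABI scheme: ``_Z``, the decimal
length of the function name, the name itself, then parameter type codes
(``v`` encodes an empty parameter list). A sketch covering only the
no-argument free-function case (real mangling also handles namespaces,
arguments, and templates):

.. code-block:: c

    #include <stdio.h>
    #include <string.h>

    /* Itanium-style mangling for a free function with no arguments:
     * "_Z" + decimal length of the name + name + "v". */
    static void mangle_noargs(const char *name, char *out, size_t outsz)
    {
        snprintf(out, outsz, "_Z%zu%sv", strlen(name), name);
    }

    int main(void)
    {
        char sym[64];
        /* void mykernel() mangles to _Z8mykernelv */
        mangle_noargs("mykernel", sym, sizeof sym);
        printf("%s\n", sym);
        return 0;
    }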

At present it is not possible to query the names of the kernels in a UR
program for CUDA, so it is necessary to know the (mangled or otherwise) names
of kernels in advance or by some other means.

UR kernels can be dispatched with ${x}EnqueueKernelLaunch. The argument
``pGlobalWorkOffset`` can only be used if the kernels have been instrumented to
take the extra global offset argument. Use of the global offset is not
recommended for non-SYCL compiler toolchains. This parameter can be ignored if
the user does not wish to use the global offset.

Other Notes
===========

- The environment variable ``SYCL_PI_CUDA_MAX_LOCAL_MEM_SIZE`` can be set in
  order to exceed the default max dynamic local memory size. More information
  can be found
  `here <https://intel.github.io/llvm-docs/EnvironmentVariables.html#controlling-dpc-cuda-plugin>`_.
- The size of primitive datatypes may differ between host and device code. For
  instance, NVCC treats ``long double`` as 8 bytes for device and 16 bytes for
  host.
- In-kernel ``printf`` for NVPTX targets does not support the ``%z`` modifier.

Contributors
------------

* Hugh Delaney `hugh.delaney@codeplay.com <hugh.delaney@codeplay.com>`_

scripts/core/HIP.rst

Lines changed: 95 additions & 0 deletions

<%
OneApi=tags['$OneApi']
x=tags['$x']
X=x.upper()
%>

=============================
AMD HIP UR Reference Document
=============================

This document gives general guidelines on how to use UR to execute kernels on
an AMD HIP device.

Device code
===========

Unlike the NVPTX platform, AMDGPU does not use a device IR that can be JIT
compiled at runtime. Therefore, all device binaries must be precompiled for a
particular arch.

The naming of AMDGPU device code files may vary across different generations
of devices. ``.hsa`` and ``.hsaco`` are common extensions as of 2023.

HIPCC can generate device code for a particular arch using the ``--genco``
flag:

.. code-block:: console

    $ hipcc --genco hello.cu --amdgpu-target=gfx906 -o hello.hsaco

UR Programs
===========

A ${x}_program_handle_t has a one-to-one mapping with the HIP runtime object
`hipModule_t <https://docs.amd.com/projects/HIP/en/latest/.doxygen/docBin/html/group___module.html>`__.

In UR for HIP, a ${x}_program_handle_t can be created using
${x}ProgramCreateWithBinary with:

* A single device code module

A ${x}_program_handle_t is valid only for a single architecture. If a HIP
compatible binary contains device code for multiple AMDGPU architectures, it is
the user's responsibility to split out the separate device images so that
${x}ProgramCreateWithBinary is only called with a device binary for a single
device arch.

If the AMDGPU module is incompatible with the device arch then ${x}ProgramBuild
will fail with the error ``hipErrorNoBinaryForGpu``.

If a program is large and contains many kernels, loading the program may have a
high overhead. This can be mitigated by splitting the program into multiple
smaller programs. In this way, an application will only pay the overhead of
loading the kernels that it is likely to use.

Kernels
=======

Once ${x}ProgramCreateWithBinary and ${x}ProgramBuild have succeeded, kernels
can be fetched from programs with ${x}KernelCreate. ${x}KernelCreate must be
called with the exact name of the kernel in the AMDGPU device code module. This
name will depend on the mangling used when compiling the kernel, so it is
recommended to examine the symbols in the AMDGPU device code module before
trying to extract kernels in UR code.

``llvm-objdump`` or ``readelf`` may not correctly view the symbols in an AMDGPU
device module. It may be necessary to call ``clang-offload-bundler`` first in
order to extract the ``ELF`` file that can be passed to ``readelf``.

.. code-block:: console

    $ clang-offload-bundler --unbundle --input=hello.hsaco --output=hello.o \
        --targets=hipv4-amdgcn-amd-amdhsa--gfx906 --type=o
    $ readelf hello.o -s | grep mykernel
    _Z8mykernelv
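
The ``--targets`` value above is a bundle entry ID of the form
``<offload kind>-<target triple>--<arch>``. A small sketch of splitting such
an ID into its parts (a hypothetical helper for illustration; it is not part
of ``clang-offload-bundler`` and does no bounds checking):

.. code-block:: c

    #include <stdio.h>
    #include <string.h>

    /* Split e.g. "hipv4-amdgcn-amd-amdhsa--gfx906" into an offload kind,
     * a target triple, and an arch. Returns 1 on success, 0 otherwise. */
    static int split_entry_id(const char *id,
                              char *kind, char *triple, char *arch)
    {
        const char *dash = strchr(id, '-');    /* end of the offload kind */
        const char *ddash = strstr(id, "--");  /* separator before the arch */
        if (!dash || !ddash || ddash <= dash)
            return 0;
        memcpy(kind, id, (size_t)(dash - id));
        kind[dash - id] = '\0';
        memcpy(triple, dash + 1, (size_t)(ddash - dash - 1));
        triple[ddash - dash - 1] = '\0';
        strcpy(arch, ddash + 2);
        return 1;
    }

    int main(void)
    {
        char kind[16], triple[32], arch[16];
        if (split_entry_id("hipv4-amdgcn-amd-amdhsa--gfx906",
                           kind, triple, arch))
            printf("kind=%s triple=%s arch=%s\n", kind, triple, arch);
        return 0;
    }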

At present it is not possible to query the names of the kernels in a UR
program for HIP, so it is necessary to know the (mangled or otherwise) names
of kernels in advance or by some other means.

UR kernels can be dispatched with ${x}EnqueueKernelLaunch. The argument
``pGlobalWorkOffset`` can only be used if the kernels have been instrumented to
take the extra global offset argument. Use of the global offset is not
recommended for non-SYCL compiler toolchains. This parameter can be ignored if
the user does not wish to use the global offset.

Other Notes
===========

- In-kernel ``printf`` may not work for certain ROCm versions.

Contributors
------------

* Hugh Delaney `hugh.delaney@codeplay.com <hugh.delaney@codeplay.com>`_

scripts/templates/index.rst.mako

Lines changed: 2 additions & 0 deletions

@@ -14,5 +14,7 @@
    core/INTRO.rst
    core/PROG.rst
    core/CONTRIB.rst
+   core/CUDA.rst
+   core/HIP.rst
    exp-features.rst
    api.rst
