<%
    OneApi=tags['$OneApi']
    x=tags['$x']
    X=x.upper()
%>

==========================
CUDA UR Reference Document
==========================

This document gives general guidelines of how to use UR to load and build
programs, and execute kernels on a CUDA device.

Device code
===========

A CUDA device image may be made of PTX and/or SASS, two different kinds of
device code for NVIDIA GPUs.

CUDA device images can be generated by a CUDA-capable compiler toolchain. Most
CUDA compiler toolchains are capable of generating PTX, SASS, and/or bundles of
PTX and SASS.
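
For example, PTX, SASS, or a combined bundle might be produced with ``nvcc``
as follows; the source and output file names here are hypothetical:

.. code-block:: console

    $ nvcc --ptx kernel.cu -o kernel.ptx
    $ nvcc --cubin -arch=sm_80 kernel.cu -o kernel.cubin
    $ nvcc --fatbin -gencode arch=compute_80,code=sm_80 \
           -gencode arch=compute_80,code=compute_80 kernel.cu -o kernel.fatbin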

PTX
---

PTX is a high-level NVIDIA ISA which can be JIT compiled at runtime by the CUDA
driver. In UR, this JIT compilation happens at ${x}ProgramBuild, where PTX is
assembled into device-specific SASS which can then run on the device.

PTX is forward compatible, so PTX generated for ``.target sm_52`` will be JIT
compiled without issue for devices with a greater compute capability than
``sm_52``. By contrast, PTX generated for ``sm_80`` cannot be JIT compiled for
an ``sm_60`` device.

An advantage of using PTX over SASS is that the same code can run on multiple
devices. However, PTX generated for an older arch may not give access to newer
hardware instructions, such as new atomic operations or tensor core
instructions.

JIT compilation has some overhead at ${x}ProgramBuild, especially if the program
that is being loaded contains multiple kernels. The ``ptxjitcompiler`` keeps a
JIT cache, however, so this overhead is only paid the first time that a program
is built. JIT caching may be turned off by setting the environment variable
``CUDA_CACHE_DISABLE=1``.
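
For example, to disable JIT caching for a single run (the application name
below is hypothetical):

.. code-block:: console

    $ CUDA_CACHE_DISABLE=1 ./my-ur-app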

SASS
----

SASS is a device-specific binary which may be produced by ``ptxas`` or some
other tool. SASS is specific to an individual architecture and is not portable
across architectures.

A SASS file may be stored as a ``.cubin`` file by NVIDIA tools.
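
For instance, SASS might be produced from an existing PTX file with ``ptxas``;
the file names here are hypothetical:

.. code-block:: console

    $ ptxas -arch=sm_80 kernel.ptx -o kernel.cubin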

UR Programs
===========

A ${x}_program_handle_t has a one-to-one mapping with the CUDA driver object
`CUmodule <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MODULE.html#group__CUDA__MODULE>`_.

In UR for CUDA, a ${x}_program_handle_t can be created using
${x}ProgramCreateWithBinary with:

* A single PTX module, stored as a null-terminated ``uint8_t`` buffer.
* A single SASS module, stored as an opaque ``uint8_t`` buffer.
* A mixed PTX/SASS module, where the SASS module is the assembled PTX module.

A ${x}_program_handle_t is valid only for a single architecture. If a CUDA
compatible binary contains device code for multiple NVIDIA architectures, it is
the user's responsibility to split it into separate device images so that
${x}ProgramCreateWithBinary is only called with a device binary for a single
device architecture.
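
As an illustration, a program might be created from a single PTX module and
then built as follows. This is a minimal sketch, assuming that
${x}ProgramCreateWithBinary takes a context, a device, the binary size and
pointer, optional properties, and an output handle; the ``ptxSource`` and
``ptxLength`` variables, as well as the ``ctx`` and ``device`` handles, stand
in for setup that is not shown here.

.. parsed-literal::

    // ptxSource is a null-terminated buffer containing one PTX module
    // (hypothetical); ctx and device have already been obtained.
    ${x}_program_handle_t program = nullptr;
    ${x}ProgramCreateWithBinary(ctx, device, ptxLength, ptxSource,
                                nullptr, &program);

    // JIT compile the PTX for the active device; no build options passed.
    ${x}ProgramBuild(ctx, program, nullptr);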

If a program is large and contains many kernels, loading and/or JIT compiling
the program may have a high overhead. This can be mitigated by splitting a
program into multiple smaller programs (corresponding to PTX/SASS files). In
this way, an application will only pay the overhead of loading/compiling the
kernels that it is likely to use.

Using PTX Modules in UR
-----------------------

A PTX module will be loaded and JIT compiled for the necessary architecture at
${x}ProgramBuild. If the PTX module has been generated for a compute capability
greater than the compute capability of the device, then ${x}ProgramBuild will
fail with the error ``CUDA_ERROR_NO_BINARY_FOR_GPU``.

A PTX module passed to ${x}ProgramBuild must contain only one PTX file.
Multiple PTX files must be handled as separate programs.

Arguments may be passed to the ``ptxjitcompiler`` via ${x}ProgramBuild.
Currently ``maxrregcount`` is the only supported argument.

.. parsed-literal::

    ${x}ProgramBuild(ctx, program, "maxrregcount=128");

Using SASS Modules in UR
------------------------

A SASS module will be loaded and checked for compatibility at ${x}ProgramBuild.
If the SASS module is incompatible with the device arch then ${x}ProgramBuild
will fail with the error ``CUDA_ERROR_NO_BINARY_FOR_GPU``.

Using Mixed PTX/SASS Bundles in UR
----------------------------------

Mixed PTX/SASS modules can be used to make a program with
${x}ProgramCreateWithBinary. At ${x}ProgramBuild the CUDA driver will check
whether the bundled SASS is compatible with the active device. If the SASS is
compatible then the ${x}_program_handle_t will be built from the SASS, and if
not then the PTX will be used as a fallback and JIT compiled by the CUDA
driver. If both PTX and SASS are incompatible with the active device then
${x}ProgramBuild will fail with the error ``CUDA_ERROR_NO_BINARY_FOR_GPU``.

UR Kernels
==========

Once ${x}ProgramCreateWithBinary and ${x}ProgramBuild have succeeded, kernels
can be fetched from programs with ${x}KernelCreate. ${x}KernelCreate must be
called with the exact name of the kernel in the PTX/SASS module. This name will
depend on the mangling used when compiling the kernel, so it is recommended to
examine the symbols in the PTX/SASS module before trying to extract kernels in
UR.

.. code-block:: console

    $ cuobjdump --dump-elf-symbols hello.cubin | grep mykernel
    _Z13mykernelv

At present it is not possible to query the names of the kernels in a UR program
for CUDA, so the (mangled or otherwise) kernel names must be known in advance
or discovered by some other means.
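
The mangled name found above can then be passed to ${x}KernelCreate. A minimal
sketch, assuming a ``program`` handle that has already been built:

.. parsed-literal::

    // Create a kernel handle using the mangled symbol name from cuobjdump.
    ${x}_kernel_handle_t kernel = nullptr;
    ${x}KernelCreate(program, "_Z13mykernelv", &kernel);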

UR kernels can be dispatched with ${x}EnqueueKernelLaunch. The argument
``pGlobalWorkOffset`` can only be used if the kernels have been instrumented to
take the extra global offset argument. Use of the global offset is not
recommended for non-SYCL compiler toolchains. This parameter can be ignored if
the user does not wish to use the global offset.
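
For example, a kernel might be launched over a one-dimensional range with the
global offset left unused. This is a minimal sketch, assuming that
${x}EnqueueKernelLaunch takes the queue, the kernel, the work dimension, an
optional global offset, global and local work sizes, and an event wait list;
the ``queue`` handle and the chosen sizes are hypothetical.

.. parsed-literal::

    // Launch a 1-D range of 1024 work-items in groups of 64, with no global
    // offset and no event wait list.
    size_t globalSize = 1024;
    size_t localSize = 64;
    ${x}EnqueueKernelLaunch(queue, kernel, 1, nullptr, &globalSize, &localSize,
                            0, nullptr, nullptr);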

Other Notes
===========

- The environment variable ``SYCL_PI_CUDA_MAX_LOCAL_MEM_SIZE`` can be set in
  order to exceed the default max dynamic local memory size (see the sketch
  after this list). More information can be found
  `here <https://intel.github.io/llvm-docs/EnvironmentVariables.html#controlling-dpc-cuda-plugin>`_.
- The size of primitive datatypes may differ between host and device code. For
  instance, NVCC treats ``long double`` as 8 bytes for device and 16 bytes for
  host.
- In-kernel ``printf`` for NVPTX targets does not support the ``%z`` modifier.
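
As an illustration, the variable might be set on the command line as follows.
The value is assumed to be a size in bytes, per the linked documentation, and
the application name is hypothetical:

.. code-block:: console

    $ SYCL_PI_CUDA_MAX_LOCAL_MEM_SIZE=65536 ./my-sycl-app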

Contributors
------------

* Hugh Delaney `hugh.delaney@codeplay.com <hugh.delaney@codeplay.com>`_