Example 01: Vector addition
===========================

This trivial example compares a simple vector addition in CUDA with an
equivalent implementation in SYCL for CUDA.
It also highlights how to build an application with SYCL for CUDA using
DPC++ support; an example CMake build script is provided.
For detailed documentation on how to migrate from CUDA to SYCL, see
[SYCL For CUDA Developers](https://developer.codeplay.com/products/computecpp/ce/guides/sycl-for-cuda-developers).

Note that the CUDA backend does not currently support the
[USM](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/USM.adoc)
extension, so we use `sycl::buffer` and `sycl::accessor` instead.

Pre-requisites
--------------

You will need an installation of DPC++ with CUDA support; see the
[Getting Started Guide](https://github.com/codeplaysoftware/sycl-for-cuda/blob/cuda/sycl/doc/GetStartedWithSYCLCompiler.md)
for details on how to build it.

The example has been built with CMake 3.13.3 and nvcc 10.1.243.

Building the example
--------------------

```sh
$ mkdir build && cd build
$ cmake ../ -DSYCL_ROOT=/path/to/dpc++/install \
      -DCMAKE_CXX_COMPILER=/path/to/dpc++/install/bin/clang++
$ make -j 8
```

This should produce two binaries, `vector_addition` and `sycl_vector_addition`.
The former is the unmodified CUDA source; the latter is the SYCL for CUDA
version.

Running the example
-------------------

The path to `libsycl.so` and the PI plugins must be in `LD_LIBRARY_PATH`.
A simple way of running the app is as follows:

```sh
$ LD_LIBRARY_PATH=$HOME/open-source/sycl4cuda/lib ./sycl_vector_addition
```

Note that the `SYCL_BE` environment variable is not required, since we use a
custom device selector.

CMake build script
------------------

The provided CMake build script uses the native CUDA support in CMake to
build the CUDA application. It also serves as a check that all CUDA
requirements are available on the system (such as a CUDA installation).

Two flags are required: `-DSYCL_ROOT`, which must point to the directory
where the DPC++ compiler is installed, and `-DCMAKE_CXX_COMPILER`, which must
point to the Clang compiler provided by DPC++.

The CMake target `sycl_vector_addition` builds the SYCL version of the
application.
Note that the variable `SYCL_FLAGS` stores the Clang flags that enable the
compilation of a SYCL application (`-fsycl`) as well as the flag that
specifies which targets are built (`-fsycl-targets`).
In this case, we build the example for both NVPTX and SPIR64.
This means the vector-addition kernel is compiled for both backends, and the
queue selected at runtime determines which variant is used.
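
As an illustration, the relevant portion of such a CMake script might look
like the following. This is a sketch based on the description above, not the
exact script: the target triples and the use of `target_link_options` are
assumptions.

```cmake
# Sketch: SYCL_FLAGS enables SYCL compilation (-fsycl) and selects both
# the NVPTX (CUDA) and SPIR64 targets (-fsycl-targets).
set(SYCL_FLAGS "-fsycl"
    "-fsycl-targets=nvptx64-nvidia-cuda-sycldevice,spir64-unknown-unknown-sycldevice")

add_executable(sycl_vector_addition vector_addition.cpp)
target_compile_features(sycl_vector_addition PRIVATE cxx_std_17)
target_compile_options(sycl_vector_addition PRIVATE ${SYCL_FLAGS})
# The same flags must be passed at link time so the device code for both
# targets is linked in (target_link_options requires CMake >= 3.13).
target_link_options(sycl_vector_addition PRIVATE ${SYCL_FLAGS})
```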

Note that the project is built with C++17 support, which enables the usage of
[deduction guides](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/deduction_guides/SYCL_INTEL_deduction_guides.asciidoc)
to reduce the number of template parameters used.

SYCL Vector Addition code
-------------------------

The vector addition example takes a simple approach: a plain kernel performs
the addition, and the vectors are stored directly in buffers.
Data is initialized on the host using host accessors.
This approach avoids creating unnecessary storage on the host, and allows the
SYCL runtime to use optimized memory paths.
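
A minimal sketch of this pattern is shown below (buffer names and sizes are
illustrative, not the exact source of the example); building it requires a
SYCL toolchain such as DPC++.

```cpp
#include <CL/sycl.hpp>

namespace sycl = cl::sycl;

int main() {
  constexpr size_t N = 1024;

  // The buffers own the storage: no separate host vectors are required.
  // With the C++17 deduction guides mentioned above, the template
  // parameters could also be deduced, e.g. sycl::buffer buf{sycl::range{N}};
  sycl::buffer<int> buf_a{sycl::range<1>{N}};
  sycl::buffer<int> buf_b{sycl::range<1>{N}};

  // Initialize the data on the host through write host accessors. The
  // braces end the accessors' lifetime, releasing the buffers for use by
  // kernels submitted later.
  {
    auto h_a = buf_a.get_access<sycl::access::mode::write>();
    auto h_b = buf_b.get_access<sycl::access::mode::write>();
    for (size_t i = 0; i < N; ++i) {
      h_a[i] = static_cast<int>(i);
      h_b[i] = static_cast<int>(N - i);
    }
  }
}
```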

The SYCL queue created later on uses a custom `CUDASelector` to select a CUDA
device, or bail out if there isn't one.
The CUDA selector uses `info::device::driver_version` to identify the device
exported by the CUDA backend.
If the NVIDIA OpenCL implementation is available on the system, it will be
reported as another SYCL device, and the driver version is the best way to
differentiate between the two.
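
Such a selector might be sketched as follows. This is illustrative rather
than the exact source: the assumption is that the CUDA backend tags its
driver version string with "CUDA", which the NVIDIA OpenCL device does not.

```cpp
#include <CL/sycl.hpp>
#include <string>

namespace sycl = cl::sycl;

// Score CUDA devices positively and reject everything else.
class CUDASelector : public sycl::device_selector {
 public:
  int operator()(const sycl::device& dev) const override {
    const std::string driver =
        dev.get_info<sycl::info::device::driver_version>();
    if (driver.find("CUDA") != std::string::npos) {
      return 1;   // positive score: selectable
    }
    return -1;    // negative score: never selected
  }
};
```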

The command group is created as a lambda expression that takes a
`sycl::handler` parameter. Accessors are obtained from the buffers using the
`get_access` method.
Finally, `parallel_for` is invoked with the SYCL kernel as usual.
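
Continuing the earlier sketch, the submission might look like this
(`buf_a`, `buf_b`, an output buffer `buf_c`, `N`, and `CUDASelector` are
assumed from the snippets above):

```cpp
// Create a queue on the CUDA device via the custom selector.
sycl::queue q{CUDASelector{}};

q.submit([&](sycl::handler& cgh) {
  // Device accessors are obtained from the buffers via get_access(cgh).
  auto a = buf_a.get_access<sycl::access::mode::read>(cgh);
  auto b = buf_b.get_access<sycl::access::mode::read>(cgh);
  auto c = buf_c.get_access<sycl::access::mode::write>(cgh);

  // Plain SYCL kernel performing the element-wise addition.
  cgh.parallel_for<class vec_add>(
      sycl::range<1>{N}, [=](sycl::id<1> i) { c[i] = a[i] + b[i]; });
});
```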

The command group is submitted to a queue, which converts all the operations
into CUDA commands; these are executed once the host accessor is encountered
later on.

The host accessor triggers a copy of the data back to the host, and the
values are then reduced into a single sum element.
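
The final step could be sketched as follows (again assuming `buf_c` and `N`
from the snippets above):

```cpp
// Creating a read host accessor blocks until the device work has finished
// and the results have been copied back to the host.
auto h_c = buf_c.get_access<sycl::access::mode::read>();

// Reduce the result vector into a single sum element on the host.
int sum = 0;
for (size_t i = 0; i < N; ++i) {
  sum += h_c[i];
}
```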