Skip to content

Commit e1d7bb3

Browse files
Ruykjwlawson
andauthored
Vector addition using SYCL for CUDA (#1)
* Vector addition using SYCL for CUDA Added trivial example of vector addition using SYCL for CUDA support in DPC++. Includes a simple CMake build configuration to call the DPC++ compiler and build with both CUDA and SPIR support. Co-Authored-By: John Lawson <john@codeplay.com>
1 parent c242f0f commit e1d7bb3

File tree

5 files changed

+322
-0
lines changed

5 files changed

+322
-0
lines changed

README

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,8 @@
1+
SYCL for CUDA examples
2+
==========================
3+
4+
This repository contains examples that demonstrate how to use the CUDA backend
5+
in SYCL.
6+
7+
The examples are built and test in Linux with GCC 7.4, NVCC 10.1 and the
8+
experimental support for CUDA in the DPC++ SYCL implementation.

example-01/CMakeLists.txt

Lines changed: 30 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,30 @@
1+
cmake_minimum_required(VERSION 3.10 FATAL_ERROR)
2+
project(cmake_and_cuda LANGUAGES CXX CUDA)
3+
4+
include(CTest)
5+
6+
# SYCL installation
7+
if (NOT SYCL_ROOT)
8+
message(FATAL_ERROR "No SYCL installation detected")
9+
endif(NOT SYCL_ROOT)
10+
11+
set(SYCL_INCLUDE_DIR "${SYCL_ROOT}/lib/clang/11.0.0/include/")
12+
set(SYCL_LIB "${SYCL_ROOT}/lib/libsycl.so")
13+
set(SYCL_FLAGS "-fsycl"
14+
"-fsycl-targets=nvptx64-nvidia-cuda-sycldevice,spir64-unknown-linux-sycldevice"
15+
"-fsycl-unnamed-lambda")
16+
17+
# Build the CUDA code
18+
add_executable(vector_addition vector_addition.cu)
19+
target_compile_features(vector_addition PUBLIC cxx_std_11)
20+
set_target_properties(vector_addition PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
21+
set_property(TARGET vector_addition PROPERTY BUILD_RPATH "${CMAKE_CUDA_IMPLICIT_LINK_DIRECTORIES}")
22+
23+
# Build the SYCL code
24+
add_executable (sycl_vector_addition vector_addition.cpp)
25+
target_compile_features(sycl_vector_addition PUBLIC cxx_std_17)
26+
target_compile_options(sycl_vector_addition PUBLIC ${SYCL_FLAGS})
27+
target_link_libraries(sycl_vector_addition PUBLIC ${SYCL_FLAGS})
28+
target_include_directories(sycl_vector_addition PUBLIC ${SYCL_INCLUDE_DIR})
29+
target_link_libraries(sycl_vector_addition PUBLIC ${SYCL_LIB})
30+

example-01/README.md

Lines changed: 103 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,103 @@
1+
Example 01: Vector addition
2+
===============================
3+
4+
This trivial example can be used to compare a simple vector addition in
5+
CUDA to an equivalent implementation in SYCL for CUDA.
6+
The aim of the example is also to highlight how to build an application
7+
with SYCL for CUDA using DPC++ support, for which an example CMakefile is
8+
provided.
9+
For detailed documentation on how to migrate from CUDA to SYCL, see
10+
[SYCL For CUDA Developers](https://developer.codeplay.com/products/computecpp/ce/guides/sycl-for-cuda-developers).
11+
12+
Note currently the CUDA backend does not support the
13+
[USM](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/USM.adoc)
14+
extension, so we use `sycl::buffer` and `sycl::accessors` instead.
15+
16+
Pre-requisites
17+
---------------
18+
19+
You would need an installation of DPC++ with CUDA support,
20+
see [Getting Started Guide](https://github.com/codeplaysoftware/sycl-for-cuda/blob/cuda/sycl/doc/GetStartedWithSYCLCompiler.md)
21+
for details on how to build it.
22+
23+
The example has been built on CMake 3.13.3 and nvcc 10.1.243.
24+
25+
Building the example
26+
---------------------
27+
28+
```sh
29+
$ mkdir build && cd build`
30+
$ cmake ../ -DSYCL_ROOT=/path/to/dpc++/install \
31+
-DCMAKE_CXX_COMPILER=/path/to/dpc++/install/bin/clang++
32+
$ make -j 8
33+
```
34+
35+
This should produce two binaries, `vector_addition` and `sycl_vector_addition`.
36+
The former is the unmodified CUDA source and the second is the SYCL for CUDA
37+
version.
38+
39+
Running the example
40+
--------------------
41+
42+
The path to `libsycl.so` and the PI plugins must be in `LD_LIBRARY_PATH`.
43+
A simple way of running the app is as follows:
44+
45+
```
46+
$ LD_LIBRARY_PATH=$HOME/open-source/sycl4cuda/lib ./sycl_vector_addition
47+
```
48+
49+
Note the `SYCL_BE` env variable is not required, since we use a custom
50+
device selector.
51+
52+
CMake Build script
53+
------------------------
54+
55+
The provided CMake build script uses the native CUDA support to build the
56+
CUDA application. It also serves as a check that all CUDA requirements
57+
on the system are available (such as an installation of CUDA on the system).
58+
59+
Two flags are required: `-DSYCL_ROOT`, which must point to the place where the
60+
DPC++ compiler is installed, and `-DCMAKE_CXX_COMPILER`, which must point to
61+
the Clang compiler provided by DPC++.
62+
63+
The CMake target `sycl_vector_addition` will build the SYCL version of
64+
the application.
65+
Note the variable `SYCL_FLAGS` is used to store the Clang flags that enable
66+
the compilation of a SYCL application (`-fsycl`) but also the flag that specify
67+
which targets are built (`-fsycl-targets`).
68+
In this case, we will build the example for both NVPTX and SPIR64.
69+
This means the kernel for the vector addition will be compiled for both
70+
backends, and runtime selection to the right queue will decide which variant
71+
to use.
72+
73+
Note the project is built with C++17 support, which enables the usage of
74+
[deduction guides](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/deduction_guides/SYCL_INTEL_deduction_guides.asciidoc) to reduce the number of template parameters used.
75+
76+
SYCL Vector Addition code
77+
--------------------------
78+
79+
The vector addition example uses a simple approach to implement with a plain
80+
kernel that performs the add. Vectors are stored directly in buffers.
81+
Data is initialized on the host using host accessors.
82+
This approach avoids creating unnecessary storage on the host, and facilitates
83+
the SYCL runtime to use optimized memory paths.
84+
85+
The SYCL queue created later on uses a custom `CUDASelector` to select
86+
a CUDA device, or bail out if its not there.
87+
The CUDA selector uses the `info::device::driver_version` to identify the
88+
device exported by the CUDA backend.
89+
If the NVIDIA OpenCL implementation is available on the
90+
system, it will be reported as another SYCL device. The driver
91+
version is the best way to differentiate between the two.
92+
93+
The command group is created as a lambda expression that takes the
94+
`sycl::handler` parameter. Accessors are obtained from buffers using the
95+
`get_access` method.
96+
Finally the `parallel_for` with the SYCL kernel is invoked as usual.
97+
98+
The command group is submitted to a queue which will convert all the
99+
operations into CUDA commands that will be executed once the host accessor
100+
is encountered later on.
101+
102+
The host accessor will trigger a copy of the data back to the host, and
103+
then the values are reduced into a single sum element.

example-01/vector_addition.cpp

Lines changed: 91 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,91 @@
1+
/**
2+
* SYCL FOR CUDA : Vector Addition Example
3+
*
4+
* Copyright 2020 Codeplay Software Ltd.
5+
*
6+
* Licensed under the Apache License, Version 2.0 (the "License");
7+
* you may not use this file except in compliance with the License.
8+
* You may obtain a copy of the License at
9+
*
10+
* http://www.apache.org/licenses/LICENSE-2.0
11+
*
12+
* Unless required by applicable law or agreed to in writing, software
13+
* distributed under the License is distributed on an "AS IS" BASIS,
14+
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15+
* See the License for the specific language governing permissions and
16+
* limitations under the License.
17+
*
18+
* @File: vector_addition.cpp
19+
*/
20+
21+
#include <algorithm>
22+
#include <iostream>
23+
#include <vector>
24+
25+
#include <CL/sycl.hpp>
26+
27+
class CUDASelector : public sycl::device_selector {
28+
public:
29+
int operator()(const sycl::device &Device) const override {
30+
using namespace sycl::info;
31+
32+
const std::string DriverVersion = Device.get_info<device::driver_version>();
33+
34+
if (Device.is_gpu() && (DriverVersion.find("CUDA") != std::string::npos)) {
35+
std::cout << " CUDA device found " << std::endl;
36+
return 1;
37+
};
38+
return -1;
39+
}
40+
};
41+
42+
class vec_add;
43+
int main(int argc, char *argv[]) {
44+
constexpr const size_t N = 100000;
45+
const sycl::range VecSize{N};
46+
47+
sycl::buffer<double> bufA{VecSize};
48+
sycl::buffer<double> bufB{VecSize};
49+
sycl::buffer<double> bufC{VecSize};
50+
51+
// Initialize input data
52+
{
53+
const auto dwrite_t = sycl::access::mode::discard_write;
54+
55+
auto h_a = bufA.get_access<dwrite_t>();
56+
auto h_b = bufB.get_access<dwrite_t>();
57+
for (int i = 0; i < N; i++) {
58+
h_a[i] = sin(i) * sin(i);
59+
h_b[i] = cos(i) * cos(i);
60+
}
61+
}
62+
63+
sycl::queue myQueue{CUDASelector()};
64+
65+
// Command Group creation
66+
auto cg = [&](sycl::handler &h) {
67+
const auto read_t = sycl::access::mode::read;
68+
const auto write_t = sycl::access::mode::write;
69+
70+
auto a = bufA.get_access<read_t>(h);
71+
auto b = bufB.get_access<read_t>(h);
72+
auto c = bufC.get_access<write_t>(h);
73+
74+
h.parallel_for<vec_add>(VecSize,
75+
[=](sycl::id<1> i) { c[i] = a[i] + b[i]; });
76+
};
77+
78+
myQueue.submit(cg);
79+
80+
{
81+
const auto write_t = sycl::access::mode::read;
82+
auto h_c = bufC.get_access<write_t>();
83+
double sum = 0.0f;
84+
for (int i = 0; i < N; i++) {
85+
sum += h_c[i];
86+
}
87+
std::cout << "Sum is : " << sum << std::endl;
88+
}
89+
90+
return 0;
91+
}

example-01/vector_addition.cu

Lines changed: 90 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,90 @@
1+
// Original source reproduced unmodified here from:
2+
// https://github.com/olcf/vector_addition_tutorials/blob/master/CUDA/vecAdd.cu
3+
4+
#include <math.h>
5+
#include <stdio.h>
6+
#include <stdlib.h>
7+
8+
// CUDA kernel. Each thread takes care of one element of c
9+
__global__ void vecAdd(double *a, double *b, double *c, int n) {
10+
// Get our global thread ID
11+
int id = blockIdx.x * blockDim.x + threadIdx.x;
12+
13+
// Make sure we do not go out of bounds
14+
if (id < n)
15+
c[id] = a[id] + b[id];
16+
}
17+
18+
int main(int argc, char *argv[]) {
19+
// Size of vectors
20+
int n = 100000;
21+
22+
// Host input vectors
23+
double *h_a;
24+
double *h_b;
25+
// Host output vector
26+
double *h_c;
27+
28+
// Device input vectors
29+
double *d_a;
30+
double *d_b;
31+
// Device output vector
32+
double *d_c;
33+
34+
// Size, in bytes, of each vector
35+
size_t bytes = n * sizeof(double);
36+
37+
// Allocate memory for each vector on host
38+
h_a = (double *)malloc(bytes);
39+
h_b = (double *)malloc(bytes);
40+
h_c = (double *)malloc(bytes);
41+
42+
// Allocate memory for each vector on GPU
43+
cudaMalloc(&d_a, bytes);
44+
cudaMalloc(&d_b, bytes);
45+
cudaMalloc(&d_c, bytes);
46+
47+
int i;
48+
// Initialize vectors on host
49+
for (i = 0; i < n; i++) {
50+
h_a[i] = sin(i) * sin(i);
51+
h_b[i] = cos(i) * cos(i);
52+
}
53+
54+
// Copy host vectors to device
55+
cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
56+
cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);
57+
58+
int blockSize, gridSize;
59+
60+
// Number of threads in each thread block
61+
blockSize = 1024;
62+
63+
// Number of thread blocks in grid
64+
gridSize = (int)ceil((float)n / blockSize);
65+
66+
// Execute the kernel
67+
vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);
68+
69+
// Copy array back to host
70+
cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
71+
72+
// Sum up vector c and print result divided by n, this should equal 1 within
73+
// error
74+
double sum = 0;
75+
for (i = 0; i < n; i++)
76+
sum += h_c[i];
77+
printf("final result: %f\n", sum / n);
78+
79+
// Release device memory
80+
cudaFree(d_a);
81+
cudaFree(d_b);
82+
cudaFree(d_c);
83+
84+
// Release host memory
85+
free(h_a);
86+
free(h_b);
87+
free(h_c);
88+
89+
return 0;
90+
}

0 commit comments

Comments
 (0)