Commit b99d13a

Merge pull request #3 from codeplaysoftware/add-research-papers
Add numerous research papers
2 parents 2acc9aa + 225abef commit b99d13a

30 files changed

+907
-0
lines changed

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
1+
---
2+
contributor: max
3+
date: '2016-10-26T10:57:29+01:00'
4+
title: 'A Comparative Study of SYCL, OpenCL, and OpenMP'
5+
external_url: https://ieeexplore.ieee.org/document/7803697
6+
authors:
7+
- name: Hércules Cardoso da Silva
8+
affiliation: Inst. of Comput
9+
- name: Flávia Pisani
10+
affiliation: Institute of Computing
11+
- name: Edson Borin
12+
affiliation: Institute of Computing
13+
tags:
14+
- opencl
15+
- openmp
16+
- parallel
17+
- performance
18+
- evaluation
19+
---
20+
21+
Recent trends indicate that future computing systems will be composed of a group of heterogeneous computing devices,
22+
including CPUs, GPUs, and other hardware accelerators. These devices provide increased processing performance; however,
23+
creating efficient code for them may require that programmers manage memory assignments and use specialized APIs,
24+
compilers, or runtime systems, thus making their programs dependent on specific tools. In this scenario, SYCL is an
25+
emerging C++ programming model for OpenCL that allows developers to write code for heterogeneous computing devices that
26+
are compatible with standard C++ compilation frameworks. In this paper, we analyze the performance and programming
27+
characteristics of SYCL, OpenMP, and OpenCL using both a benchmark and a real-world application. Our performance results
28+
indicate that programs that rely on available SYCL runtimes are not on par with the ones based on OpenMP and OpenCL yet.
29+
Nonetheless, the gap is getting smaller if we consider the results reported by previous studies. In terms of
30+
programmability, SYCL presents itself as a competitive alternative to OpenCL, requiring fewer lines of code to implement
31+
kernels and also fewer calls to essential API functions and methods.
@@ -0,0 +1,39 @@
1+
---
2+
contributor: scott
3+
date: '2018-07-10T08:08:10.490000+00:00'
4+
title: "Solving Maxwell's Equations with Modern C++ and SYCL: A Case Study"
5+
external_url: https://ieeexplore.ieee.org/document/8445127
6+
authors:
7+
- name: Ayesha Afzal
8+
affiliation: Friedrich-Alexander University Erlangen-Nürnberg
9+
- name: Christian Schmitt
10+
affiliation: Friedrich-Alexander University Erlangen-Nürnberg
11+
- name: Samer Alhaddad
12+
affiliation: Paderborn University
13+
- name: Yevgen Grynko
14+
affiliation: Paderborn University
15+
- name: Jürgen Teich
16+
affiliation: Friedrich-Alexander University Erlangen-Nürnberg
17+
- name: Jens Förstner
18+
affiliation: Paderborn University
19+
- name: Frank Hannig
20+
affiliation: Friedrich-Alexander University Erlangen-Nürnberg
21+
tags:
22+
- maxwell
23+
- c++
24+
- case-study
25+
---
26+
27+
In scientific computing, unstructured meshes are a crucial foundation for the simulation of real-world physical
28+
phenomena. Compared to regular grids, they allow resembling the computational domain with a much higher accuracy, which
29+
in turn leads to more efficient computations. There exists a wealth of supporting libraries and frameworks that aid
30+
programmers with the implementation of applications working on such grids, each built on top of existing parallelization
31+
technologies. However, many approaches require the programmer to introduce a different programming paradigm into their
32+
application or provide different variants of the code. SYCL is a new programming standard providing a remedy to this
33+
dilemma by building on standard C++17 with its so-called single-source approach: Programmers write standard C++ code
34+
and expose parallelism using C++17 keywords. The application is then transformed into a concrete implementation by the
35+
SYCL implementation. By encapsulating the OpenCL ecosystem, different SYCL implementations enable not only the
36+
programming of CPUs but also of heterogeneous platforms such as GPUs or other devices. For the first time, this paper
37+
showcases a SYCL-based solver for the nodal Discontinuous Galerkin method for Maxwell's equations on unstructured
38+
meshes. We compare our solution to a previous C-based implementation with respect to programmability and performance on
39+
heterogeneous platforms.
Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
1+
---
2+
contributor: scott
3+
date: '2019-11-18T10:57:29+01:00'
4+
title: Evaluation of Medical Imaging Applications using SYCL
5+
external_url: https://ieeexplore.ieee.org/document/8982983
6+
authors:
7+
- name: Zheming Jin
8+
affiliation: Argonne National Laboratory
9+
- name: Hal Finkel
10+
affiliation: Argonne National Laboratory
11+
tags:
12+
- benchmark
13+
- performance
14+
- medical
15+
- rodinia
16+
- imaging
17+
---
18+
19+
As opposed to the Open Computing Language (OpenCL) programming model in which host and device codes are written in
20+
different languages, the SYCL programming model can combine host and device codes for an application in a type-safe way
21+
to improve development productivity. In this paper, we chose two medical imaging applications (Heart Wall and Particle
22+
Filter) in the Rodinia benchmark suite to study the performance and programming productivity of the SYCL programming
23+
model. More specifically, we introduced the SYCL programming model, shared our experience of implementing the
24+
applications using SYCL, and compared the performance and programming portability of the SYCL implementations with the
25+
OpenCL implementations on an Intel® Xeon® CPU and an Iris® Pro integrated GPU. The results are promising. For the Heart
26+
Wall application, the SYCL implementation is on average 15% faster than the OpenCL implementation on the GPU. For the
27+
Particle Filter application, the SYCL implementation is 3% slower than the OpenCL implementation on the GPU, but it is
28+
75% faster on the CPU. Using lines of code as an indicator of programming productivity, the SYCL host program reduces
29+
the lines of code of the OpenCL host program by 52% and 38% for the Heart Wall and Particle Filter applications,
30+
respectively.
@@ -0,0 +1,33 @@
1+
---
2+
contributor: scott
3+
date: '2019-11-22T10:57:29+01:00'
4+
title: 'Performance Portability of a Wilson Dslash Stencil Operator Mini-App Using Kokkos and SYCL'
5+
external_url: https://ieeexplore.ieee.org/document/8945798
6+
authors:
7+
- name: Bálint Joó
8+
affiliation: Jefferson Lab
9+
- name: Thorsten Kurth
10+
affiliation: NERSC
11+
- name: M. A. Clark
12+
affiliation: NVIDIA
13+
- name: Jeongnim Kim
14+
affiliation: Intel Corporation
15+
- name: Christian Robert Trott
16+
affiliation: Sandia National Laboratories
17+
- name: Dan Ibanez
18+
affiliation: Sandia National Laboratories
19+
- name: Daniel Sunderland
20+
affiliation: Sandia National Laboratories
21+
- name: Jack Deslippe
22+
affiliation: NERSC
23+
tags:
24+
- kokkos
25+
- performance
26+
- portability
27+
- lattice-qcd
28+
---
29+
30+
We describe our experiences in creating mini-apps for the Wilson-Dslash stencil operator for Lattice Quantum
31+
Chromodynamics using the Kokkos and SYCL programming models. In particular we comment on the performance achieved on a
32+
variety of hardware architectures, limitations we have reached in both programming models and how these have been
33+
resolved by us, or may be resolved by the developers of these models.
Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
1+
---
2+
contributor: scott
3+
date: '2019-12-09T10:57:29+01:00'
4+
title: A Case Study of k-means Clustering using SYCL
5+
external_url: https://ieeexplore.ieee.org/document/9005555
6+
authors:
7+
- name: Zheming Jin
8+
affiliation: Argonne National Laboratory
9+
- name: Hal Finkel
10+
affiliation: Argonne National Laboratory
11+
tags:
12+
- benchmark
13+
- energy-consumption
14+
- programming-language
15+
- gpu
16+
- lowest-consumption
17+
- rodinia
18+
- minimum-distance
19+
- data-transfer
20+
- api
21+
- means-clustering
22+
- fuzzy-clustering
23+
- haswell
24+
- broadwell
25+
- skylake
26+
---
27+
28+
As opposed to the OpenCL programming model in which host and device codes are written in two programming languages, the
29+
SYCL programming model combines them for an application in a type-safe way to improve development productivity. As a
30+
popular cluster analysis algorithm, k-means has been implemented using programming models such as OpenMP, OpenCL, and
31+
CUDA. Developing a SYCL implementation of k-means as a case study allows us to have a better understanding of
32+
performance portability and programming productivity of the SYCL programming model. Specifically, we explained the
33+
k-means benchmark in Rodinia, described our efforts of porting the OpenCL k-means benchmark, and evaluated the
34+
performance of the OpenCL and SYCL implementations on the Intel® Haswell, Broadwell, and Skylake processors. We
35+
summarized the migration steps from OpenCL to SYCL, compiled the SYCL program using Codeplay and Intel® SYCL compilers,
36+
analyzed the SYCL and OpenCL programs using an open-source profiling tool which can intercept OpenCL runtime calls, and
37+
compared the performance of the implementations on Intel® CPUs and integrated GPU. The experimental results show that
38+
the SYCL version in which the kernels run on the GPU is 2% and 8% faster than the OpenCL version for the two large
39+
datasets. However, the OpenCL version is still much faster than the SYCL version on the CPUs. Compared to the Intel®
40+
Haswell and Skylake CPUs, running the k-means benchmark on the Intel® Broadwell low-power processor with a CPU and an
41+
integrated GPU can achieve the lowest energy consumption. In terms of programming productivity, the lines of code of the
42+
SYCL program are 51% fewer than those of the OpenCL program.
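
The minimum-distance step at the core of k-means can be illustrated with a short sketch. The code below is plain, serial C++ written for illustration only; it is not taken from the paper or from Rodinia, and the function name `assign_clusters` and the row-major data layout are assumptions. Each iteration of the outer loop is independent, which is why this step maps naturally onto one device work-item per point.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Assign each point to its nearest centroid using squared Euclidean distance.
// points:    n_points   * n_features values, row-major (assumed layout)
// centroids: n_clusters * n_features values, row-major (assumed layout)
// Returns one cluster index per point.
std::vector<int> assign_clusters(const std::vector<float>& points,
                                 const std::vector<float>& centroids,
                                 std::size_t n_features) {
    assert(n_features > 0);
    assert(points.size() % n_features == 0);
    assert(!centroids.empty() && centroids.size() % n_features == 0);

    const std::size_t n_points = points.size() / n_features;
    const std::size_t n_clusters = centroids.size() / n_features;
    std::vector<int> membership(n_points, 0);

    // Iterations over p are independent of each other; on a device, each
    // would typically be handled by a separate work-item.
    for (std::size_t p = 0; p < n_points; ++p) {
        float best = std::numeric_limits<float>::max();
        for (std::size_t c = 0; c < n_clusters; ++c) {
            float dist = 0.0f;
            for (std::size_t f = 0; f < n_features; ++f) {
                const float d =
                    points[p * n_features + f] - centroids[c * n_features + f];
                dist += d * d;
            }
            if (dist < best) {
                best = dist;
                membership[p] = static_cast<int>(c);
            }
        }
    }
    return membership;
}
```

Calling `assign_clusters` with three 2D points and two centroids returns a vector of three cluster indices, one per point.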
@@ -0,0 +1,29 @@
1+
---
2+
contributor: scott
3+
date: '2022-05-18T08:08:10.490000+00:00'
4+
title: 'A Case Study on the HACCmk Routine in SYCL on Integrated Graphics'
5+
external_url: https://ieeexplore.ieee.org/document/9150310
6+
authors:
7+
- name: Zheming Jin
8+
affiliation: Argonne National Laboratory
9+
- name: Vitali Morozov
10+
affiliation: Argonne National Laboratory
11+
- name: Hal Finkel
12+
affiliation: Argonne National Laboratory
13+
tags:
14+
- compute
15+
- haccmk
16+
- integrated-graphics
17+
- case-study
18+
---
19+
20+
As opposed to the Open Computing Language (OpenCL) programming model in which host and device codes are generally
21+
written in different languages, the SYCL programming model can combine host and device codes for an application in a
22+
type-safe way to improve development productivity. In this paper, we chose the HACCmk routine, a representative
23+
compute-bound kernel, as a case study on the performance of the SYCL programming model targeting a heterogeneous
24+
computing device. More specifically, we introduced the SYCL programming model, presented the OpenCL and SYCL
25+
implementations of the routine, and compared the performance of the two implementations using the offline and online
26+
compilation on Intel® Iris™ Pro integrated GPUs. We found that the overhead of online compilation may become
27+
significant compared to the execution time of a kernel. Compared to the performance of OpenCL implementations, the SYCL
28+
implementation can maintain the performance using the offline compilation. The number of execution units in a GPU is
29+
critical to improving the raw performance of a compute-bound kernel.
@@ -0,0 +1,25 @@
1+
---
2+
contributor: scott
3+
date: '2022-05-18T08:08:10.490000+00:00'
4+
title: 'Towards automated kernel selection in machine learning systems: A SYCL case study'
5+
external_url: https://ieeexplore.ieee.org/document/9150358
6+
authors:
7+
- name: John Lawson
8+
affiliation: Codeplay Software Ltd
9+
tags:
10+
- tuning
11+
- sycl
12+
- gpgpu
13+
- machine-learning
14+
- ai
15+
---
16+
17+
Automated tuning of compute kernels is a popular area of research, mainly focused on finding optimal kernel parameters
18+
for a problem with fixed input sizes. This approach is good for deploying machine learning models, where the network
19+
topology is constant, but machine learning research often involves changing network topologies and hyperparameters.
20+
Traditional kernel auto-tuning has limited impact in this case; a more general selection of kernels is required for
21+
libraries to accelerate machine learning research. In this paper we present initial results using machine learning to
22+
select kernels in a case study deploying high performance SYCL kernels in libraries that target a range of heterogeneous
23+
devices from desktop GPUs to embedded accelerators. The techniques investigated apply more generally and could similarly
24+
be integrated with other heterogeneous programming systems. By combining auto-tuning and machine learning these kernel
25+
selection processes can be deployed with little developer effort to achieve high performance on new hardware.
@@ -0,0 +1,31 @@
1+
---
2+
contributor: scott
3+
date: '2020-11-13T08:08:10.490000+00:00'
4+
title: 'Evaluating the Performance and Portability of Contemporary SYCL Implementations'
5+
external_url: https://ieeexplore.ieee.org/document/9309045
6+
authors:
7+
- name: Beau Johnston
8+
affiliation: Oak Ridge National Laboratory
9+
- name: Jeffrey S. Vetter
10+
affiliation: Oak Ridge National Laboratory
11+
- name: Josh Milthorpe
12+
affiliation: Australian National University
13+
tags:
14+
- benchmarks
15+
- performance
16+
- portability
17+
---
18+
19+
SYCL is a single-source programming model for heterogeneous systems; it promises improved maintainability, productivity,
20+
and opportunity for compiler optimization, when compared to accelerator specific programming models. Several
21+
implementations of the SYCL standard have been developed over the past few years, including several backends using
22+
contemporary accelerator languages, like OpenCL, CUDA, and HIP. These implementations vary widely in their support for
23+
specific features of the standard and in their performance. As SYCL grows in popularity, developers need to know how
24+
features are implemented across popular implementations in order to make proper design choices. In this paper, we
25+
evaluate the existing SYCL implementations for important SYCL features across a range of hardware in order to understand
26+
SYCL's performance and portability. This work uses the newest SYCL benchmark suite (SYCL-Bench, 38 kernels) to evaluate
27+
these four existing implementations, comparing support of language features across backends and highlighting feature
28+
completeness and performance. For features, we focus on the five major SYCL parallel constructs, using a motivating
29+
example of the matrix multiplication benchmark. Our results show that the basic data parallelism construct is the best
30+
choice for performance on current SYCL implementations, and we identify opportunities for improvement in several of the
31+
SYCL implementations.
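
To make the "basic data parallelism construct" concrete, here is a minimal sketch of matrix multiplication in which one logical work-item computes one output element. It is plain, serial C++ and does not use any SYCL API: the helper `for_each_index_2d` is a hypothetical stand-in for launching a kernel over a 2D index space, and the square row-major layout is an assumption made for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial stand-in for a basic data-parallel launch over a 2D index space.
// kernel(row, col) is invoked once per index. Hypothetical helper, not a
// SYCL API; a real runtime would be free to run the invocations in parallel.
template <typename Kernel>
void for_each_index_2d(std::size_t rows, std::size_t cols, Kernel kernel) {
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            kernel(r, c);
}

// C = A * B for square n x n row-major matrices.
// Each (row, col) invocation writes exactly one element of C, so the
// invocations do not depend on each other.
std::vector<float> matmul(const std::vector<float>& a,
                          const std::vector<float>& b,
                          std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    std::vector<float> c(n * n, 0.0f);
    for_each_index_2d(n, n, [&](std::size_t row, std::size_t col) {
        float sum = 0.0f;
        for (std::size_t k = 0; k < n; ++k)
            sum += a[row * n + k] * b[k * n + col];
        c[row * n + col] = sum;
    });
    return c;
}
```

The design point is that the kernel body only needs its own index to do its work, which is what lets an implementation schedule the index space freely.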
@@ -0,0 +1,25 @@
1+
---
2+
contributor: scott
3+
date: '2021-06-07T08:08:10.490000+00:00'
4+
title: 'Automatic Parallelization of Structured Mesh Computations with SYCL'
5+
external_url: https://ieeexplore.ieee.org/document/9555976
6+
authors:
7+
- name: Gábor Dániel Balogh
8+
affiliation: Pázmány Péter Catholic University
9+
- name: István Reguly
10+
affiliation: Pázmány Péter Catholic University
11+
tags:
12+
- parallel-programming
13+
- nvidia
14+
- intel
15+
---
16+
17+
Structured meshes are widely used for scientific computations such as Computational Fluid Dynamics (CFD) applications or
18+
finance. Modern applications often have grid points in the millions. To perform such computations parallelisation is
19+
crucial. However, it is unfeasible to port each application every time a new architecture arrives, hence in recent years
20+
the demand for automatic parallelisation and optimisation for the used hardware is increasing. The OPS (Oxford Parallel
21+
library for Structured mesh solvers) has shown good performance and scaling on a wide range of HPC architectures. This
22+
research aims to extend the OPS framework with a SYCL backend to extend the range of architectures that OPS can support
23+
and further increase Performance Portability of OPS applications. Intel oneAPI struggles with
24+
reductions due to high synchronisation cost, but shows promising performance gain on builtin reduction constructs on an
25+
Intel® Xeon® Gold 6226R. We compare the performance of hipSYCL on an NVIDIA V100 GPU to that of the CUDA implementations.
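
As a rough illustration of the kind of loop nest a structured-mesh framework parallelises, the sketch below applies one Jacobi sweep of a 5-point stencil to a 2D grid. It is plain, serial C++ and does not use the OPS API; the function name `jacobi_step`, the row-major layout, and the fixed-boundary treatment are assumptions chosen for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One Jacobi sweep of a 5-point stencil on an nx-by-ny structured grid
// stored row-major. Boundary points are copied unchanged (assumed
// fixed-value boundary); each interior point becomes the average of its
// four neighbours in the input grid.
std::vector<double> jacobi_step(const std::vector<double>& in,
                                std::size_t nx, std::size_t ny) {
    assert(in.size() == nx * ny);
    std::vector<double> out(in);  // start from a copy so boundaries carry over
    for (std::size_t j = 1; j + 1 < ny; ++j) {
        for (std::size_t i = 1; i + 1 < nx; ++i) {
            out[j * nx + i] = 0.25 * (in[j * nx + i - 1] + in[j * nx + i + 1] +
                                      in[(j - 1) * nx + i] + in[(j + 1) * nx + i]);
        }
    }
    return out;
}
```

Because every interior update reads only from `in` and writes only its own entry of `out`, the two loops can be distributed over threads or device work-items without synchronisation within a sweep.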
Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
1+
---
2+
contributor: scott
3+
date: '2021-11-14T08:08:10.490000+00:00'
4+
title: 'Benchmarking and Extending SYCL Hierarchical Parallelism'
5+
external_url: https://ieeexplore.ieee.org/document/9654235
6+
authors:
7+
- name: Tom Deakin
8+
affiliation: University of Bristol
9+
- name: Simon McIntosh-Smith
10+
affiliation: University of Bristol
11+
- name: Aksel Alpay
12+
affiliation: Universität Heidelberg
13+
- name: Vincent Heuveline
14+
affiliation: Universität Heidelberg
15+
tags:
16+
- benchmarks
17+
- extending
18+
- parallelism
19+
---
20+
21+
SYCL is an open-standard, parallel programming model for programming heterogeneous devices from Khronos. It allows
22+
single-source programming of diverse attached devices in a cross-platform manner in modern C++. SYCL provides different
23+
layers of parallel abstractions, including Single Instruction Multiple Thread (SIMT) kernels, data-parallel loop
24+
concurrency and hierarchical parallelism. We discuss Scoped Parallelism as an extension to the existing Hierarchical
25+
Parallelism in SYCL, and highlight the advantages and disadvantages of these models from the perspective of the
26+
programmer and an implementer of SYCL. In this paper, we compare writing benchmark programs using SIMT kernel,
27+
hierarchical parallelism and scoped parallelism paradigms, and present results running on a high-performance CPU and
28+
GPU.
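
The two-level structure behind hierarchical parallelism can be sketched as a pair of nested loops: an outer loop over work-groups and an inner loop over the work-items of each group. The example below is plain, serial C++ and is not SYCL code; `hierarchical_sum` and its `group_size` parameter are hypothetical names, and the variable `local` merely stands in for group-local storage.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Two-level sum: each group reduces its own slice into a partial result,
// then the partial results are combined.
double hierarchical_sum(const std::vector<double>& data,
                        std::size_t group_size) {
    assert(group_size > 0);
    const std::size_t n_groups = (data.size() + group_size - 1) / group_size;
    std::vector<double> partial(n_groups, 0.0);

    for (std::size_t g = 0; g < n_groups; ++g) {        // work-group level
        double local = 0.0;  // stands in for group-local storage
        for (std::size_t i = 0; i < group_size; ++i) {  // work-item level
            const std::size_t idx = g * group_size + i;
            if (idx < data.size())  // last group may be only partly filled
                local += data[idx];
        }
        partial[g] = local;
    }

    // Final combination of the per-group partial sums.
    double total = 0.0;
    for (double p : partial) total += p;
    return total;
}
```

In a SIMT-style kernel the programmer would instead write the body for a single work-item and synchronise explicitly within the group; the nested form above keeps the group-level and item-level scopes visible in the source.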
@@ -0,0 +1,44 @@
1+
---
2+
contributor: scott
3+
date: '2021-11-14T08:08:10.490000+00:00'
4+
title: 'Case Study of Using Kokkos and SYCL as Performance-Portable Frameworks for Milc-Dslash Benchmark on NVIDIA, AMD and Intel GPUs'
5+
external_url: https://ieeexplore.ieee.org/document/9652859
6+
authors:
7+
- name: Amanda S. Dufek
8+
affiliation: NERSC/LBNL
9+
- name: Rahulkumar Gayatri
10+
affiliation: NERSC/LBNL
11+
- name: Neil Mehta
12+
affiliation: NERSC/LBNL
13+
- name: Douglas Doerfler
14+
affiliation: NERSC/LBNL
15+
- name: Brandon Cook
16+
affiliation: NERSC/LBNL
17+
- name: Yasaman Ghadar
18+
affiliation: Argonne National Laboratory
19+
- name: Carleton DeTar
20+
affiliation: University of Utah
21+
tags:
22+
- kokkos
23+
- milc-dslash
24+
- performance
25+
- portability
26+
- nvidia
27+
- intel
28+
- amd
29+
---
30+
31+
Six of the top ten supercomputers in the TOP500 list from June 2021 rely on NVIDIA GPUs to achieve their peak compute
32+
bandwidth. With the announcement of Aurora, Frontier, and El Capitan, Intel and AMD have also entered the domain of
33+
providing GPUs for scientific computing. A consequence of the increased diversity in the GPU landscape is the emergence
34+
of portable programming models such as Kokkos, SYCL, OpenCL, and OpenMP, which allow application developers to maintain
35+
a single-source code across a diverse range of hardware architectures. While the portable frameworks try to optimize the
36+
compute resource usage on a given architecture, it is the programmer's responsibility to expose parallelism in an
37+
application that can take advantage of thousands of processing elements available on GPUs. In this paper, we introduce a
38+
GPU-friendly parallel implementation of Milc-Dslash that exposes multiple hierarchies of parallelism in the algorithm.
39+
Milc-Dslash was designed to serve as a benchmark with highly optimized matrix-vector multiplications to measure the
40+
resource utilization on the GPU systems. The parallel hierarchies in the Milc-Dslash algorithm are mapped onto a target
41+
hardware using Kokkos and SYCL programming models. We present the performance achieved by Kokkos and SYCL
42+
implementations of Milc-Dslash on NVIDIA A100 GPU, AMD MI100 GPU, and Intel Gen9 GPU. Additionally, we compare the
43+
Kokkos and SYCL performances with those obtained from the versions written in CUDA and HIP programming models on NVIDIA
44+
A100 GPU and AMD MI100 GPU, respectively.
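
The nested parallelism mentioned above can be pictured with a simplified sketch: an outer loop over lattice sites and, per site, a small complex matrix-vector product. The code is plain, serial C++ and is not the Milc-Dslash implementation. The 3x3 complex matrix size is an assumption based on common lattice QCD practice, a real Dslash operator also sums contributions from neighbouring sites in several directions, and the names `mat_vec` and `apply_links` are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using Complex = std::complex<double>;
using Vec3 = std::array<Complex, 3>;
using Mat3 = std::array<std::array<Complex, 3>, 3>;

// Inner level: 3x3 complex matrix times 3-component complex vector.
Vec3 mat_vec(const Mat3& m, const Vec3& v) {
    Vec3 out{};  // value-initialised to zeros
    for (std::size_t r = 0; r < 3; ++r)
        for (std::size_t c = 0; c < 3; ++c)
            out[r] += m[r][c] * v[c];
    return out;
}

// Outer level: one independent matrix-vector product per lattice site.
// Simplification: one matrix per site and no neighbour gathering.
std::vector<Vec3> apply_links(const std::vector<Mat3>& links,
                              const std::vector<Vec3>& in) {
    assert(links.size() == in.size());
    std::vector<Vec3> out(in.size());
    for (std::size_t s = 0; s < in.size(); ++s)
        out[s] = mat_vec(links[s], in[s]);
    return out;
}
```

The site loop, the row loop, and the column loop are the three levels a performance-portable framework could map onto different levels of the GPU hierarchy.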
