Add numerous research papers #3

Merged
merged 1 commit on Jul 11, 2024
@@ -0,0 +1,31 @@
---
contributor: max
date: '2016-10-26T10:57:29+01:00'
title: 'A Comparative Study of SYCL, OpenCL, and OpenMP'
external_url: https://ieeexplore.ieee.org/document/7803697
authors:
- name: Hércules Cardoso da Silva
affiliation: Institute of Computing
- name: Flávia Pisani
affiliation: Institute of Computing
- name: Edson Borin
affiliation: Institute of Computing
tags:
- opencl
- openmp
- parallel
- performance
- evaluation
---

Recent trends indicate that future computing systems will be composed of a group of heterogeneous computing devices,
including CPUs, GPUs, and other hardware accelerators. These devices provide increased processing performance; however,
creating efficient code for them may require that programmers manage memory assignments and use specialized APIs,
compilers, or runtime systems, thus making their programs dependent on specific tools. In this scenario, SYCL is an
emerging C++ programming model for OpenCL that allows developers to write code for heterogeneous computing devices that
is compatible with standard C++ compilation frameworks. In this paper, we analyze the performance and programming
characteristics of SYCL, OpenMP, and OpenCL using both a benchmark and a real-world application. Our performance results
indicate that programs relying on available SYCL runtimes are not yet on par with those based on OpenMP and OpenCL.
Nonetheless, the gap is narrowing relative to the results reported by previous studies. In terms of programmability,
SYCL presents itself as a competitive alternative to OpenCL, requiring fewer lines of code to implement kernels and
fewer calls to essential API functions and methods.
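To make the single-source point concrete, here is a minimal, hedged sketch of a SYCL vector addition. It is our own
illustration (not code from the paper) written against SYCL 2020 syntax rather than the 2016-era SYCL 1.2 interface the
study used; host setup and the device kernel live in one C++ source file, with no separate kernel string or explicit
`clEnqueue*` calls.

```cpp
// Minimal single-source SYCL sketch (illustrative only, not the paper's code).
#include <sycl/sycl.hpp>
#include <vector>

int main() {
  constexpr size_t n = 1024;
  std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

  sycl::queue q;  // default device: CPU, GPU, or another accelerator
  {
    // Buffers manage host/device data movement implicitly.
    sycl::buffer<float> bufA(a.data(), sycl::range<1>(n));
    sycl::buffer<float> bufB(b.data(), sycl::range<1>(n));
    sycl::buffer<float> bufC(c.data(), sycl::range<1>(n));

    q.submit([&](sycl::handler& h) {
      auto A = bufA.get_access<sycl::access::mode::read>(h);
      auto B = bufB.get_access<sycl::access::mode::read>(h);
      auto C = bufC.get_access<sycl::access::mode::write>(h);
      // The kernel is an ordinary C++ lambda compiled for the device.
      h.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) { C[i] = A[i] + B[i]; });
    });
  }  // leaving the scope writes the results back into c
  return 0;
}
```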
@@ -0,0 +1,39 @@
---
contributor: scott
date: '2018-07-10T08:08:10.490000+00:00'
title: 'Solving Maxwell''s Equations with Modern C++ and SYCL: A Case Study'
external_url: https://ieeexplore.ieee.org/document/8445127
authors:
- name: Ayesha Afzal
affiliation: Friedrich-Alexander University Erlangen-Nürnberg
- name: Christian Schmitt
affiliation: Friedrich-Alexander University Erlangen-Nürnberg
- name: Samer Alhaddad
affiliation: Paderborn University
- name: Yevgen Grynko
affiliation: Paderborn University
- name: Jürgen Teich
affiliation: Friedrich-Alexander University Erlangen-Nürnberg
- name: Jens Förstner
affiliation: Paderborn University
- name: Frank Hannig
affiliation: Friedrich-Alexander University Erlangen-Nürnberg
tags:
- maxwell
- c++
- case-study
---

In scientific computing, unstructured meshes are a crucial foundation for the simulation of real-world physical
phenomena. Compared to regular grids, they allow the computational domain to be represented with much higher accuracy,
which in turn leads to more efficient computations. There exists a wealth of supporting libraries and frameworks that
aid programmers with the implementation of applications working on such grids, each built on top of existing
parallelization technologies. However, many approaches require the programmer to introduce a different programming
paradigm into their application or to provide different variants of the code. SYCL is a new programming standard
providing a remedy to this dilemma by building on standard C++17 with its so-called single-source approach: programmers
write standard C++ code and expose parallelism using C++17 keywords. The application is then transformed into a concrete
implementation by the SYCL implementation. By encapsulating the OpenCL ecosystem, different SYCL implementations enable
not only the programming of CPUs but also of heterogeneous platforms such as GPUs or other devices. For the first time,
this paper showcases a SYCL-based solver for the nodal Discontinuous Galerkin method for Maxwell's equations on
unstructured meshes. We compare our solution to a previous C-based implementation with respect to programmability and
performance on heterogeneous platforms.
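The claim that one SYCL source can target CPUs and GPUs alike is usually expressed through device selection when the
queue is constructed. The snippet below is a small hedged sketch of that mechanism using SYCL 2020 selectors; it is our
own generic example, not part of the authors' solver.

```cpp
// Hedged sketch: selecting a device for a SYCL queue (SYCL 2020 selectors).
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  // Prefer a GPU; fall back to whatever device is available (e.g. the CPU).
  sycl::queue q;
  try {
    q = sycl::queue(sycl::gpu_selector_v);
  } catch (const sycl::exception&) {
    q = sycl::queue(sycl::default_selector_v);
  }
  std::cout << "Running on: "
            << q.get_device().get_info<sycl::info::device::name>() << "\n";

  // The same single-source kernel runs unchanged on whichever device was chosen.
  q.parallel_for(sycl::range<1>(16), [=](sycl::id<1> i) { /* device code */ }).wait();
  return 0;
}
```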
@@ -0,0 +1,30 @@
---
contributor: scott
date: '2019-11-18T10:57:29+01:00'
title: Evaluation of Medical Imaging Applications using SYCL
external_url: https://ieeexplore.ieee.org/document/8982983
authors:
- name: Zheming Jin
affiliation: Argonne National Laboratory
- name: Hal Finkel
affiliation: Argonne National Laboratory
tags:
- benchmark
- performance
- medical
- rodinia
- imaging
---

As opposed to the Open Computing Language (OpenCL) programming model in which host and device codes are written in
different languages, the SYCL programming model can combine host and device codes for an application in a type-safe way
to improve development productivity. In this paper, we chose two medical imaging applications (Heart Wall and Particle
Filter) in the Rodinia benchmark suite to study the performance and programming productivity of the SYCL programming
model. More specifically, we introduced the SYCL programming model, shared our experience of implementing the
applications using SYCL, and compared the performance and programming portability of the SYCL implementations with the
OpenCL implementations on an Intel® Xeon® CPU and an Iris® Pro integrated GPU. The results are promising. For the Heart
Wall application, the SYCL implementation is on average 15% faster than the OpenCL implementation on the GPU. For the
Particle Filter application, the SYCL implementation is 3% slower than the OpenCL implementation on the GPU, but it is
75% faster on the CPU. Using lines of code as an indicator of programming productivity, the SYCL host program reduces
the lines of code of the OpenCL host program by 52% and 38% for the Heart Wall and Particle Filter applications,
respectively.
@@ -0,0 +1,33 @@
---
contributor: scott
date: '2019-11-22T10:57:29+01:00'
title: 'Performance Portability of a Wilson Dslash Stencil Operator Mini-App Using Kokkos and SYCL'
external_url: https://ieeexplore.ieee.org/document/8945798
authors:
- name: Bálint Joó
affiliation: Jefferson Lab
- name: Thorsten Kurth
affiliation: NERSC
- name: M. A. Clark
affiliation: NVIDIA
- name: Jeongnim Kim
affiliation: Intel Corporation
- name: Christian Robert Trott
affiliation: Sandia National Laboratories
- name: Dan Ibanez
affiliation: Sandia National Laboratories
- name: Daniel Sunderland
affiliation: Sandia National Laboratories
- name: Jack Deslippe
affiliation: NERSC
tags:
- kokkos
- performance
- portability
- lattice-qcd
---

We describe our experiences in creating mini-apps for the Wilson-Dslash stencil operator for Lattice Quantum
Chromodynamics using the Kokkos and SYCL programming models. In particular, we comment on the performance achieved on a
variety of hardware architectures, the limitations we have reached in both programming models, and how these have been
resolved by us or may be resolved by the developers of these models.
@@ -0,0 +1,42 @@
---
contributor: scott
date: '2019-12-09T10:57:29+01:00'
title: A Case Study of k-means Clustering using SYCL
external_url: https://ieeexplore.ieee.org/document/9005555
authors:
- name: Zheming Jin
affiliation: Argonne National Laboratory
- name: Hal Finkel
affiliation: Argonne National Laboratory
tags:
- benchmark
- energy-consumption
- programming-language
- gpu
- lowest-consumption
- rodinia
- minimum-distance
- data-transfer
- api
- k-means-clustering
- fuzzy-clustering
- haswell
- broadwell
- skylake
---

As opposed to the OpenCL programming model in which host and device codes are written in two programming languages, the
SYCL programming model combines them for an application in a type-safe way to improve development productivity. As a
popular cluster analysis algorithm, k-means has been implemented using programming models such as OpenMP, OpenCL, and
CUDA. Developing a SYCL implementation of k-means as a case study allows us to better understand the performance
portability and programming productivity of the SYCL programming model. Specifically, we explained the k-means benchmark
in Rodinia, described our efforts in porting the OpenCL k-means benchmark, and evaluated the performance of the OpenCL
and SYCL implementations on the Intel® Haswell, Broadwell, and Skylake processors. We summarized the migration steps from
OpenCL to SYCL, compiled the SYCL program using the Codeplay and Intel® SYCL compilers, analyzed the SYCL and OpenCL
programs using an open-source profiling tool that can intercept OpenCL runtime calls, and compared the performance of the
implementations on Intel® CPUs and integrated GPUs. The experimental results show that the SYCL version in which the
kernels run on the GPU is 2% and 8% faster than the OpenCL version for the two large datasets. However, the OpenCL
version is still much faster than the SYCL version on the CPUs. Compared to the Intel® Haswell and Skylake CPUs, running
the k-means benchmark on the Intel® Broadwell low-power processor with a CPU and an integrated GPU achieves the lowest
energy consumption. In terms of programming productivity, the SYCL program has 51% fewer lines of code than the OpenCL
program.
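As a rough illustration of the kind of kernel being ported, the sketch below shows the k-means assignment step (each
point finds its nearest centroid) written as a SYCL kernel. This is our own hedged example in SYCL 2020 syntax, not the
Rodinia code or the authors' port; the function name, buffer layout, and parameters are assumptions for illustration.

```cpp
// Hedged sketch of a k-means assignment kernel in SYCL (not the Rodinia port).
#include <sycl/sycl.hpp>
#include <limits>

void assign_labels(sycl::queue& q,
                   sycl::buffer<float, 1>& points,     // n * dims, row-major
                   sycl::buffer<float, 1>& centroids,  // k * dims, row-major
                   sycl::buffer<int, 1>& labels,       // n
                   size_t n, size_t k, size_t dims) {
  q.submit([&](sycl::handler& h) {
    auto p = points.get_access<sycl::access::mode::read>(h);
    auto c = centroids.get_access<sycl::access::mode::read>(h);
    auto l = labels.get_access<sycl::access::mode::write>(h);
    // One work-item per data point.
    h.parallel_for(sycl::range<1>(n), [=](sycl::id<1> idx) {
      const size_t i = idx[0];
      float best = std::numeric_limits<float>::max();
      int best_centroid = 0;
      for (size_t j = 0; j < k; ++j) {
        float dist = 0.0f;
        for (size_t d = 0; d < dims; ++d) {
          const float diff = p[i * dims + d] - c[j * dims + d];
          dist += diff * diff;  // squared Euclidean distance
        }
        if (dist < best) { best = dist; best_centroid = static_cast<int>(j); }
      }
      l[i] = best_centroid;  // index of the nearest centroid
    });
  });
}
```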
@@ -0,0 +1,29 @@
---
contributor: scott
date: '2022-05-18T08:08:10.490000+00:00'
title: 'A Case Study on the HACCmk Routine in SYCL on Integrated Graphics'
external_url: https://ieeexplore.ieee.org/document/9150310
authors:
- name: Zheming Jin
affiliation: Argonne National Laboratory
- name: Vitali Morozov
affiliation: Argonne National Laboratory
- name: Hal Finkel
affiliation: Argonne National Laboratory
tags:
- compute
- haccmk
- integrated-graphics
- case-study
---

As opposed to the Open Computing Language (OpenCL) programming model in which host and device codes are generally
written in different languages, the SYCL programming model can combine host and device codes for an application in a
type-safe way to improve development productivity. In this paper, we chose the HACCmk routine, a representative
compute-bound kernel, as a case study on the performance of the SYCL programming model targeting a heterogeneous
computing device. More specifically, we introduced the SYCL programming model, presented the OpenCL and SYCL
implementations of the routine, and compared the performance of the two implementations using offline and online
compilation on Intel® Iris™ Pro integrated GPUs. We found that the overhead of online compilation may become
significant compared to the execution time of a kernel. Compared to the OpenCL implementations, the SYCL implementation
can maintain performance when using offline compilation. The number of execution units in a GPU is critical to improving
the raw performance of a compute-bound kernel.
@@ -0,0 +1,25 @@
---
contributor: scott
date: '2022-05-18T08:08:10.490000+00:00'
title: 'Towards automated kernel selection in machine learning systems: A SYCL case study'
external_url: https://ieeexplore.ieee.org/document/9150358
authors:
- name: John Lawson
affiliation: Codeplay Software Ltd
tags:
- tuning
- sycl
- gpgpu
- machine-learning
- ai
---

Automated tuning of compute kernels is a popular area of research, mainly focused on finding optimal kernel parameters
for a problem with fixed input sizes. This approach is good for deploying machine learning models, where the network
topology is constant, but machine learning research often involves changing network topologies and hyperparameters.
Traditional kernel auto-tuning has limited impact in this case; a more general selection of kernels is required for
libraries to accelerate machine learning research. In this paper we present initial results using machine learning to
select kernels in a case study deploying high performance SYCL kernels in libraries that target a range of heterogeneous
devices from desktop GPUs to embedded accelerators. The techniques investigated apply more generally and could similarly
be integrated with other heterogeneous programming systems. By combining auto-tuning and machine learning, these kernel
selection processes can be deployed with little developer effort to achieve high performance on new hardware.
@@ -0,0 +1,31 @@
---
contributor: scott
date: '2020-11-13T08:08:10.490000+00:00'
title: 'Evaluating the Performance and Portability of Contemporary SYCL Implementations'
external_url: https://ieeexplore.ieee.org/document/9309045
authors:
- name: Beau Johnston
affiliation: Oak Ridge National Laboratory
- name: Jeffrey S. Vetter
affiliation: Oak Ridge National Laboratory
- name: Josh Milthorpe
affiliation: Australian National University
tags:
- benchmarks
- performance
- portability
---

SYCL is a single-source programming model for heterogeneous systems; it promises improved maintainability, productivity,
and opportunity for compiler optimization when compared to accelerator-specific programming models. Several
implementations of the SYCL standard have been developed over the past few years, including backends built on
contemporary accelerator languages such as OpenCL, CUDA, and HIP. These implementations vary widely in their support for
specific features of the standard and in their performance. As SYCL grows in popularity, developers need to know how
features are implemented across popular implementations in order to make proper design choices. In this paper, we
evaluate existing SYCL implementations for important SYCL features across a range of hardware in order to understand
SYCL's performance and portability. This work uses the newest SYCL benchmark suite (SYCL-Bench, 38 kernels) to evaluate
four existing implementations, comparing support for language features across backends and highlighting feature
completeness and performance. For features, we focus on the five major SYCL parallel constructs, using the matrix
multiplication benchmark as a motivating example. Our results show that the basic data parallelism construct is the best
choice for performance on current SYCL implementations, and we identify opportunities for improvement in several of the
SYCL implementations.
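For readers unfamiliar with the constructs being compared, the sketch below shows what the "basic data parallelism"
style looks like for a matrix multiplication: one work-item per output element, launched over a plain `sycl::range`
with no explicit work-group sizes, barriers, or local memory. It is our own minimal illustration, not a SYCL-Bench
kernel, and the function name and buffer shapes are assumptions.

```cpp
// Hedged sketch: naive matrix multiplication via basic data parallelism.
#include <sycl/sycl.hpp>

void matmul_basic(sycl::queue& q,
                  sycl::buffer<float, 2>& A,  // N x N
                  sycl::buffer<float, 2>& B,  // N x N
                  sycl::buffer<float, 2>& C,  // N x N
                  size_t N) {
  q.submit([&](sycl::handler& h) {
    auto a = A.get_access<sycl::access::mode::read>(h);
    auto b = B.get_access<sycl::access::mode::read>(h);
    auto c = C.get_access<sycl::access::mode::write>(h);
    // Basic data parallelism: one work-item per output element; the runtime
    // chooses the work-group decomposition.
    h.parallel_for(sycl::range<2>(N, N), [=](sycl::id<2> idx) {
      float sum = 0.0f;
      for (size_t k = 0; k < N; ++k)
        sum += a[idx[0]][k] * b[k][idx[1]];
      c[idx] = sum;
    });
  });
}
```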
@@ -0,0 +1,25 @@
---
contributor: scott
date: '2021-06-07T08:08:10.490000+00:00'
title: 'Automatic Parallelization of Structured Mesh Computations with SYCL'
external_url: https://ieeexplore.ieee.org/document/9555976
authors:
- name: Gábor Dániel Balogh
affiliation: Pázmány Péter Catholic University
- name: István Reguly
affiliation: Pázmány Péter Catholic University
tags:
- parallel-programming
- nvidia
- intel
---

Structured meshes are widely used for scientific computations such as Computational Fluid Dynamics (CFD) applications or
finance. Modern applications often have grid points in the millions, so parallelisation is crucial to perform such
computations. However, it is infeasible to port each application every time a new architecture arrives; hence, in recent
years the demand for automatic parallelisation and optimisation for the target hardware has been increasing. OPS (the
Oxford Parallel library for Structured mesh solvers) has shown good performance and scaling on a wide range of HPC
architectures. This research aims to extend the OPS framework with a SYCL backend to broaden the range of architectures
that OPS can support and further increase the performance portability of OPS applications. The Intel oneAPI
implementation struggles with reductions due to high synchronisation costs, but shows promising performance gains with
built-in reduction constructs on an Intel® Xeon® Gold 6226R. We compare the performance of hipSYCL on an NVIDIA V100 GPU
to the CUDA implementations.
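The built-in reduction constructs referred to above are the SYCL 2020 `sycl::reduction` interface, which lets the
runtime choose a reduction strategy instead of the programmer hand-writing work-group tree reductions. The sketch below
is a hedged, generic example of that interface (a simple sum over a field), not OPS backend code; the function name and
buffer handling are our assumptions.

```cpp
// Hedged sketch of a SYCL 2020 built-in reduction (sum of a field).
#include <sycl/sycl.hpp>

float sum_field(sycl::queue& q, const float* field, size_t n) {
  float result = 0.0f;
  {
    sycl::buffer<float> data(field, sycl::range<1>(n));   // read-only input
    sycl::buffer<float> out(&result, sycl::range<1>(1));  // scalar result
    q.submit([&](sycl::handler& h) {
      auto in = data.get_access<sycl::access::mode::read>(h);
      // The runtime picks the reduction strategy for the target device.
      auto red = sycl::reduction(out, h, sycl::plus<float>());
      h.parallel_for(sycl::range<1>(n), red,
                     [=](sycl::id<1> i, auto& sum) { sum += in[i]; });
    });
  }  // buffer destruction writes the reduced value back into result
  return result;
}
```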
@@ -0,0 +1,28 @@
---
contributor: scott
date: '2021-11-14T08:08:10.490000+00:00'
title: 'Benchmarking and Extending SYCL Hierarchical Parallelism'
external_url: https://ieeexplore.ieee.org/document/9654235
authors:
- name: Tom Deakin
affiliation: University of Bristol
- name: Simon McIntosh-Smith
affiliation: University of Bristol
- name: Aksel Alpay
affiliation: Universität Heidelberg
- name: Vincent Heuveline
affiliation: Universität Heidelberg
tags:
- benchmarks
- extending
- parallelism
---

SYCL is an open-standard parallel programming model from Khronos for programming heterogeneous devices. It allows
single-source programming of diverse attached devices in a cross-platform manner in modern C++. SYCL provides different
layers of parallel abstractions, including Single Instruction Multiple Thread (SIMT) kernels, data-parallel loop
concurrency, and hierarchical parallelism. We discuss Scoped Parallelism as an extension to the existing Hierarchical
Parallelism in SYCL, and highlight the advantages and disadvantages of these models from the perspective of the
programmer and of an implementer of SYCL. In this paper, we compare writing benchmark programs using the SIMT kernel,
hierarchical parallelism, and scoped parallelism paradigms, and present results running on a high-performance CPU and
GPU.
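For context, the existing hierarchical parallelism that the paper builds on is expressed in SYCL through
`parallel_for_work_group` and `parallel_for_work_item`: the outer lambda runs logically once per work-group and the
inner one once per work-item. The following is a minimal hedged sketch of that API on a toy kernel; it is not one of
the paper's benchmarks nor its proposed scoped-parallelism extension, and the function name is ours.

```cpp
// Hedged sketch of SYCL hierarchical parallelism (toy element-wise scaling).
#include <sycl/sycl.hpp>

void scale_hierarchical(sycl::queue& q, sycl::buffer<float, 1>& buf,
                        size_t n_groups, size_t group_size) {
  q.submit([&](sycl::handler& h) {
    auto acc = buf.get_access<sycl::access::mode::read_write>(h);
    // Outer lambda: executes logically once per work-group.
    h.parallel_for_work_group(
        sycl::range<1>(n_groups), sycl::range<1>(group_size),
        [=](sycl::group<1> g) {
          // Code here is at work-group scope (e.g. staging into local memory).
          // Inner lambda: executes once per work-item in the group.
          g.parallel_for_work_item([&](sycl::h_item<1> item) {
            const size_t i = item.get_global_id(0);
            acc[i] *= 2.0f;
          });
        });
  });
}
```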
@@ -0,0 +1,44 @@
---
contributor: scott
date: '2021-11-14T08:08:10.490000+00:00'
title: 'Case Study of Using Kokkos and SYCL as Performance-Portable Frameworks for Milc-Dslash Benchmark on NVIDIA, AMD and Intel GPUs'
external_url: https://ieeexplore.ieee.org/document/9652859
authors:
- name: Amanda S. Dufek
affiliation: NERSC/LBNL
- name: Rahulkumar Gayatri
affiliation: NERSC/LBNL
- name: Neil Mehta
affiliation: NERSC/LBNL
- name: Douglas Doerfler
affiliation: NERSC/LBNL
- name: Brandon Cook
affiliation: NERSC/LBNL
- name: Yasaman Ghadar
affiliation: Argonne National Laboratory
- name: Carleton DeTar
affiliation: University of Utah
tags:
- kokkos
- milc-dslash
- performance
- portability
- nvidia
- intel
- amd
---

Six of the top ten supercomputers in the TOP500 list from June 2021 rely on NVIDIA GPUs to achieve their peak compute
bandwidth. With the announcement of Aurora, Frontier, and El Capitan, Intel and AMD have also entered the domain of
providing GPUs for scientific computing. A consequence of the increased diversity in the GPU landscape is the emergence
of portable programming models such as Kokkos, SYCL, OpenCL, and OpenMP, which allow application developers to maintain
a single-source code across a diverse range of hardware architectures. While the portable frameworks try to optimize the
compute resource usage on a given architecture, it is the programmer's responsibility to expose parallelism in an
application that can take advantage of the thousands of processing elements available on GPUs. In this paper, we
introduce a GPU-friendly parallel implementation of Milc-Dslash that exposes multiple hierarchies of parallelism in the
algorithm. Milc-Dslash was designed to serve as a benchmark with highly optimized matrix-vector multiplications to
measure resource utilization on GPU systems. The parallel hierarchies in the Milc-Dslash algorithm are mapped onto the
target hardware using the Kokkos and SYCL programming models. We present the performance achieved by the Kokkos and SYCL
implementations of Milc-Dslash on an NVIDIA A100 GPU, an AMD MI100 GPU, and an Intel Gen9 GPU. Additionally, we compare
the Kokkos and SYCL performance with that of versions written in the CUDA and HIP programming models on the NVIDIA A100
GPU and AMD MI100 GPU, respectively.
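To give a flavour of what mapping parallel hierarchies onto hardware can look like in SYCL, the hedged sketch below maps
an outer lattice-site level to work-groups and an inner colour-index level to work-items through an `nd_range`. This is
our own simplified illustration with made-up names and a placeholder body, not the authors' code; the real Milc-Dslash
kernels apply SU(3) matrix-vector multiplications at each site.

```cpp
// Hedged illustration: two levels of parallelism mapped onto a SYCL nd_range
// (lattice sites as work-groups, colour indices as work-items within a group).
#include <sycl/sycl.hpp>

void site_color_kernel(sycl::queue& q, sycl::buffer<float, 1>& field,
                       size_t n_sites, size_t n_colors) {
  q.submit([&](sycl::handler& h) {
    auto f = field.get_access<sycl::access::mode::read_write>(h);
    sycl::nd_range<1> ndr(sycl::range<1>(n_sites * n_colors),  // global size
                          sycl::range<1>(n_colors));           // work-group size
    h.parallel_for(ndr, [=](sycl::nd_item<1> it) {
      const size_t site  = it.get_group(0);     // outer hierarchy: lattice site
      const size_t color = it.get_local_id(0);  // inner hierarchy: colour index
      // Placeholder update; a real Dslash kernel would apply an SU(3)
      // matrix-vector product to the spinor components at this site.
      f[site * n_colors + color] *= 2.0f;
    });
  });
}
```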