Add numerous research papers #3

Merged
merged 1 commit on Jul 11, 2024
@@ -0,0 +1,31 @@
---
contributor: max
date: '2016-10-26T10:57:29+01:00'
title: 'A Comparative Study of SYCL, OpenCL, and OpenMP'
external_url: https://ieeexplore.ieee.org/document/7803697
authors:
- name: Hércules Cardoso da Silva
affiliation: Institute of Computing
- name: Flávia Pisani
affiliation: Institute of Computing
- name: Edson Borin
affiliation: Institute of Computing
tags:
- opencl
- openmp
- parallel
- performance
- evaluation
---

Recent trends indicate that future computing systems will be composed of a group of heterogeneous computing devices,
including CPUs, GPUs, and other hardware accelerators. These devices provide increased processing performance; however,
creating efficient code for them may require that programmers manage memory assignments and use specialized APIs,
compilers, or runtime systems, thus making their programs dependent on specific tools. In this scenario, SYCL is an
emerging C++ programming model for OpenCL that allows developers to write code for heterogeneous computing devices that
is compatible with standard C++ compilation frameworks. In this paper, we analyze the performance and programming
characteristics of SYCL, OpenMP, and OpenCL using both a benchmark and a real-world application. Our performance results
indicate that programs relying on available SYCL runtimes are not yet on par with those based on OpenMP and OpenCL.
Nonetheless, the gap is narrowing relative to the results reported by previous studies. In terms of programmability,
SYCL presents itself as a competitive alternative to OpenCL, requiring fewer lines of code to implement kernels and
fewer calls to essential API functions and methods.
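To make the single-source point concrete, here is a minimal, hedged sketch of a SYCL vector addition. It is our own
illustration (not code from the paper) written against SYCL 2020 syntax rather than the 2016-era SYCL 1.2 interface the
study used; host setup and the device kernel live in one C++ source file, with no separate kernel string or explicit
`clEnqueue*` calls.

```cpp
// Minimal single-source SYCL sketch (illustrative only, not the paper's code).
#include <sycl/sycl.hpp>
#include <vector>

int main() {
  constexpr size_t n = 1024;
  std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

  sycl::queue q;  // default device: CPU, GPU, or another accelerator
  {
    // Buffers manage host/device data movement implicitly.
    sycl::buffer<float> bufA(a.data(), sycl::range<1>(n));
    sycl::buffer<float> bufB(b.data(), sycl::range<1>(n));
    sycl::buffer<float> bufC(c.data(), sycl::range<1>(n));

    q.submit([&](sycl::handler& h) {
      auto A = bufA.get_access<sycl::access::mode::read>(h);
      auto B = bufB.get_access<sycl::access::mode::read>(h);
      auto C = bufC.get_access<sycl::access::mode::write>(h);
      // The kernel is an ordinary C++ lambda compiled for the device.
      h.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) { C[i] = A[i] + B[i]; });
    });
  }  // leaving the scope writes the results back into c
  return 0;
}
```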
@@ -0,0 +1,39 @@
---
contributor: scott
date: '2018-07-10T08:08:10.490000+00:00'
title: 'Solving Maxwell''s Equations with Modern C++ and SYCL: A Case Study'
external_url: https://ieeexplore.ieee.org/document/8445127
authors:
- name: Ayesha Afzal
affiliation: Friedrich-Alexander University Erlangen-Nürnberg
- name: Christian Schmitt
affiliation: Friedrich-Alexander University Erlangen-Nürnberg
- name: Samer Alhaddad
affiliation: Paderborn University
- name: Yevgen Grynko
affiliation: Paderborn University
- name: Jürgen Teich
affiliation: Friedrich-Alexander University Erlangen-Nürnberg
- name: Jens Förstner
affiliation: Paderborn University
- name: Frank Hannig
affiliation: Friedrich-Alexander University Erlangen-Nürnberg
tags:
- maxwell
- c++
- case-study
---

In scientific computing, unstructured meshes are a crucial foundation for the simulation of real-world physical
phenomena. Compared to regular grids, they allow the computational domain to be represented with much higher accuracy,
which in turn leads to more efficient computations. There exists a wealth of supporting libraries and frameworks that
aid programmers with the implementation of applications working on such grids, each built on top of existing
parallelization technologies. However, many approaches require the programmer to introduce a different programming
paradigm into their application or to provide different variants of the code. SYCL is a new programming standard
providing a remedy to this dilemma by building on standard C++17 with its so-called single-source approach: programmers
write standard C++ code and expose parallelism using C++17 keywords. The application is then transformed into a concrete
implementation by the SYCL implementation. By encapsulating the OpenCL ecosystem, different SYCL implementations enable
not only the programming of CPUs but also of heterogeneous platforms such as GPUs or other devices. For the first time,
this paper showcases a SYCL-based solver for the nodal Discontinuous Galerkin method for Maxwell's equations on
unstructured meshes. We compare our solution to a previous C-based implementation with respect to programmability and
performance on heterogeneous platforms.
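The claim that one SYCL source can target CPUs and GPUs alike is usually expressed through device selection when the
queue is constructed. The snippet below is a small hedged sketch of that mechanism using SYCL 2020 selectors; it is our
own generic example, not part of the authors' solver.

```cpp
// Hedged sketch: selecting a device for a SYCL queue (SYCL 2020 selectors).
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  // Prefer a GPU; fall back to whatever device is available (e.g. the CPU).
  sycl::queue q;
  try {
    q = sycl::queue(sycl::gpu_selector_v);
  } catch (const sycl::exception&) {
    q = sycl::queue(sycl::default_selector_v);
  }
  std::cout << "Running on: "
            << q.get_device().get_info<sycl::info::device::name>() << "\n";

  // The same single-source kernel runs unchanged on whichever device was chosen.
  q.parallel_for(sycl::range<1>(16), [=](sycl::id<1> i) { /* device code */ }).wait();
  return 0;
}
```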
@@ -0,0 +1,30 @@
---
contributor: scott
date: '2019-11-18T10:57:29+01:00'
title: Evaluation of Medical Imaging Applications using SYCL
external_url: https://ieeexplore.ieee.org/document/8982983
authors:
- name: Zheming Jin
affiliation: Argonne National Laboratory
- name: Hal Finkel
affiliation: Argonne National Laboratory
tags:
- benchmark
- performance
- medical
- rodinia
- imaging
---

As opposed to the Open Computing Language (OpenCL) programming model in which host and device codes are written in
different languages, the SYCL programming model can combine host and device codes for an application in a type-safe way
to improve development productivity. In this paper, we chose two medical imaging applications (Heart Wall and Particle
Filter) in the Rodinia benchmark suite to study the performance and programming productivity of the SYCL programming
model. More specifically, we introduced the SYCL programming model, shared our experience of implementing the
applications using SYCL, and compared the performance and programming portability of the SYCL implementations with the
OpenCL implementations on an Intel® Xeon® CPU and an Iris® Pro integrated GPU. The results are promising. For the Heart
Wall application, the SYCL implementation is on average 15% faster than the OpenCL implementation on the GPU. For the
Particle Filter application, the SYCL implementation is 3% slower than the OpenCL implementation on the GPU, but it is
75% faster on the CPU. Using lines of code as an indicator of programming productivity, the SYCL host program reduces
the lines of code of the OpenCL host program by 52% and 38% for the Heart Wall and Particle Filter applications,
respectively.
@@ -0,0 +1,33 @@
---
contributor: scott
date: '2019-11-22T10:57:29+01:00'
title: 'Performance Portability of a Wilson Dslash Stencil Operator Mini-App Using Kokkos and SYCL'
external_url: https://ieeexplore.ieee.org/document/8945798
authors:
- name: Bálint Joó
affiliation: Jefferson Lab
- name: Thorsten Kurth
affiliation: NERSC
- name: M. A. Clark
affiliation: NVIDIA
- name: Jeongnim Kim
affiliation: Intel Corporation
- name: Christian Robert Trott
affiliation: Sandia National Laboratories
- name: Dan Ibanez
affiliation: Sandia National Laboratories
- name: Daniel Sunderland
affiliation: Sandia National Laboratories
- name: Jack Deslippe
affiliation: NERSC
tags:
- kokkos
- performance
- portability
- lattice-qcd
---

We describe our experiences in creating mini-apps for the Wilson-Dslash stencil operator for Lattice Quantum
Chromodynamics using the Kokkos and SYCL programming models. In particular, we comment on the performance achieved on a
variety of hardware architectures, the limitations we have reached in both programming models, and how these have been
resolved by us or may be resolved by the developers of these models.
@@ -0,0 +1,42 @@
---
contributor: scott
date: '2019-12-09T10:57:29+01:00'
title: A Case Study of k-means Clustering using SYCL
external_url: https://ieeexplore.ieee.org/document/9005555
authors:
- name: Zheming Jin
affiliation: Argonne National Laboratory
- name: Hal Finkel
affiliation: Argonne National Laboratory
tags:
- benchmark
- energy-consumption
- programming-language
- gpu
- lowest-consumption
- rodinia
- minimum-distance
- data-transfer
- api
- k-means-clustering
- fuzzy-clustering
- haswell
- broadwell
- skylake
---

As opposed to the OpenCL programming model in which host and device codes are written in two programming languages, the
SYCL programming model combines them for an application in a type-safe way to improve development productivity. As a
popular cluster analysis algorithm, k-means has been implemented using programming models such as OpenMP, OpenCL, and
CUDA. Developing a SYCL implementation of k-means as a case study allows us to better understand the performance
portability and programming productivity of the SYCL programming model. Specifically, we explained the k-means benchmark
in Rodinia, described our efforts in porting the OpenCL k-means benchmark, and evaluated the performance of the OpenCL
and SYCL implementations on the Intel® Haswell, Broadwell, and Skylake processors. We summarized the migration steps from
OpenCL to SYCL, compiled the SYCL program using the Codeplay and Intel® SYCL compilers, analyzed the SYCL and OpenCL
programs using an open-source profiling tool that can intercept OpenCL runtime calls, and compared the performance of the
implementations on Intel® CPUs and integrated GPUs. The experimental results show that the SYCL version in which the
kernels run on the GPU is 2% and 8% faster than the OpenCL version for the two large datasets. However, the OpenCL
version is still much faster than the SYCL version on the CPUs. Compared to the Intel® Haswell and Skylake CPUs, running
the k-means benchmark on the Intel® Broadwell low-power processor with a CPU and an integrated GPU achieves the lowest
energy consumption. In terms of programming productivity, the SYCL program has 51% fewer lines of code than the OpenCL
program.
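As a rough illustration of the kind of kernel being ported, the sketch below shows the k-means assignment step (each
point finds its nearest centroid) written as a SYCL kernel. This is our own hedged example in SYCL 2020 syntax, not the
Rodinia code or the authors' port; the function name, buffer layout, and parameters are assumptions for illustration.

```cpp
// Hedged sketch of a k-means assignment kernel in SYCL (not the Rodinia port).
#include <sycl/sycl.hpp>
#include <limits>

void assign_labels(sycl::queue& q,
                   sycl::buffer<float, 1>& points,     // n * dims, row-major
                   sycl::buffer<float, 1>& centroids,  // k * dims, row-major
                   sycl::buffer<int, 1>& labels,       // n
                   size_t n, size_t k, size_t dims) {
  q.submit([&](sycl::handler& h) {
    auto p = points.get_access<sycl::access::mode::read>(h);
    auto c = centroids.get_access<sycl::access::mode::read>(h);
    auto l = labels.get_access<sycl::access::mode::write>(h);
    // One work-item per data point.
    h.parallel_for(sycl::range<1>(n), [=](sycl::id<1> idx) {
      const size_t i = idx[0];
      float best = std::numeric_limits<float>::max();
      int best_centroid = 0;
      for (size_t j = 0; j < k; ++j) {
        float dist = 0.0f;
        for (size_t d = 0; d < dims; ++d) {
          const float diff = p[i * dims + d] - c[j * dims + d];
          dist += diff * diff;  // squared Euclidean distance
        }
        if (dist < best) { best = dist; best_centroid = static_cast<int>(j); }
      }
      l[i] = best_centroid;  // index of the nearest centroid
    });
  });
}
```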
@@ -0,0 +1,29 @@
---
contributor: scott
date: '2022-05-18T08:08:10.490000+00:00'
title: 'A Case Study on the HACCmk Routine in SYCL on Integrated Graphics'
external_url: https://ieeexplore.ieee.org/document/9150310
authors:
- name: Zheming Jin
affiliation: Argonne National Laboratory
- name: Vitali Morozov
affiliation: Argonne National Laboratory
- name: Hal Finkel
affiliation: Argonne National Laboratory
tags:
- compute
- haccmk
- integrated-graphics
- case-study
---

As opposed to the Open Computing Language (OpenCL) programming model in which host and device codes are generally
written in different languages, the SYCL programming model can combine host and device codes for an application in a
type-safe way to improve development productivity. In this paper, we chose the HACCmk routine, a representative
compute-bound kernel, as a case study on the performance of the SYCL programming model targeting a heterogeneous
computing device. More specifically, we introduced the SYCL programming model, presented the OpenCL and SYCL
implementations of the routine, and compared the performance of the two implementations using offline and online
compilation on Intel® Iris™ Pro integrated GPUs. We found that the overhead of online compilation may become
significant compared to the execution time of a kernel. Compared to the OpenCL implementations, the SYCL implementation
can maintain performance when using offline compilation. The number of execution units in a GPU is critical to improving
the raw performance of a compute-bound kernel.
@@ -0,0 +1,25 @@
---
contributor: scott
date: '2022-05-18T08:08:10.490000+00:00'
title: 'Towards automated kernel selection in machine learning systems: A SYCL case study'
external_url: https://ieeexplore.ieee.org/document/9150358
authors:
- name: John Lawson
affiliation: Codeplay Software Ltd
tags:
- tuning
- sycl
- gpgpu
- machine-learning
- ai
---

Automated tuning of compute kernels is a popular area of research, mainly focused on finding optimal kernel parameters
for a problem with fixed input sizes. This approach is good for deploying machine learning models, where the network
topology is constant, but machine learning research often involves changing network topologies and hyperparameters.
Traditional kernel auto-tuning has limited impact in this case; a more general selection of kernels is required for
libraries to accelerate machine learning research. In this paper we present initial results using machine learning to
select kernels in a case study deploying high performance SYCL kernels in libraries that target a range of heterogeneous
devices from desktop GPUs to embedded accelerators. The techniques investigated apply more generally and could similarly
be integrated with other heterogeneous programming systems. By combining auto-tuning and machine learning, these kernel
selection processes can be deployed with little developer effort to achieve high performance on new hardware.
@@ -0,0 +1,31 @@
---
contributor: scott
date: '2020-11-13T08:08:10.490000+00:00'
title: 'Evaluating the Performance and Portability of Contemporary SYCL Implementations'
external_url: https://ieeexplore.ieee.org/document/9309045
authors:
- name: Beau Johnston
affiliation: Oak Ridge National Laboratory
- name: Jeffrey S. Vetter
affiliation: Oak Ridge National Laboratory
- name: Josh Milthorpe
affiliation: Australian National University
tags:
- benchmarks
- performance
- portability
---

SYCL is a single-source programming model for heterogeneous systems; it promises improved maintainability, productivity,
and opportunity for compiler optimization when compared to accelerator-specific programming models. Several
implementations of the SYCL standard have been developed over the past few years, including backends built on
contemporary accelerator languages such as OpenCL, CUDA, and HIP. These implementations vary widely in their support for
specific features of the standard and in their performance. As SYCL grows in popularity, developers need to know how
features are implemented across popular implementations in order to make proper design choices. In this paper, we
evaluate existing SYCL implementations for important SYCL features across a range of hardware in order to understand
SYCL's performance and portability. This work uses the newest SYCL benchmark suite (SYCL-Bench, 38 kernels) to evaluate
four existing implementations, comparing support for language features across backends and highlighting feature
completeness and performance. For features, we focus on the five major SYCL parallel constructs, using the matrix
multiplication benchmark as a motivating example. Our results show that the basic data parallelism construct is the best
choice for performance on current SYCL implementations, and we identify opportunities for improvement in several of the
SYCL implementations.
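For readers unfamiliar with the constructs being compared, the sketch below shows what the "basic data parallelism"
style looks like for a matrix multiplication: one work-item per output element, launched over a plain `sycl::range`
with no explicit work-group sizes, barriers, or local memory. It is our own minimal illustration, not a SYCL-Bench
kernel, and the function name and buffer shapes are assumptions.

```cpp
// Hedged sketch: naive matrix multiplication via basic data parallelism.
#include <sycl/sycl.hpp>

void matmul_basic(sycl::queue& q,
                  sycl::buffer<float, 2>& A,  // N x N
                  sycl::buffer<float, 2>& B,  // N x N
                  sycl::buffer<float, 2>& C,  // N x N
                  size_t N) {
  q.submit([&](sycl::handler& h) {
    auto a = A.get_access<sycl::access::mode::read>(h);
    auto b = B.get_access<sycl::access::mode::read>(h);
    auto c = C.get_access<sycl::access::mode::write>(h);
    // Basic data parallelism: one work-item per output element; the runtime
    // chooses the work-group decomposition.
    h.parallel_for(sycl::range<2>(N, N), [=](sycl::id<2> idx) {
      float sum = 0.0f;
      for (size_t k = 0; k < N; ++k)
        sum += a[idx[0]][k] * b[k][idx[1]];
      c[idx] = sum;
    });
  });
}
```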
@@ -0,0 +1,25 @@
---
contributor: scott
date: '2021-06-07T08:08:10.490000+00:00'
title: 'Automatic Parallelization of Structured Mesh Computations with SYCL'
external_url: https://ieeexplore.ieee.org/document/9555976
authors:
- name: Gábor Dániel Balogh
affiliation: Pázmány Péter Catholic University
- name: István Reguly
affiliation: Pázmány Péter Catholic University
tags:
- parallel-programming
- nvidia
- intel
---

Structured meshes are widely used for scientific computations such as Computational Fluid Dynamics (CFD) applications or
finance. Modern applications often have grid points in the millions, so parallelisation is crucial to perform such
computations. However, it is infeasible to port each application every time a new architecture arrives; hence, in recent
years the demand for automatic parallelisation and optimisation for the target hardware has been increasing. OPS (the
Oxford Parallel library for Structured mesh solvers) has shown good performance and scaling on a wide range of HPC
architectures. This research aims to extend the OPS framework with a SYCL backend to broaden the range of architectures
that OPS can support and further increase the performance portability of OPS applications. The Intel oneAPI
implementation struggles with reductions due to high synchronisation costs, but shows promising performance gains with
built-in reduction constructs on an Intel® Xeon® Gold 6226R. We compare the performance of hipSYCL on an NVIDIA V100 GPU
to the CUDA implementations.
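The built-in reduction constructs referred to above are the SYCL 2020 `sycl::reduction` interface, which lets the
runtime choose a reduction strategy instead of the programmer hand-writing work-group tree reductions. The sketch below
is a hedged, generic example of that interface (a simple sum over a field), not OPS backend code; the function name and
buffer handling are our assumptions.

```cpp
// Hedged sketch of a SYCL 2020 built-in reduction (sum of a field).
#include <sycl/sycl.hpp>

float sum_field(sycl::queue& q, const float* field, size_t n) {
  float result = 0.0f;
  {
    sycl::buffer<float> data(field, sycl::range<1>(n));   // read-only input
    sycl::buffer<float> out(&result, sycl::range<1>(1));  // scalar result
    q.submit([&](sycl::handler& h) {
      auto in = data.get_access<sycl::access::mode::read>(h);
      // The runtime picks the reduction strategy for the target device.
      auto red = sycl::reduction(out, h, sycl::plus<float>());
      h.parallel_for(sycl::range<1>(n), red,
                     [=](sycl::id<1> i, auto& sum) { sum += in[i]; });
    });
  }  // buffer destruction writes the reduced value back into result
  return result;
}
```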
@@ -0,0 +1,28 @@
---
contributor: scott
date: '2021-11-14T08:08:10.490000+00:00'
title: 'Benchmarking and Extending SYCL Hierarchical Parallelism'
external_url: https://ieeexplore.ieee.org/document/9654235
authors:
- name: Tom Deakin
affiliation: University of Bristol
- name: Simon McIntosh-Smith
affiliation: University of Bristol
- name: Aksel Alpay
affiliation: Universität Heidelberg
- name: Vincent Heuveline
affiliation: Universität Heidelberg
tags:
- benchmarks
- extending
- parallelism
---

SYCL is an open-standard parallel programming model from Khronos for programming heterogeneous devices. It allows
single-source programming of diverse attached devices in a cross-platform manner in modern C++. SYCL provides different
layers of parallel abstractions, including Single Instruction Multiple Thread (SIMT) kernels, data-parallel loop
concurrency, and hierarchical parallelism. We discuss Scoped Parallelism as an extension to the existing Hierarchical
Parallelism in SYCL, and highlight the advantages and disadvantages of these models from the perspective of the
programmer and of an implementer of SYCL. In this paper, we compare writing benchmark programs using the SIMT kernel,
hierarchical parallelism, and scoped parallelism paradigms, and present results running on a high-performance CPU and
GPU.
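For context, the existing hierarchical parallelism that the paper builds on is expressed in SYCL through
`parallel_for_work_group` and `parallel_for_work_item`: the outer lambda runs logically once per work-group and the
inner one once per work-item. The following is a minimal hedged sketch of that API on a toy kernel; it is not one of
the paper's benchmarks nor its proposed scoped-parallelism extension, and the function name is ours.

```cpp
// Hedged sketch of SYCL hierarchical parallelism (toy element-wise scaling).
#include <sycl/sycl.hpp>

void scale_hierarchical(sycl::queue& q, sycl::buffer<float, 1>& buf,
                        size_t n_groups, size_t group_size) {
  q.submit([&](sycl::handler& h) {
    auto acc = buf.get_access<sycl::access::mode::read_write>(h);
    // Outer lambda: executes logically once per work-group.
    h.parallel_for_work_group(
        sycl::range<1>(n_groups), sycl::range<1>(group_size),
        [=](sycl::group<1> g) {
          // Code here is at work-group scope (e.g. staging into local memory).
          // Inner lambda: executes once per work-item in the group.
          g.parallel_for_work_item([&](sycl::h_item<1> item) {
            const size_t i = item.get_global_id(0);
            acc[i] *= 2.0f;
          });
        });
  });
}
```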
@@ -0,0 +1,44 @@
---
contributor: scott
date: '2021-11-14T08:08:10.490000+00:00'
title: 'Case Study of Using Kokkos and SYCL as Performance-Portable Frameworks for Milc-Dslash Benchmark on NVIDIA, AMD and Intel GPUs'
external_url: https://ieeexplore.ieee.org/document/9652859
authors:
- name: Amanda S. Dufek
affiliation: NERSC/LBNL
- name: Rahulkumar Gayatri
affiliation: NERSC/LBNL
- name: Neil Mehta
affiliation: NERSC/LBNL
- name: Douglas Doerfler
affiliation: NERSC/LBNL
- name: Brandon Cook
affiliation: NERSC/LBNL
- name: Yasaman Ghadar
affiliation: Argonne National Laboratory
- name: Carleton DeTar
affiliation: University of Utah
tags:
- kokkos
- milc-dslash
- performance
- portability
- nvidia
- intel
- amd
---

Six of the top ten supercomputers in the TOP500 list from June 2021 rely on NVIDIA GPUs to achieve their peak compute
bandwidth. With the announcement of Aurora, Frontier, and El Capitan, Intel and AMD have also entered the domain of
providing GPUs for scientific computing. A consequence of the increased diversity in the GPU landscape is the emergence
of portable programming models such as Kokkos, SYCL, OpenCL, and OpenMP, which allow application developers to maintain
a single-source code across a diverse range of hardware architectures. While the portable frameworks try to optimize the
compute resource usage on a given architecture, it is the programmer's responsibility to expose parallelism in an
application that can take advantage of the thousands of processing elements available on GPUs. In this paper, we
introduce a GPU-friendly parallel implementation of Milc-Dslash that exposes multiple hierarchies of parallelism in the
algorithm. Milc-Dslash was designed to serve as a benchmark with highly optimized matrix-vector multiplications to
measure resource utilization on GPU systems. The parallel hierarchies in the Milc-Dslash algorithm are mapped onto the
target hardware using the Kokkos and SYCL programming models. We present the performance achieved by the Kokkos and SYCL
implementations of Milc-Dslash on an NVIDIA A100 GPU, an AMD MI100 GPU, and an Intel Gen9 GPU. Additionally, we compare
the Kokkos and SYCL performance with that of versions written in the CUDA and HIP programming models on the NVIDIA A100
GPU and AMD MI100 GPU, respectively.
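To give a flavour of what mapping parallel hierarchies onto hardware can look like in SYCL, the hedged sketch below maps
an outer lattice-site level to work-groups and an inner colour-index level to work-items through an `nd_range`. This is
our own simplified illustration with made-up names and a placeholder body, not the authors' code; the real Milc-Dslash
kernels apply SU(3) matrix-vector multiplications at each site.

```cpp
// Hedged illustration: two levels of parallelism mapped onto a SYCL nd_range
// (lattice sites as work-groups, colour indices as work-items within a group).
#include <sycl/sycl.hpp>

void site_color_kernel(sycl::queue& q, sycl::buffer<float, 1>& field,
                       size_t n_sites, size_t n_colors) {
  q.submit([&](sycl::handler& h) {
    auto f = field.get_access<sycl::access::mode::read_write>(h);
    sycl::nd_range<1> ndr(sycl::range<1>(n_sites * n_colors),  // global size
                          sycl::range<1>(n_colors));           // work-group size
    h.parallel_for(ndr, [=](sycl::nd_item<1> it) {
      const size_t site  = it.get_group(0);     // outer hierarchy: lattice site
      const size_t color = it.get_local_id(0);  // inner hierarchy: colour index
      // Placeholder update; a real Dslash kernel would apply an SU(3)
      // matrix-vector product to the spinor components at this site.
      f[site * n_colors + color] *= 2.0f;
    });
  });
}
```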