
Commit 632fbc3
Author: Scott Straughan
Fixed line length overflowing and some typos.
1 parent: caa7424

7 files changed, 87 insertions(+), 75 deletions(-)

content/research_papers/2023/2023-10-01-open-sycl-on-heterogeneous-gpu-systems-a-case-of-study.md

Lines changed: 16 additions & 15 deletions
@@ -16,18 +16,19 @@ tags:
- hip
---

Computational platforms for high-performance scientific applications are becoming more heterogeneous, including hardware accelerators such as multiple GPUs. Applications in a wide variety of scientific fields require efficient and careful management of the computational resources of this type of hardware to obtain the best possible performance. However, there are currently different GPU vendors, architectures and families that can be found in heterogeneous clusters or machines. Programming with the vendor-provided languages or frameworks, and optimizing for specific devices, may become cumbersome and compromise portability to other systems. To overcome this problem, several proposals for high-level heterogeneous programming have appeared, trying to reduce the development effort and increase functional and performance portability, specifically when using GPU hardware accelerators. This paper evaluates the SYCL programming model, using the Open SYCL compiler, from two different perspectives: the performance it offers when dealing with single or multiple GPU devices from the same or different vendors, and the development effort required to implement the code. We use the Finite Time Lyapunov Exponent calculation over two real-world scenarios as a case study and compare the performance and the development effort of its Open SYCL-based version against the equivalent versions that use CUDA or HIP. Based on the experimental results, we observe that the use of SYCL does not lead to a remarkable overhead in terms of GPU kernel execution time. In general terms, the Open SYCL development effort for the host code is lower than that observed with CUDA or HIP. Moreover, the SYCL version can take advantage of both CUDA and AMD GPU devices simultaneously much more easily than directly using the vendor-specific programming solutions.

content/research_papers/2023/2023-10-24-a-performance-portable-sycl-implementation-of-crk-hacc-for-exascale.md

Lines changed: 11 additions & 9 deletions
@@ -18,12 +18,14 @@ tags:
- heterogeneous-programming
---

The first generation of exascale systems will include a variety of machine architectures, featuring GPUs from multiple vendors. As a result, many developers are interested in adopting portable programming models to avoid maintaining multiple versions of their code. It is necessary to document experiences with such programming models to assist developers in understanding the advantages and disadvantages of different approaches.

To this end, this paper evaluates the performance portability of a SYCL implementation of a large-scale cosmology application (CRK-HACC) running on GPUs from three different vendors: AMD, Intel, and NVIDIA. We detail the process of migrating the original code from CUDA to SYCL and show that specializing kernels for specific targets can greatly improve performance portability without significantly impacting programmer productivity. The SYCL version of CRK-HACC achieves a performance portability of 0.96 with a code divergence of almost 0, demonstrating that SYCL is a viable programming model for performance-portable applications.

content/research_papers/2024/2024-01-05-preliminary-report-initial-evaluation-of-stdpar-implementations-on-amd-gpus-for-hpc.md

Lines changed: 13 additions & 11 deletions
@@ -13,14 +13,16 @@ tags:
- hip
---

Recently, AMD platforms have not supported offloading C++17 PSTL (StdPar) programs to the GPU. Our previous work highlights how StdPar is able to achieve good performance across NVIDIA and Intel GPU platforms. In that work, we acknowledged AMD’s past efforts such as HCC, which unfortunately is deprecated and does not support newer hardware platforms. Recent developments by AMD, Codeplay, and AdaptiveCpp (previously known as hipSYCL or OpenSYCL) have enabled multiple paths for StdPar programs to run on AMD GPUs. This informal report discusses our experiences and evaluation of currently available StdPar implementations for AMD GPUs. We conduct benchmarks using our suite of HPC mini-apps with ports in many heterogeneous programming models, including StdPar. We then compare the performance of StdPar, using all available StdPar compilers, to contemporary heterogeneous programming models supported on AMD GPUs: HIP, OpenCL, Thrust, Kokkos, OpenMP, and SYCL. Where appropriate, we discuss issues encountered and workarounds applied during our evaluation. Finally, the StdPar model discussed in this report largely depends on Unified Shared Memory (USM) performance, and very few AMD GPUs have proper support for this feature. As such, this report demonstrates a proof-of-concept host-side userspace page-fault solution for models that use the HIP API. We discuss performance improvements achieved with our solution using the same set of benchmarks.

content/research_papers/2024/2024-01-24-lessons-learned-migrating-cuda-to-sycl-a-hep-case-study-with-root-rdataframe.md

Lines changed: 11 additions & 10 deletions
@@ -14,13 +14,14 @@ tags:
- heterogeneous-programming
---

The world’s largest particle accelerator, located at CERN, produces petabytes of data that need to be analysed efficiently to study the fundamental structures of our universe. ROOT is an open-source C++ data analysis framework developed for this purpose. Its high-level data analysis interface, RDataFrame, currently only supports CPU parallelism. Given the increasing heterogeneity in computing facilities, it becomes crucial to efficiently support GPGPUs to take advantage of the available resources. SYCL allows for a single-source implementation, which enables support for different architectures. In this paper, we describe a CUDA implementation and the migration process to SYCL, focusing on a core high energy physics operation in RDataFrame – histogramming. We detail the challenges that we faced when integrating SYCL into a large and complex code base. Furthermore, we perform an extensive comparative performance analysis of two SYCL compilers, AdaptiveCpp and DPC++, and the reference CUDA implementation. We highlight the performance bottlenecks that we encountered and the methodology used to detect them. Based on our findings, we provide actionable insights for developers of SYCL applications.

content/research_papers/2024/2024-03-09-xfluids-a-sycl-based-unified-cross-architecture-heterogeneous-simulation-solver-for-compressible-reacting-flows.md

Lines changed: 15 additions & 13 deletions
@@ -15,16 +15,18 @@ tags:
---

We present a cross-architecture high-order heterogeneous Navier-Stokes simulation solver, XFluids, for compressible reacting multi-component flows on different platforms. Multi-component reacting flows are ubiquitous in many scientific and engineering applications, while their numerical simulations are usually time-consuming due to the need to capture the underlying multiscale features. Although heterogeneous accelerated computing is significantly beneficial for large-scale simulations of these flows, effective utilization of the various heterogeneous accelerators with different architectures and programming models on the market remains a challenge. To address this, we develop XFluids in SYCL to perform acceleration targeted directly at different devices, without translating any source code. A variety of optimization techniques have been proposed to increase the computational performance of XFluids, including adaptive range assignment, partial eigen-system reconstruction, and hotspot device function optimizations. The solver has been open-sourced and tested on multiple GPUs from different mainstream vendors, indicating high portability. Through various benchmark cases, the accuracy of XFluids is demonstrated, with approximately no efficiency loss compared to existing GPU programming models such as CUDA and HIP. In addition, the MPI library is used to extend the solver to multi-GPU platforms, with GPU-enabled MPI supported. With this, the weak scaling of XFluids for multi-GPU devices is larger than 95% for 1024 GPUs. Finally, we simulate both the inert and reactive multi-component shock-bubble interaction problems with high-resolution meshes to investigate the reacting effects on the mixing, vortex stretching, and shape deformation of the bubble evolution.

content/research_papers/2024/2024-04-16-enabling-performance-portability-on-the-ligen-drug-discovery-pipeline.md

Lines changed: 15 additions & 10 deletions
@@ -21,13 +21,18 @@ tags:
- hip
---

In recent years, there has been a growing interest in developing high-performance implementations of drug discovery processing software. To target modern GPU architectures, such applications are mostly written in proprietary languages such as CUDA or HIP. However, with the increasing heterogeneity of modern HPC systems and the availability of accelerators from multiple hardware vendors, it has become critical to be able to efficiently execute drug discovery pipelines on multiple large-scale computing systems, with the ultimate goal of working on urgent computing scenarios. This article presents the challenges of migrating LiGen, an industrial drug discovery software pipeline, from CUDA to the SYCL programming model, an industry standard based on C++ that enables heterogeneous computing. We perform a structured analysis of the performance portability of the SYCL LiGen platform, examining several aspects of the approach from different perspectives. First, we analyze the performance portability provided by the high-level semantics of SYCL, including the most recent group algorithms and subgroups of SYCL 2020. Second, we analyze how low-level aspects such as kernel occupancy and register pressure affect the performance portability of the overall application. The experimental evaluation is performed on two different versions of LiGen, implementing two different parallelization patterns, by comparing them with a manually optimized CUDA version and by evaluating performance portability using both known and ad hoc metrics. The results show that, thanks to the combination of high-level SYCL semantics and some manual tuning, LiGen achieves native-comparable performance on NVIDIA GPUs, while also running on AMD GPUs.

content/videos/2024/2024-07-05-bring-your-code-to-riscv-accelerators-with-sycl.md

Lines changed: 6 additions & 7 deletions
@@ -9,11 +9,10 @@ tags:
- risc-v
---

This talk will show attendees how to overcome proprietary code with RISC-V and SYCL. They will learn how they can achieve code portability and adopt RISC-V hardware without losing their existing work, for greater productivity.

The talk will also highlight the ongoing research into pioneering applications for RISC-V, funded by the EU Horizon programme. AERO and SYCLOPS are two such projects. AERO seeks to enable the future heterogeneous EU cloud infrastructure, while SYCLOPS will bring the RISC-V and SYCL standards together into a single software stack for the first time.
