
Commit 632fbc3
Author: Scott Straughan
Fixed line length overflowing and some typos.
1 parent: caa7424

7 files changed, 87 insertions(+), 75 deletions(-)

content/research_papers/2023/2023-10-01-open-sycl-on-heterogeneous-gpu-systems-a-case-of-study.md

Lines changed: 16 additions & 15 deletions
@@ -16,18 +16,19 @@ tags:
- hip
---

Computational platforms for high-performance scientific applications are becoming more heterogeneous, including hardware accelerators such as multiple GPUs. Applications in a wide variety of scientific fields require efficient and careful management of the computational resources of this type of hardware to obtain the best possible performance. However, there are currently different GPU vendors, architectures and families that can be found in heterogeneous clusters or machines. Programming with the vendor-provided languages or frameworks, and optimizing for specific devices, may become cumbersome and compromise portability to other systems. To overcome this problem, several proposals for high-level heterogeneous programming have appeared, trying to reduce the development effort and increase functional and performance portability, specifically when using GPU hardware accelerators. This paper evaluates the SYCL programming model, using the Open SYCL compiler, from two different perspectives: the performance it offers when dealing with single or multiple GPU devices from the same or different vendors, and the development effort required to implement the code. We use the Finite Time Lyapunov Exponent calculation over two real-world scenarios as a case study and compare the performance and the development effort of its Open SYCL-based version against the equivalent versions that use CUDA or HIP. Based on the experimental results, we observe that the use of SYCL does not lead to a remarkable overhead in terms of GPU kernel execution time. In general terms, the Open SYCL development effort for the host code is lower than that observed with CUDA or HIP. Moreover, the SYCL version can take advantage of both CUDA and AMD GPU devices simultaneously much more easily than directly using the vendor-specific programming solutions.

content/research_papers/2023/2023-10-24-a-performance-portable-sycl-implementation-of-crk-hacc-for-exascale.md

Lines changed: 11 additions & 9 deletions
@@ -18,12 +18,14 @@ tags:
- heterogeneous-programming
---

The first generation of exascale systems will include a variety of machine architectures, featuring GPUs from multiple vendors. As a result, many developers are interested in adopting portable programming models to avoid maintaining multiple versions of their code. It is necessary to document experiences with such programming models to assist developers in understanding the advantages and disadvantages of different approaches.

To this end, this paper evaluates the performance portability of a SYCL implementation of a large-scale cosmology application (CRK-HACC) running on GPUs from three different vendors: AMD, Intel, and NVIDIA. We detail the process of migrating the original code from CUDA to SYCL and show that specializing kernels for specific targets can greatly improve performance portability without significantly impacting programmer productivity. The SYCL version of CRK-HACC achieves a performance portability of 0.96 with a code divergence of almost 0, demonstrating that SYCL is a viable programming model for performance-portable applications.

content/research_papers/2024/2024-01-05-preliminary-report-initial-evaluation-of-stdpar-implementations-on-amd-gpus-for-hpc.md

Lines changed: 13 additions & 11 deletions
@@ -13,14 +13,16 @@ tags:
- hip
---

Recently, AMD platforms have not supported offloading C++17 PSTL (StdPar) programs to the GPU. Our previous work highlights how StdPar is able to achieve good performance across NVIDIA and Intel GPU platforms. In that work, we acknowledged AMD’s past efforts such as HCC, which unfortunately is deprecated and does not support newer hardware platforms. Recent developments by AMD, Codeplay, and AdaptiveCpp (previously known as hipSYCL or OpenSYCL) have enabled multiple paths for StdPar programs to run on AMD GPUs. This informal report discusses our experiences and evaluation of currently available StdPar implementations for AMD GPUs. We conduct benchmarks using our suite of HPC mini-apps with ports in many heterogeneous programming models, including StdPar. We then compare the performance of StdPar, using all available StdPar compilers, to contemporary heterogeneous programming models supported on AMD GPUs: HIP, OpenCL, Thrust, Kokkos, OpenMP, and SYCL. Where appropriate, we discuss issues encountered and workarounds applied during our evaluation. Finally, the StdPar model discussed in this report largely depends on Unified Shared Memory (USM) performance, and very few AMD GPUs have proper support for this feature. As such, this report demonstrates a proof-of-concept host-side userspace page-fault solution for models that use the HIP API. We discuss performance improvements achieved with our solution using the same set of benchmarks.

content/research_papers/2024/2024-01-24-lessons-learned-migrating-cuda-to-sycl-a-hep-case-study-with-root-rdataframe.md

Lines changed: 11 additions & 10 deletions
@@ -14,13 +14,14 @@ tags:
- heterogeneous-programming
---

The world’s largest particle accelerator, located at CERN, produces petabytes of data that need to be analysed efficiently to study the fundamental structures of our universe. ROOT is an open-source C++ data analysis framework developed for this purpose. Its high-level data analysis interface, RDataFrame, currently only supports CPU parallelism. Given the increasing heterogeneity in computing facilities, it becomes crucial to efficiently support GPGPUs to take advantage of the available resources. SYCL allows for a single-source implementation, which enables support for different architectures. In this paper, we describe a CUDA implementation and the migration process to SYCL, focusing on a core high energy physics operation in RDataFrame – histogramming. We detail the challenges that we faced when integrating SYCL into a large and complex code base. Furthermore, we perform an extensive comparative performance analysis of two SYCL compilers, AdaptiveCpp and DPC++, and the reference CUDA implementation. We highlight the performance bottlenecks that we encountered and the methodology used to detect them. Based on our findings, we provide actionable insights for developers of SYCL applications.

content/research_papers/2024/2024-03-09-xfluids-a-sycl-based-unified-cross-architecture-heterogeneous-simulation-solver-for-compressible-reacting-flows.md

Lines changed: 15 additions & 13 deletions
@@ -15,16 +15,18 @@ tags:
---

We present a cross-architecture high-order heterogeneous Navier-Stokes simulation solver, XFluids, for compressible reacting multi-component flows on different platforms. Multi-component reacting flows are ubiquitous in many scientific and engineering applications, while their numerical simulations are usually time-consuming due to the need to capture the underlying multiscale features. Although heterogeneous accelerated computing is significantly beneficial for large-scale simulations of these flows, effective utilization of the various heterogeneous accelerators with different architectures and programming models on the market remains a challenge. To address this, we develop XFluids in SYCL to perform acceleration targeted directly at different devices, without translating any source code. A variety of optimization techniques have been proposed to increase the computational performance of XFluids, including adaptive range assignment, partial eigen-system reconstruction, and hotspot device function optimizations. The solver has been open-sourced and tested on multiple GPUs from different mainstream vendors, indicating high portability. Through various benchmark cases, the accuracy of XFluids is demonstrated, with approximately no efficiency loss compared to existing GPU programming models such as CUDA and HIP. In addition, the MPI library is used to extend the solver to multi-GPU platforms, with GPU-enabled MPI supported. With this, the weak scaling of XFluids for multi-GPU devices is larger than 95% for 1024 GPUs. Finally, we simulate both the inert and reactive multi-component shock-bubble interaction problems with high-resolution meshes to investigate the reacting effects on the mixing, vortex stretching, and shape deformation of the bubble evolution.

content/research_papers/2024/2024-04-16-enabling-performance-portability-on-the-ligen-drug-discovery-pipeline.md

Lines changed: 15 additions & 10 deletions
@@ -21,13 +21,18 @@ tags:
- hip
---

In recent years, there has been a growing interest in developing high-performance implementations of drug discovery processing software. To target modern GPU architectures, such applications are mostly written in proprietary languages such as CUDA or HIP. However, with the increasing heterogeneity of modern HPC systems and the availability of accelerators from multiple hardware vendors, it has become critical to be able to efficiently execute drug discovery pipelines on multiple large-scale computing systems, with the ultimate goal of working on urgent computing scenarios. This article presents the challenges of migrating LiGen, an industrial drug discovery software pipeline, from CUDA to the SYCL programming model, an industry standard based on C++ that enables heterogeneous computing. We perform a structured analysis of the performance portability of the SYCL LiGen platform, examining several aspects of the approach from different perspectives. First, we analyze the performance portability provided by the high-level semantics of SYCL, including the most recent group algorithms and subgroups of SYCL 2020. Second, we analyze how low-level aspects such as kernel occupancy and register pressure affect the performance portability of the overall application. The experimental evaluation is performed on two different versions of LiGen, implementing two different parallelization patterns, by comparing them with a manually optimized CUDA version and by evaluating performance portability using both known and ad hoc metrics. The results show that, thanks to the combination of high-level SYCL semantics and some manual tuning, LiGen achieves native-comparable performance on NVIDIA GPUs, while also running on AMD GPUs.

content/videos/2024/2024-07-05-bring-your-code-to-riscv-accelerators-with-sycl.md

Lines changed: 6 additions & 7 deletions
@@ -9,11 +9,10 @@ tags:
- risc-v
---

This talk will show attendees how to overcome proprietary code with RISC-V and SYCL. They will learn how they can achieve code portability and adopt RISC-V hardware without losing their existing work, for greater productivity.

The talk will also highlight the ongoing research into pioneering applications for RISC-V, funded by the EU Horizon programme. AERO and SYCLOPS are two such projects. AERO seeks to enable the future heterogeneous EU cloud infrastructure, while SYCLOPS will bring the RISC-V and SYCL standards together into a single software stack for the first time.
