Enable SYCL NVIDIA and AMD backends #2192
Tested with `CXX=icpx CC=icx AR=llvm-ar ./build.sh -Dsycl=nvidia` on Ubuntu 24.04 with CUDA 12.9 and oneAPI 2025.1. The CUDA Compute Capability can optionally be specified with `-Dcc_cuda`. If not specified, the default CUDA target of the DPC++ compiler is used, which means SYCL device code is precompiled for the lowest supported CC. When executed on a GPU with a different CC, it is recompiled at runtime for the specific architecture. In addition to the meson.build changes, remove a redundant `free(nullptr)` causing crashes in the SYCL NVIDIA backend.
Tested with `CXX=icpx CC=icx AR=llvm-ar ./build.sh -Dsycl=amd -Damd_gfx=90a` on Ubuntu 22.04 with ROCm 6.3.3 and oneAPI 2025.1. The new amd_gfx option is required as DPC++ does not support Just-In-Time compilation for AMD GPU code; it has to be precompiled for the right architecture when building the application. Fix the SYCL AMD fp16 backend, which was missing calls to the fp16 hipBLAS functions where needed. Also fix the hardcoded sub-group / warp / wavefront size of 32: some AMD GPUs have a wavefront size of 64, which has to be used instead.
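As background for the wavefront-size fix, here is a minimal sketch (not part of this PR) of how the sub-group sizes supported by a device can be queried through standard SYCL 2020 device info; the program and its output formatting are illustrative only:

```cpp
// Minimal sketch (not part of this PR): print the sub-group sizes reported by
// the default SYCL device. CDNA-class AMD GPUs (e.g. gfx90a) typically report
// 64, while NVIDIA and Intel GPUs typically include 32.
#include <sycl/sycl.hpp>

#include <iostream>

int main() {
  sycl::queue q{sycl::default_selector_v};
  sycl::device dev = q.get_device();

  std::cout << "Device: " << dev.get_info<sycl::info::device::name>() << '\n';
  std::cout << "Supported sub-group sizes:";
  for (size_t size : dev.get_info<sycl::info::device::sub_group_sizes>())
    std::cout << ' ' << size;
  std::cout << '\n';
  return 0;
}
```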
Looks good, thank you. I'll take a second look tomorrow before merging. BTW, do you have any performance numbers?
Pull Request Overview
Enables SYCL support for both NVIDIA and AMD backends, fixes fp16 GEMM calls, and makes subgroup sizes dynamic.
- Added `amd_gfx` Meson option and configured HIPBLAS/CUBLAS flags in `meson.build` for AMD/NVIDIA targets.
- Implemented `USE_HIPBLAS` paths in SYCL layers and GEMM routines, replacing single-precision calls with half-precision (`hipblasHgemm`).
- Introduced `SYCL_SUB_GROUP_SIZE` macro in common kernels and replaced hardcoded `32` in subgroup attributes; removed a redundant `sycl::free(nullptr)` (a hedged sketch of this subgroup-size pattern is shown after this list).
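For illustration, a hedged sketch of the kind of subgroup-annotation pattern described in the last bullet; the kernel, its names, and the launch configuration are made up for this example and are not taken from the PR:

```cpp
// Hedged sketch, not the PR's actual kernel code: a hardcoded
// [[sycl::reqd_sub_group_size(32)]] annotation is replaced by the
// SYCL_SUB_GROUP_SIZE macro so AMD GFX8/GFX9 devices can use their
// wavefront size of 64 while other targets keep 32.
#include <sycl/sycl.hpp>

#if defined(__HIP_PLATFORM_AMD__) && (defined(__GFX9__) || defined(__GFX8__))
#define SYCL_SUB_GROUP_SIZE 64
#else
#define SYCL_SUB_GROUP_SIZE 32
#endif

// Scales n elements in place. Assumes data is USM device- or shared-allocated
// and that n is a multiple of SYCL_SUB_GROUP_SIZE.
void scale_in_place(sycl::queue& q, float* data, float factor, size_t n) {
  q.parallel_for(
      sycl::nd_range<1>{sycl::range<1>{n}, sycl::range<1>{SYCL_SUB_GROUP_SIZE}},
      [=](sycl::nd_item<1> item)
          [[sycl::reqd_sub_group_size(SYCL_SUB_GROUP_SIZE)]] {
        data[item.get_global_id(0)] *= factor;
      });
}
```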
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
File | Description |
---|---|
src/neural/backends/sycl/layers.cc.dp.cpp | Added USE_HIPBLAS branches for fp16 GEMM, fixed redundant free, corrected SYCL submit closure |
src/neural/backends/sycl/fp16_kernels.dp.cpp | Defined SYCL_SUB_GROUP_SIZE based on AMD vs NVIDIA, updated subgroup annotations |
src/neural/backends/sycl/common_kernels.dp.cpp | Defined SYCL_SUB_GROUP_SIZE and updated subgroup annotations for multiple kernels |
meson_options.txt | Introduced amd_gfx build option for specifying AMD GPU arch |
meson.build | Configured library deps and compiler/link flags for SYCL AMD/NVIDIA backends |
Comments suppressed due to low confidence (1)

src/neural/backends/sycl/layers.cc.dp.cpp:361

- [nitpick] Consider adding unit or integration tests for the new `USE_HIPBLAS` fp16 GEMM paths to validate correctness on AMD hardware and prevent regressions.

```
#elif defined(USE_HIPBLAS)
```
```
@@ -30,6 +30,12 @@
#include "winograd_helper.h"
#include <cmath>

#if defined(__HIP_PLATFORM_AMD__) && (defined(__GFX9__) || defined(__GFX8__))
```
[nitpick] The `SYCL_SUB_GROUP_SIZE` macro is defined locally here; since it’s duplicated in multiple files, consider extracting it to a shared header to avoid divergence and improve maintainability.
good point, moved to sycl_common.h
Is it possible that we may also need to set `SYCL_SUB_GROUP_SIZE` for future architectures? Then it may make sense to use this pattern:
```
#if __has_include("params_override.h")
#include "params_override.h"
#endif
#ifndef SYCL_SUB_GROUP_SIZE
#if defined(__HIP_PLATFORM_AMD__) && (defined(__GFX9__) || defined(__GFX8__))
#define SYCL_SUB_GROUP_SIZE 64
#else
#define SYCL_SUB_GROUP_SIZE 32
#endif
#endif
```
If you'd like an interesting read, the full context for this very specific definition is ROCm/ROCm#4121 where an AMD engineer recommended this claiming it "will work without needing revisiting in the foreseeable future". I also can't imagine the other major vendors not supporting a sub-group size of 32 any time soon. Supporting other SYCL devices than AMD/Intel/NVIDIA GPUs would require big changes to the code as it is currently, so I'm quite confident this won't be needed for now. I would suggest implementing this pattern if/when a need for this comes up, but if you think it's useful I don't mind adding it.
Seems fine then.
```
@@ -30,6 +30,12 @@
#endif
#include "winograd_helper.h"

#if defined(__HIP_PLATFORM_AMD__) && (defined(__GFX9__) || defined(__GFX8__))
```
[nitpick] This defines `SYCL_SUB_GROUP_SIZE` again; extracting it to a common header would reduce duplication and ensure consistency across kernels.
```
@@ -1807,7 +1815,20 @@ static void cublasXgemm(transpose_type transa,
  });
}
#elif defined(USE_HIPBLAS)
  hipblasHandle_t handle = hipBlasContextManager::gethipBlasHandle_t();
  if (fp16) {
```
[nitpick] The `if (fp16)` blocks across multiple GEMM routines duplicate conversion and submission logic; extracting common fp16-path code into a helper could reduce repetition and simplify future updates.
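For context, a hedged sketch of the kind of shared fp16/fp32 dispatch helper this suggestion refers to; the wrapper name and signature are illustrative and not taken from the PR (error handling omitted):

```cpp
// Hedged sketch, not the PR's code: a single helper choosing between the
// half-precision and single-precision hipBLAS GEMM based on an fp16 flag.
// Header path as in recent ROCm releases; older releases use <hipblas.h>.
#include <hipblas/hipblas.h>

void gemm_dispatch(hipblasHandle_t handle, bool fp16,
                   hipblasOperation_t transa, hipblasOperation_t transb,
                   int m, int n, int k,
                   const void* alpha, const void* A, int lda,
                   const void* B, int ldb,
                   const void* beta, void* C, int ldc) {
  if (fp16) {
    // Half-precision path: calls hipblasHgemm with half-typed scalars/buffers.
    hipblasHgemm(handle, transa, transb, m, n, k,
                 static_cast<const hipblasHalf*>(alpha),
                 static_cast<const hipblasHalf*>(A), lda,
                 static_cast<const hipblasHalf*>(B), ldb,
                 static_cast<const hipblasHalf*>(beta),
                 static_cast<hipblasHalf*>(C), ldc);
  } else {
    // Single-precision path, mirroring the pre-existing fp32 code.
    hipblasSgemm(handle, transa, transb, m, n, k,
                 static_cast<const float*>(alpha),
                 static_cast<const float*>(A), lda,
                 static_cast<const float*>(B), ldb,
                 static_cast<const float*>(beta),
                 static_cast<float*>(C), ldc);
  }
}
```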
Refactoring the code structure is beyond the scope of this PR, which brings the minimal changes required to enable the backends while keeping everything else untouched. The new blocks in the HIP path follow the same style as the existing blocks for the CUDA path.
We’ve seen around 4000-5000 nodes/second with SYCL (fp32) on Intel Data Center GPU Max 1100, AMD MI210 and NVIDIA H100. We know that for the biggest and fastest GPUs the performance of the NVIDIA and AMD backends is limited by the CPU threading performance of the cuBLAS/hipBLAS task submissions. This can be mitigated with a “native command” SYCL extension, and we’re currently working on integrating that into the Velocity-Bench version in oneapi-src/Velocity-Bench#98. We’d like to upstream this work once it’s merged and well tested in Velocity-Bench, and this should bring the performance much closer to native CUDA/HIP code.
Building on top of #2152 from @KateBlueSky and @borg323, this PR fixes issues in the SYCL NVIDIA and AMD backends and adds the build configuration to enable them.
Tested with oneAPI 2025.1 on Ubuntu 22.04/24.04 with two NVIDIA and two AMD GPU models (one workstation/gaming and one data centre model from each vendor). Tested with `./lc0 bench` using both the `sycl` and `sycl-fp16` backends and the t3-512x15x16h-distill-swa-2767500 network.

The build works with the commands:
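For the NVIDIA and AMD backends respectively (matching the test commands quoted earlier in this thread; `90a` is an example `amd_gfx` value for an MI210-class GPU):

```
CXX=icpx CC=icx AR=llvm-ar ./build.sh -Dsycl=nvidia
CXX=icpx CC=icx AR=llvm-ar ./build.sh -Dsycl=amd -Damd_gfx=90a
```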
The (already existing) `cc_cuda` setting is optional; if not specified, the default CUDA target of the DPC++ compiler is used, which means SYCL device code is precompiled for the lowest supported CC. When executed on a GPU with a different CC, it is recompiled at runtime for the specific architecture.

The new `amd_gfx` option is required as DPC++ does not support Just-In-Time compilation for AMD GPU code. It has to be precompiled for the right architecture when building the application.

Code fixes include:
- Removing a redundant `sycl::free(nullptr)` that caused crashes in the SYCL NVIDIA backend.
- Calling the fp16 hipBLAS functions where needed in the SYCL AMD fp16 backend.
- Replacing the hardcoded sub-group / warp / wavefront size of 32, since some AMD GPUs have a wavefront size of 64 that has to be used instead.