Enable SYCL NVIDIA and AMD backends #2192

Merged
merged 3 commits on Jun 20, 2025

Conversation

@rafbiels rafbiels (Contributor) commented Jun 19, 2025

Building on top of #2152 from @KateBlueSky and @borg323, fix issues in the SYCL NVIDIA and AMD backends, and add the build configuration to enable them.

Tested with oneAPI 2025.1 on Ubuntu 22.04/24.04 with two NVIDIA and two AMD GPU models (one workstation/gaming and one data centre model from each vendor). Tested with `./lc0 bench` using both the `sycl` and `sycl-fp16` backends and the t3-512x15x16h-distill-swa-2767500 network.

The build works with the commands:

```
CXX=icpx CC=icx AR=llvm-ar ./build.sh -Dsycl=nvidia -Dcc_cuda=80
CXX=icpx CC=icx AR=llvm-ar ./build.sh -Dsycl=amd -Damd_gfx=90a
```

The (already existing) `cc_cuda` setting is optional; if not specified, the default CUDA target of the DPC++ compiler is used, which means SYCL device code is precompiled for the lowest supported compute capability (CC). When executed on a GPU with a different CC, it is recompiled at runtime for the specific architecture.

The new `amd_gfx` option is required because DPC++ does not support Just-In-Time compilation for AMD GPU code; it has to be precompiled for the right architecture when building the application.

Code fixes include:

  • Remove a redundant `free(nullptr)` causing crashes in the SYCL NVIDIA backend.
  • Fix the SYCL AMD fp16 backend, which was missing calls to the fp16 hipBLAS functions where needed.
  • Fix the hardcoded sub-group / warp / wavefront size of 32; some AMD GPUs have a wavefront size of 64, which must be used instead.

rafbiels added 2 commits June 18, 2025 13:01
Tested with
```
CXX=icpx CC=icx AR=llvm-ar ./build.sh -Dsycl=nvidia
```
on Ubuntu 24.04 with CUDA 12.9 and oneAPI 2025.1.

The CUDA Compute Capability can be optionally specified with `-Dcc_cuda`.
If not specified, the default CUDA target of the DPC++ compiler is used,
which means SYCL device code is precompiled for the lowest supported CC.
When executed on a GPU with different CC, it is recompiled at runtime
for the specific architecture.

In addition to meson.build changes, remove a redundant free(nullptr)
causing crashes in the SYCL NVIDIA backend.
Tested with
```
CXX=icpx CC=icx AR=llvm-ar ./build.sh -Dsycl=amd -Damd_gfx=90a
```
on Ubuntu 22.04 with ROCm 6.3.3 and oneAPI 2025.1.

The new amd_gfx option is required as DPC++ does not support
Just-In-Time compilation for AMD GPU code. It has to be precompiled
for the right architecture when building the application.

Fix the SYCL AMD fp16 backend which missed calling the fp16 hipBLAS
functions where needed.

Also fix the hardcoded sub-group / warp / wavefront size of 32. Some
AMD GPUs have wavefront size of 64 and this has to be used instead.
@borg323 borg323 (Member) commented Jun 19, 2025

Looks good, thank you. I'll take a second look tomorrow before merging.

BTW, do you have any performance numbers?

@borg323 borg323 requested a review from Copilot June 19, 2025 23:21
@Copilot Copilot AI left a comment

Pull Request Overview

Enables SYCL support for both NVIDIA and AMD backends, fixes fp16 GEMM calls, and makes subgroup sizes dynamic.

  • Added amd_gfx Meson option and configured HIPBLAS/CUBLAS flags in meson.build for AMD/NVIDIA targets.
  • Implemented USE_HIPBLAS paths in SYCL layers and GEMM routines, replacing single-precision calls with half-precision (hipblasHgemm).
  • Introduced SYCL_SUB_GROUP_SIZE macro in common kernels and replaced hardcoded 32 in subgroup attributes; removed a redundant sycl::free(nullptr).

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Summary per file:

  • src/neural/backends/sycl/layers.cc.dp.cpp — Added USE_HIPBLAS branches for fp16 GEMM, fixed redundant free, corrected SYCL submit closure
  • src/neural/backends/sycl/fp16_kernels.dp.cpp — Defined SYCL_SUB_GROUP_SIZE based on AMD vs NVIDIA, updated subgroup annotations
  • src/neural/backends/sycl/common_kernels.dp.cpp — Defined SYCL_SUB_GROUP_SIZE and updated subgroup annotations for multiple kernels
  • meson_options.txt — Introduced amd_gfx build option for specifying AMD GPU arch
  • meson.build — Configured library deps and compiler/link flags for SYCL AMD/NVIDIA backends
Comments suppressed due to low confidence (1)

src/neural/backends/sycl/layers.cc.dp.cpp:361

  • [nitpick] Consider adding unit or integration tests for the new USE_HIPBLAS fp16 GEMM paths to validate correctness on AMD hardware and prevent regressions.
    #elif defined(USE_HIPBLAS)

@@ -30,6 +30,12 @@
#include "winograd_helper.h"
#include <cmath>

#if defined(__HIP_PLATFORM_AMD__) && (defined(__GFX9__) || defined(__GFX8__))
Copilot AI commented Jun 19, 2025

[nitpick] The SYCL_SUB_GROUP_SIZE macro is defined locally here; since it’s duplicated in multiple files, consider extracting it to a shared header to avoid divergence and improve maintainability.


rafbiels (Contributor Author) commented

good point, moved to sycl_common.h

borg323 (Member) commented

Is it possible that we may also need to set SYCL_SUB_GROUP_SIZE for future architectures?
Then it may make sense to use this pattern:

```
#if __has_include("params_override.h")
#include "params_override.h"
#endif

#ifndef SYCL_SUB_GROUP_SIZE
#if defined(__HIP_PLATFORM_AMD__) && (defined(__GFX9__) || defined(__GFX8__))
#define SYCL_SUB_GROUP_SIZE 64
#else
#define SYCL_SUB_GROUP_SIZE 32
#endif
#endif
```

rafbiels (Contributor Author) commented

If you'd like an interesting read, the full context for this very specific definition is ROCm/ROCm#4121 where an AMD engineer recommended this claiming it "will work without needing revisiting in the foreseeable future". I also can't imagine the other major vendors not supporting a sub-group size of 32 any time soon. Supporting other SYCL devices than AMD/Intel/NVIDIA GPUs would require big changes to the code as it is currently, so I'm quite confident this won't be needed for now. I would suggest implementing this pattern if/when a need for this comes up, but if you think it's useful I don't mind adding it.

borg323 (Member) commented

Seems fine then.

@@ -30,6 +30,12 @@
#endif
#include "winograd_helper.h"

#if defined(__HIP_PLATFORM_AMD__) && (defined(__GFX9__) || defined(__GFX8__))
Copilot AI commented Jun 19, 2025

[nitpick] This defines SYCL_SUB_GROUP_SIZE again; extracting it to a common header would reduce duplication and ensure consistency across kernels.


@@ -1807,7 +1815,20 @@ static void cublasXgemm(transpose_type transa,
});
}
#elif defined(USE_HIPBLAS)
hipblasHandle_t handle = hipBlasContextManager::gethipBlasHandle_t();
if (fp16) {
Copilot AI commented Jun 19, 2025

[nitpick] The if (fp16) blocks across multiple GEMM routines duplicate conversion and submission logic; extracting common fp16-path code into a helper could reduce repetition and simplify future updates.


rafbiels (Contributor Author) commented

Refactoring the code structure is beyond the scope of this PR, which brings the minimal changes required to enable the backends while keeping everything else untouched. The new blocks in the HIP path follow the same style as the existing blocks for the CUDA path.

@rafbiels rafbiels (Contributor Author) commented

> BTW, do you have any performance numbers?

We’ve seen around 4000-5000 nodes/second with SYCL (fp32) on Intel Data Center GPU Max 1100, AMD MI210 and NVIDIA H100. We know that for the biggest and fastest GPUs the performance of the NVIDIA and AMD backends is limited by the CPU threading performance of the cuBLAS/hipBLAS task submissions. This can be mitigated with a “native command” SYCL extension and we’re currently working on integrating that into the Velocity-Bench version in oneapi-src/Velocity-Bench#98. We’d like to upstream this work once it’s merged and well tested in Velocity-Bench and this should bring the performance much closer to native CUDA/HIP code.

@borg323 borg323 merged commit 0c6deba into LeelaChessZero:master Jun 20, 2025
3 checks passed