Enable SYCL NVIDIA and AMD backends #2192
Tested with `CXX=icpx CC=icx AR=llvm-ar ./build.sh -Dsycl=nvidia` on Ubuntu 24.04 with CUDA 12.9 and oneAPI 2025.1. The CUDA Compute Capability can optionally be specified with `-Dcc_cuda`. If not specified, the default CUDA target of the DPC++ compiler is used, which means SYCL device code is precompiled for the lowest supported CC. When executed on a GPU with a different CC, it is recompiled at runtime for the specific architecture. In addition to the meson.build changes, remove a redundant `free(nullptr)` causing crashes in the SYCL NVIDIA backend.
Tested with `CXX=icpx CC=icx AR=llvm-ar ./build.sh -Dsycl=amd -Damd_gfx=90a` on Ubuntu 22.04 with ROCm 6.3.3 and oneAPI 2025.1. The new amd_gfx option is required as DPC++ does not support Just-In-Time compilation for AMD GPU code; it has to be precompiled for the right architecture when building the application. Fix the SYCL AMD fp16 backend, which was missing calls to the fp16 hipBLAS functions where needed. Also fix the hardcoded sub-group / warp / wavefront size of 32: some AMD GPUs have a wavefront size of 64, which has to be used instead.
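As background for the wavefront-size fix, here is a minimal sketch (not part of this PR) of how the sub-group sizes supported by a device can be queried through standard SYCL 2020 device info; the program and its output formatting are illustrative only:

```cpp
// Minimal sketch (not part of this PR): print the sub-group sizes reported by
// the default SYCL device. CDNA-class AMD GPUs (e.g. gfx90a) typically report
// 64, while NVIDIA and Intel GPUs typically include 32.
#include <sycl/sycl.hpp>

#include <iostream>

int main() {
  sycl::queue q{sycl::default_selector_v};
  sycl::device dev = q.get_device();

  std::cout << "Device: " << dev.get_info<sycl::info::device::name>() << '\n';
  std::cout << "Supported sub-group sizes:";
  for (size_t size : dev.get_info<sycl::info::device::sub_group_sizes>())
    std::cout << ' ' << size;
  std::cout << '\n';
  return 0;
}
```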
Looks good, thank you. I'll take a second look tomorrow before merging. BTW, do you have any performance numbers?
Pull Request Overview
Enables SYCL support for both NVIDIA and AMD backends, fixes fp16 GEMM calls, and makes subgroup sizes dynamic.
- Added `amd_gfx` Meson option and configured HIPBLAS/CUBLAS flags in `meson.build` for AMD/NVIDIA targets.
- Implemented `USE_HIPBLAS` paths in SYCL layers and GEMM routines, replacing single-precision calls with half-precision (`hipblasHgemm`).
- Introduced `SYCL_SUB_GROUP_SIZE` macro in common kernels and replaced hardcoded `32` in subgroup attributes; removed a redundant `sycl::free(nullptr)` (a hedged sketch of this subgroup-size pattern is shown after this list).
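For illustration, a hedged sketch of the kind of subgroup-annotation pattern described in the last bullet; the kernel, its names, and the launch configuration are made up for this example and are not taken from the PR:

```cpp
// Hedged sketch, not the PR's actual kernel code: a hardcoded
// [[sycl::reqd_sub_group_size(32)]] annotation is replaced by the
// SYCL_SUB_GROUP_SIZE macro so AMD GFX8/GFX9 devices can use their
// wavefront size of 64 while other targets keep 32.
#include <sycl/sycl.hpp>

#if defined(__HIP_PLATFORM_AMD__) && (defined(__GFX9__) || defined(__GFX8__))
#define SYCL_SUB_GROUP_SIZE 64
#else
#define SYCL_SUB_GROUP_SIZE 32
#endif

// Scales n elements in place. Assumes data is USM device- or shared-allocated
// and that n is a multiple of SYCL_SUB_GROUP_SIZE.
void scale_in_place(sycl::queue& q, float* data, float factor, size_t n) {
  q.parallel_for(
      sycl::nd_range<1>{sycl::range<1>{n}, sycl::range<1>{SYCL_SUB_GROUP_SIZE}},
      [=](sycl::nd_item<1> item)
          [[sycl::reqd_sub_group_size(SYCL_SUB_GROUP_SIZE)]] {
        data[item.get_global_id(0)] *= factor;
      });
}
```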
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
File | Description |
---|---|
src/neural/backends/sycl/layers.cc.dp.cpp | Added USE_HIPBLAS branches for fp16 GEMM, fixed redundant free, corrected SYCL submit closure |
src/neural/backends/sycl/fp16_kernels.dp.cpp | Defined SYCL_SUB_GROUP_SIZE based on AMD vs NVIDIA, updated subgroup annotations |
src/neural/backends/sycl/common_kernels.dp.cpp | Defined SYCL_SUB_GROUP_SIZE and updated subgroup annotations for multiple kernels |
meson_options.txt | Introduced amd_gfx build option for specifying AMD GPU arch |
meson.build | Configured library deps and compiler/link flags for SYCL AMD/NVIDIA backends |
Comments suppressed due to low confidence (1)

src/neural/backends/sycl/layers.cc.dp.cpp:361

- [nitpick] Consider adding unit or integration tests for the new `USE_HIPBLAS` fp16 GEMM paths to validate correctness on AMD hardware and prevent regressions.

```
#elif defined(USE_HIPBLAS)
```
```
@@ -30,6 +30,12 @@
#include "winograd_helper.h"
#include <cmath>

#if defined(__HIP_PLATFORM_AMD__) && (defined(__GFX9__) || defined(__GFX8__))
```
[nitpick] The `SYCL_SUB_GROUP_SIZE` macro is defined locally here; since it’s duplicated in multiple files, consider extracting it to a shared header to avoid divergence and improve maintainability.
good point, moved to sycl_common.h
Is it possible that we may also need to set `SYCL_SUB_GROUP_SIZE` for future architectures? Then it may make sense to use this pattern:
```
#if __has_include("params_override.h")
#include "params_override.h"
#endif
#ifndef SYCL_SUB_GROUP_SIZE
#if defined(__HIP_PLATFORM_AMD__) && (defined(__GFX9__) || defined(__GFX8__))
#define SYCL_SUB_GROUP_SIZE 64
#else
#define SYCL_SUB_GROUP_SIZE 32
#endif
#endif
```
If you'd like an interesting read, the full context for this very specific definition is ROCm/ROCm#4121 where an AMD engineer recommended this claiming it "will work without needing revisiting in the foreseeable future". I also can't imagine the other major vendors not supporting a sub-group size of 32 any time soon. Supporting other SYCL devices than AMD/Intel/NVIDIA GPUs would require big changes to the code as it is currently, so I'm quite confident this won't be needed for now. I would suggest implementing this pattern if/when a need for this comes up, but if you think it's useful I don't mind adding it.
Seems fine then.
```
@@ -30,6 +30,12 @@
#endif
#include "winograd_helper.h"

#if defined(__HIP_PLATFORM_AMD__) && (defined(__GFX9__) || defined(__GFX8__))
```
[nitpick] This defines `SYCL_SUB_GROUP_SIZE` again; extracting it to a common header would reduce duplication and ensure consistency across kernels.
```
@@ -1807,7 +1815,20 @@ static void cublasXgemm(transpose_type transa,
  });
}
#elif defined(USE_HIPBLAS)
  hipblasHandle_t handle = hipBlasContextManager::gethipBlasHandle_t();
  if (fp16) {
```
[nitpick] The `if (fp16)` blocks across multiple GEMM routines duplicate conversion and submission logic; extracting common fp16-path code into a helper could reduce repetition and simplify future updates.
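For context, a hedged sketch of the kind of shared fp16/fp32 dispatch helper this suggestion refers to; the wrapper name and signature are illustrative and not taken from the PR (error handling omitted):

```cpp
// Hedged sketch, not the PR's code: a single helper choosing between the
// half-precision and single-precision hipBLAS GEMM based on an fp16 flag.
// Header path as in recent ROCm releases; older releases use <hipblas.h>.
#include <hipblas/hipblas.h>

void gemm_dispatch(hipblasHandle_t handle, bool fp16,
                   hipblasOperation_t transa, hipblasOperation_t transb,
                   int m, int n, int k,
                   const void* alpha, const void* A, int lda,
                   const void* B, int ldb,
                   const void* beta, void* C, int ldc) {
  if (fp16) {
    // Half-precision path: calls hipblasHgemm with half-typed scalars/buffers.
    hipblasHgemm(handle, transa, transb, m, n, k,
                 static_cast<const hipblasHalf*>(alpha),
                 static_cast<const hipblasHalf*>(A), lda,
                 static_cast<const hipblasHalf*>(B), ldb,
                 static_cast<const hipblasHalf*>(beta),
                 static_cast<hipblasHalf*>(C), ldc);
  } else {
    // Single-precision path, mirroring the pre-existing fp32 code.
    hipblasSgemm(handle, transa, transb, m, n, k,
                 static_cast<const float*>(alpha),
                 static_cast<const float*>(A), lda,
                 static_cast<const float*>(B), ldb,
                 static_cast<const float*>(beta),
                 static_cast<float*>(C), ldc);
  }
}
```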
Refactoring the code structure is beyond the scope of this PR, which brings the minimal changes required to enable the backends while keeping everything else untouched. The new blocks in the HIP path follow the same style as the existing blocks for the CUDA path.
We’ve seen around 4000-5000 nodes/second with SYCL (fp32) on Intel Data Center GPU Max 1100, AMD MI210 and NVIDIA H100. We know that for the biggest and fastest GPUs the performance of the NVIDIA and AMD backends is limited by the CPU threading performance of the cuBLAS/hipBLAS task submissions. This can be mitigated with a “native command” SYCL extension, and we’re currently working on integrating that into the Velocity-Bench version in oneapi-src/Velocity-Bench#98. We’d like to upstream this work once it’s merged and well tested in Velocity-Bench, and this should bring the performance much closer to native CUDA/HIP code.
Building on top of #2152 from @KateBlueSky and @borg323, this PR fixes issues in the SYCL NVIDIA and AMD backends and adds the build configuration to enable them.
Tested with oneAPI 2025.1 on Ubuntu 22.04/24.04 with two NVIDIA and two AMD GPU models (one workstation/gaming and one data centre model from each vendor). Tested with `./lc0 bench` using both the `sycl` and `sycl-fp16` backends and the t3-512x15x16h-distill-swa-2767500 network.

The build works with the commands:
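For the NVIDIA and AMD backends respectively (matching the test commands quoted earlier in this thread; `90a` is an example `amd_gfx` value for an MI210-class GPU):

```
CXX=icpx CC=icx AR=llvm-ar ./build.sh -Dsycl=nvidia
CXX=icpx CC=icx AR=llvm-ar ./build.sh -Dsycl=amd -Damd_gfx=90a
```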
The (already existing) `cc_cuda` setting is optional; if not specified, the default CUDA target of the DPC++ compiler is used, which means SYCL device code is precompiled for the lowest supported CC. When executed on a GPU with a different CC, it is recompiled at runtime for the specific architecture.

The new `amd_gfx` option is required as DPC++ does not support Just-In-Time compilation for AMD GPU code. It has to be precompiled for the right architecture when building the application.

Code fixes include:
- Removing a redundant `sycl::free(nullptr)` that caused crashes in the SYCL NVIDIA backend.
- Calling the fp16 hipBLAS functions where needed in the SYCL AMD fp16 backend.
- Replacing the hardcoded sub-group / warp / wavefront size of 32, since some AMD GPUs have a wavefront size of 64 that has to be used instead.