Commit 10e1d00

[CI] Add GPU reset for BMG (#17381)
I talked to the KMD team and they recommended doing it this way for the `xe` module. Maybe this will improve stability. I hardcoded the PCI ID but we only have one Linux BMG runner so it should be fine for now. Confirmed this works by checking the dmesg log of the runner.

Signed-off-by: Sarnie, Nick <nick.sarnie@intel.com>
1 parent 57cece2 commit 10e1d00

6 files changed, +26 -10 lines changed

.github/workflows/sycl-linux-precommit.yml

Lines changed: 2 additions & 3 deletions
@@ -112,13 +112,12 @@ jobs:
     runner: '["Linux", "bmg"]'
     image_options: -u 1001 --device=/dev/dri -v /dev/dri/by-path:/dev/dri/by-path --privileged --cap-add SYS_ADMIN
     target_devices: level_zero:gpu
-    # The new Xe kernel driver used by BMG doesn't support resetting.
-    reset_intel_gpu: false
+    reset_intel_gpu: true
   - name: SPIR-V Backend / Intel Battlemage Graphics
     runner: '["Linux", "bmg"]'
     image_options: -u 1001 --device=/dev/dri -v /dev/dri/by-path:/dev/dri/by-path --privileged --cap-add SYS_ADMIN
     target_devices: level_zero:gpu;opencl:gpu;opencl:cpu
-    reset_intel_gpu: false
+    reset_intel_gpu: true
     extra_lit_opts: --param spirv-backend=True
     e2e_binaries_artifact: sycl_e2e_bin_default_spirv_backend
 uses: ./.github/workflows/sycl-linux-run-tests.yml

.github/workflows/sycl-linux-run-tests.yml

Lines changed: 14 additions & 7 deletions
@@ -201,15 +201,22 @@ jobs:
 steps:
 - name: Reset Intel GPU
   if: inputs.reset_intel_gpu == 'true'
+  shell: bash
   run: |
-    sudo mount -t debugfs none /sys/kernel/debug
-    base_dir="/sys/kernel/debug/dri"
+    if [[ '${{ inputs.runner }}' == '["Linux", "bmg"]' ]]; then
+      sudo bash -c 'echo 0000:05:00.0 > /sys/bus/pci/drivers/xe/unbind'
+      sudo bash -c 'echo 1 > /sys/bus/pci/devices/0000:05:00.0/reset'
+      sudo bash -c 'echo 0000:05:00.0 > /sys/bus/pci/drivers/xe/bind'
+    else
+      sudo mount -t debugfs none /sys/kernel/debug
+      base_dir="/sys/kernel/debug/dri"

-    for dir in "$base_dir"/*; do
-      if [ -f "$dir/i915_wedged" ]; then
-        sudo bash -c 'echo 1 > $0/i915_wedged' $dir
-      fi
-    done
+      for dir in "$base_dir"/*; do
+        if [ -f "$dir/i915_wedged" ]; then
+          sudo bash -c 'echo 1 > $0/i915_wedged' $dir
+        fi
+      done
+    fi
 - uses: actions/checkout@v4
   with:
     ref: ${{ inputs.devops_ref || inputs.repo_ref }}
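
The BMG branch of the reset step above hardcodes the PCI address 0000:05:00.0. As a rough sketch only, and not part of this commit, the same unbind/reset/bind sequence could look up whichever devices are currently bound to the driver instead of naming one. The function names `list_bound_devices` and `reset_pci_device` and the `pci_root` parameter are hypothetical, introduced here for illustration. The sketch assumes the usual sysfs layout, where each bound device appears as a symlink named by its PCI address inside the driver directory.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not from the commit); pci_root exists only so the
# functions can be pointed at a fake directory tree for testing.

# Print the PCI address of every device currently bound to a driver.
list_bound_devices() {
  local driver="$1"
  local pci_root="${2:-/sys/bus/pci}"
  # PCI address form: domain:bus:device.function, e.g. 0000:05:00.0
  local re='^[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]$'
  local entry name
  for entry in "$pci_root/drivers/$driver"/*; do
    name="${entry##*/}"
    # Bound devices are symlinks; skip bind, unbind, module, and so on.
    if [[ -L "$entry" && "$name" =~ $re ]]; then
      printf '%s\n' "$name"
    fi
  done
}

# Unbind the device from its driver, reset it, and bind it again.
# Stops at the first failing write. Needs root on a real system.
reset_pci_device() {
  local addr="$1" driver="$2"
  local pci_root="${3:-/sys/bus/pci}"
  echo "$addr" > "$pci_root/drivers/$driver/unbind" &&
    echo 1 > "$pci_root/devices/$addr/reset" &&
    echo "$addr" > "$pci_root/drivers/$driver/bind"
}

# Possible use, run as root:
#   for addr in $(list_bound_devices xe); do
#     reset_pci_device "$addr" xe
#   done
```

This would reset every device bound to `xe`, which matches the commit's behavior only on a machine with a single such GPU; a runner with several would need an extra filter.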

.github/workflows/sycl-nightly.yml

Lines changed: 1 addition & 0 deletions
@@ -82,6 +82,7 @@ jobs:
     runner: '["Linux", "bmg"]'
     image_options: -u 1001 --device=/dev/dri -v /dev/dri/by-path:/dev/dri/by-path --privileged --cap-add SYS_ADMIN
     target_devices: level_zero:gpu
+    reset_intel_gpu: true

   - name: Intel L0 Arc A-Series GPU
     runner: '["Linux", "arc"]'

sycl/test-e2e/KernelAndProgram/persistent-cache-multi-device.cpp

Lines changed: 3 additions & 0 deletions
@@ -9,6 +9,9 @@
 // XFAIL: spirv-backend && run-mode
 // XFAIL-TRACKER: CMPLRLLVM-64705

+// XFAIL: linux && arch-intel_gpu_bmg_g21 && !igc-dev && run-mode
+// XFAIL-TRACKER: https://github.com/intel/llvm/issues/17453
+
 // Test checks that persistent cache works correctly with multiple devices.

 #include <sycl/detail/core.hpp>

sycl/test-e2e/ProgramManager/multi_device_bundle/build_twice.cpp

Lines changed: 3 additions & 0 deletions
@@ -10,6 +10,9 @@
 // XFAIL: spirv-backend && run-mode
 // XFAIL-TRACKER: CMPLRLLVM-64705

+// XFAIL: linux && arch-intel_gpu_bmg_g21 && !igc-dev && run-mode
+// XFAIL-TRACKER: https://github.com/intel/llvm/issues/17453
+
 #include <sycl/detail/core.hpp>
 #include <sycl/kernel_bundle.hpp>

sycl/test-e2e/ProgramManager/multi_device_bundle/device_libs_and_caching.cpp

Lines changed: 3 additions & 0 deletions
@@ -32,6 +32,9 @@
 // XFAIL: spirv-backend && run-mode
 // XFAIL-TRACKER: CMPLRLLVM-64705

+// XFAIL: linux && arch-intel_gpu_bmg_g21 && !igc-dev && run-mode
+// XFAIL-TRACKER: https://github.com/intel/llvm/issues/17453
+
 #include <cmath>
 #include <complex>
 #include <sycl/detail/core.hpp>
