Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
64 changes: 63 additions & 1 deletion devops/actions/run-tests/benchmark/action.yml
Original file line number Diff line number Diff line change
Expand Up @@ -56,6 +56,7 @@ runs:
# TODO: in terms of security, is this overkill?
if [ -z "$(printf '%s' "$RUNNER_NAME" | grep -oE '^[a-zA-Z0-9_-]+$')" ]; then
echo "Bad runner name, please ensure runner name is [a-zA-Z0-9_-]."
echo "**Error:** Bad runner name '$RUNNER_NAME', please ensure runner name is [a-zA-Z0-9_-]." >> $GITHUB_STEP_SUMMARY
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's remove this if and fail right in above case default branch when RUNNER_TAG isn't PVC_PERF or BMG_PERF. Is there a case when RUNNER_TAG may be different than them?

exit 1
fi

Expand All @@ -73,7 +74,11 @@ runs:

# Make sure specified preset is a known value and is not malicious
python3 ./devops/scripts/benchmarks/presets.py query "$PRESET"
[ "$?" -ne 0 ] && exit 1 # Stop workflow if invalid preset
if [ "$?" -ne 0 ]; then
echo "Unknown preset $PRESET!"
echo "**Error:** Unknown benchmark preset '$PRESET'! Please see /devops/scripts/benchmarks/presets.py for available presets." >> $GITHUB_STEP_SUMMARY
exit 1 # Stop workflow if invalid preset
fi
echo "PRESET=$PRESET" >> $GITHUB_ENV
- name: Compute CPU core range to run benchmarks on
shell: bash
Expand Down Expand Up @@ -170,6 +175,8 @@ runs:
opencl:*) SAVE_SUFFIX="OCL" ;;
*) SAVE_SUFFIX="${ONEAPI_DEVICE_SELECTOR%%:*}";;
esac
# Reminder: MACHINE_TYPE determined here is used again below to ensure
# intended device is used for benchmarks.
case "$RUNNER_TAG" in
'["PVC_PERF"]') MACHINE_TYPE="PVC" ;;
'["BMG_PERF"]') MACHINE_TYPE="BMG" ;;
Expand All @@ -179,6 +186,61 @@ runs:
MACHINE_TYPE="${MACHINE_TYPE%_PERF=\"]}"
;;
esac

# Explicitly ensure we are using the intended device, not i.e. an
# integrated GPU on the runner.
#
# Note that we already use ZE_AFFINITY_MASK, but in the offchance the
# intended device isn't available, ZE_AFFINITY_MASK would result in the
# benchmarks running on an unwanted device.

echo "Searching for a $MACHINE_TYPE device..."
explicit_device_sel=""

sycl_ls_out="$(mktemp)"
Copy link
Contributor

@lslusarczyk lslusarczyk Oct 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am against adding this code in this way.
It is complex, not automatically tested and once we commit it we will be afraid of touching this code worrying of not breaking it.

Its goal is to produce right ONEAPI_DEVICE_SELECTOR for our two machines where we run tests on two GPUs.
The simplest solution will be just hardcode that selector here depending on MACHINE_TYPE or machine name. Simple, readable.

If we really need a code that automatically creates ONEAPI_DEVICE_SELECTOR out of MACHINE_TYPE and sycl-ls output then please let's create a testable code.

Below proposition in python. You may do similar in bash if you know how to nicely test it. I don't know, thus proposition in python.

  1. create a script get_oneapi_device_selector.py taking as argument MACHINE_TYPE
  2. inside create two functions:
  3. function get_oneapi_device_selector( sycl_ls_outpout : string, machine_type : string) returning string, and...
  4. function main executing sycl-ls and passing it to get_oneapi_device_selector
  5. write python doctests for get_oneapi_device_selector function. It will be a nice test as it takes strings and outputs string. They will also well document what this selector we expect to be having given input. Two cases will be enough - one with actual BMG sycl-ls output and one with PVC output. Of course we may add more (one card, two cards, no cards). Main function may be left untested - it just calls get_oneapi_device_selector function.
  6. make doctests that you created beging executed in our CI

This is (a bit) heavy, but there is too much technical debt in benchmarking scripts related to skipping unittests and I'd like to propose we do not produce this debt more. Once done this way we will have doctests working, which would be nice.

Again, I think just hardcoding selectors is good too (or even better). If you think a script is really a better solution, but we don't have time to write it in testable way now, then maybe hardcode selectors for now and leave TODO with pointer to this discussion and to your current commit, which we plan to turn into tested script in the future.

update: The easiest, (ugly, but acceptable because tested by doctests) way of writing get_oneapi_device_selector.py would be calling your bash code within get_oneapi_device_selector function. If in hurry, we may consider this way and refactor in the future. Refactor will be a pleasure, as we will already have tests :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its goal is to produce right ONEAPI_DEVICE_SELECTOR for our two machines where we run tests on two GPUs.
The simplest solution will be just hardcode that selector here depending on MACHINE_TYPE or machine name. Simple, readable.

I avoided hardcoding because:

  • In case numbering/output of sycl-ls changes (i.e., there are discussions right now for changes potentially changing numbering in sycl-ls), then hardcoding values could result in the wrong GPU being used
  • If a single device exists per backend, sycl-ls defaults to no device identifiers; one would just use e.g. level_zero:gpu. In the case that a runner is changed, such a value could also potentially result in the wrong GPU being used.

But, your concerns w.r.t. a lack of testing is entirely valid. A lack of testing outside of a precommit test-run of our code is not exactly reassuring. Given the slightly problematic nature of sycl-ls mentioned above and your objections, perhaps hardcoding to use SYCL_DEVICE_ALLOWLIST as Udit mentioned is a better and simpler solution instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah... the only way to ensure what device you are using with SYCL_DEVICE_ALLOWLIST is to rely on DeviceName, which is deprecated... If we were to go for the env vars path, I'd need to add this it ONEAPI_DEVICE_SELECTOR (and consult stakeholders, get approvals, etc.; unfortunately my internship is ending, and I won't have the time for this.)

The other solution is to turn this code into python instead; I will make the utilities and the doctests, but running these doctests in precommit will have to be a separate PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other solution is to turn this code into python instead; I will make the utilities and the doctests, but running these doctests in precommit will have to be a separate PR.

That would be great. Just run your tests manually right now. You or me or somebody else will add these tests to CI soon.

sycl-ls 2>/dev/null > "$sycl_ls_out"
while IFS= read -r device; do
# Separate device selector id from e.g. '[level_zero:gpu][level_zero:0]'
if [ -n "$(echo "$device" | grep -E '^\[[a-z_]+:[a-z]+\] ')" ]; then
# Only 1 device exists: [backend:dev] device
device_sel=${device%%] *}
device_sel=${device_sel#[}
elif [ -n "$(echo "$device" | grep -E '^\[[a-z_]+:[a-z]+\]\[[a-z_]+:[0-9]+\] ')" ]; then
# Multiple devices exist: [backend:dev][backend:id] device
device_sel=${device%%] *}
device_sel=${device_sel#*][}
else
echo "Unknown sycl-ls format: Expecting '[backend:dev] device' or '[backend:dev][backend:id] device'."
sycl-ls
echo "**Error:** Unknown sycl-ls format: Expecting '[backend:dev] device' or '[backend:dev][backend:id] device':" >> $GITHUB_STEP_SUMMARY
echo '```' >> $GITHUB_STEP_SUMMARY
sycl-ls >> $GITHUB_STEP_SUMMARY
echo '```' >> $GITHUB_STEP_SUMMARY
exit 1
fi
case "${device#* }" in
*'Data Center GPU Max'*) device_sel_type="PVC" ;;
*'B580 Graphics'*) device_sel_type="BMG" ;;
*) device_sel_type="unknown";;
esac
if [ "$device_sel_type" = "$MACHINE_TYPE" ]; then
echo "Device '$MACHINE_TYPE' found at $device_sel."
explicit_device_sel="$device_sel"
break
fi
done < "$sycl_ls_out"
if [ -z "$explicit_device_sel" ]; then
echo "Error: Unable to find MACHINE_TYPE '$MACHINE_TYPE' on runner!"
sycl-ls
echo "**Error:** Unable to find requested device '$MACHINE_TYPE' on runner $RUNNER_TAG! Devices available:" >> $GITHUB_STEP_SUMMARY
echo '```' >> $GITHUB_STEP_SUMMARY
sycl-ls >> $GITHUB_STEP_SUMMARY
echo '```' >> $GITHUB_STEP_SUMMARY
exit 1
fi
export ONEAPI_DEVICE_SELECTOR="$explicit_device_sel"

# Construct save name for benchmark results:
SAVE_NAME="${SAVE_PREFIX}_${MACHINE_TYPE}_${SAVE_SUFFIX}"
echo "SAVE_NAME=$SAVE_NAME" >> $GITHUB_ENV
SAVE_TIMESTAMP="$(date -u +'%Y%m%d_%H%M%S')" # Timestamps are in UTC time
Expand Down
Loading