-
Notifications
You must be signed in to change notification settings - Fork 791
[CI][Bench] Double-check that the device being used is the device we want #20383
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: sycl
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -56,6 +56,7 @@ runs: | |
# TODO: in terms of security, is this overkill? | ||
if [ -z "$(printf '%s' "$RUNNER_NAME" | grep -oE '^[a-zA-Z0-9_-]+$')" ]; then | ||
echo "Bad runner name, please ensure runner name is [a-zA-Z0-9_-]." | ||
echo "**Error:** Bad runner name '$RUNNER_NAME', please ensure runner name is [a-zA-Z0-9_-]." >> $GITHUB_STEP_SUMMARY | ||
exit 1 | ||
fi | ||
|
||
|
@@ -73,7 +74,11 @@ runs: | |
|
||
# Make sure specified preset is a known value and is not malicious | ||
python3 ./devops/scripts/benchmarks/presets.py query "$PRESET" | ||
[ "$?" -ne 0 ] && exit 1 # Stop workflow if invalid preset | ||
if [ "$?" -ne 0 ]; then | ||
echo "Unknown preset $PRESET!" | ||
echo "**Error:** Unknown benchmark preset '$PRESET'! Please see /devops/scripts/benchmarks/presets.py for available presets." >> $GITHUB_STEP_SUMMARY | ||
exit 1 # Stop workflow if invalid preset | ||
fi | ||
echo "PRESET=$PRESET" >> $GITHUB_ENV | ||
- name: Compute CPU core range to run benchmarks on | ||
shell: bash | ||
|
@@ -170,6 +175,8 @@ runs: | |
opencl:*) SAVE_SUFFIX="OCL" ;; | ||
*) SAVE_SUFFIX="${ONEAPI_DEVICE_SELECTOR%%:*}";; | ||
esac | ||
# Reminder: MACHINE_TYPE determined here is used again below to ensure | ||
# intended device is used for benchmarks. | ||
case "$RUNNER_TAG" in | ||
'["PVC_PERF"]') MACHINE_TYPE="PVC" ;; | ||
'["BMG_PERF"]') MACHINE_TYPE="BMG" ;; | ||
|
@@ -179,6 +186,61 @@ runs: | |
MACHINE_TYPE="${MACHINE_TYPE%_PERF=\"]}" | ||
;; | ||
esac | ||
|
||
# Explicitly ensure we are using the intended device, not i.e. an | ||
# integrated GPU on the runner. | ||
# | ||
# Note that we already use ZE_AFFINITY_MASK, but in the offchance the | ||
# intended device isn't available, ZE_AFFINITY_MASK would result in the | ||
# benchmarks running on an unwanted device. | ||
|
||
echo "Searching for a $MACHINE_TYPE device..." | ||
explicit_device_sel="" | ||
|
||
sycl_ls_out="$(mktemp)" | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am against adding this code in this way. Its goal is to produce right ONEAPI_DEVICE_SELECTOR for our two machines where we run tests on two GPUs. If we really need a code that automatically creates ONEAPI_DEVICE_SELECTOR out of MACHINE_TYPE and sycl-ls output then please let's create a testable code. Below proposition in python. You may do similar in bash if you know how to nicely test it. I don't know, thus proposition in python.
This is (a bit) heavy, but there is too much technical debt in benchmarking scripts related to skipping unittests and I'd like to propose we do not produce this debt more. Once done this way we will have doctests working, which would be nice. Again, I think just hardcoding selectors is good too (or even better). If you think a script is really a better solution, but we don't have time to write it in testable way now, then maybe hardcode selectors for now and leave TODO with pointer to this discussion and to your current commit, which we plan to turn into tested script in the future. update: The easiest, (ugly, but acceptable because tested by doctests) way of writing get_oneapi_device_selector.py would be calling your bash code within get_oneapi_device_selector function. If in hurry, we may consider this way and refactor in the future. Refactor will be a pleasure, as we will already have tests :) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I avoided hardcoding because:
But, your concerns w.r.t. a lack of testing is entirely valid. A lack of testing outside of a precommit test-run of our code is not exactly reassuring. Given the slightly problematic nature of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah... the only way to ensure what device you are using with The other solution is to turn this code into python instead; I will make the utilities and the doctests, but running these doctests in precommit will have to be a separate PR. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
That would be great. Just run your tests manually right now. You or me or somebody else will add these tests to CI soon. |
||
sycl-ls 2>/dev/null > "$sycl_ls_out" | ||
while IFS= read -r device; do | ||
# Separate device selector id from e.g. '[level_zero:gpu][level_zero:0]' | ||
if [ -n "$(echo "$device" | grep -E '^\[[a-z_]+:[a-z]+\] ')" ]; then | ||
# Only 1 device exists: [backend:dev] device | ||
device_sel=${device%%] *} | ||
device_sel=${device_sel#[} | ||
elif [ -n "$(echo "$device" | grep -E '^\[[a-z_]+:[a-z]+\]\[[a-z_]+:[0-9]+\] ')" ]; then | ||
# Multiple devices exist: [backend:dev][backend:id] device | ||
device_sel=${device%%] *} | ||
device_sel=${device_sel#*][} | ||
else | ||
echo "Unknown sycl-ls format: Expecting '[backend:dev] device' or '[backend:dev][backend:id] device'." | ||
sycl-ls | ||
echo "**Error:** Unknown sycl-ls format: Expecting '[backend:dev] device' or '[backend:dev][backend:id] device':" >> $GITHUB_STEP_SUMMARY | ||
echo '```' >> $GITHUB_STEP_SUMMARY | ||
sycl-ls >> $GITHUB_STEP_SUMMARY | ||
echo '```' >> $GITHUB_STEP_SUMMARY | ||
exit 1 | ||
fi | ||
case "${device#* }" in | ||
*'Data Center GPU Max'*) device_sel_type="PVC" ;; | ||
*'B580 Graphics'*) device_sel_type="BMG" ;; | ||
*) device_sel_type="unknown";; | ||
esac | ||
if [ "$device_sel_type" = "$MACHINE_TYPE" ]; then | ||
echo "Device '$MACHINE_TYPE' found at $device_sel." | ||
explicit_device_sel="$device_sel" | ||
break | ||
fi | ||
done < "$sycl_ls_out" | ||
if [ -z "$explicit_device_sel" ]; then | ||
echo "Error: Unable to find MACHINE_TYPE '$MACHINE_TYPE' on runner!" | ||
sycl-ls | ||
echo "**Error:** Unable to find requested device '$MACHINE_TYPE' on runner $RUNNER_TAG! Devices available:" >> $GITHUB_STEP_SUMMARY | ||
echo '```' >> $GITHUB_STEP_SUMMARY | ||
sycl-ls >> $GITHUB_STEP_SUMMARY | ||
echo '```' >> $GITHUB_STEP_SUMMARY | ||
exit 1 | ||
fi | ||
export ONEAPI_DEVICE_SELECTOR="$explicit_device_sel" | ||
|
||
# Construct save name for benchmark results: | ||
SAVE_NAME="${SAVE_PREFIX}_${MACHINE_TYPE}_${SAVE_SUFFIX}" | ||
echo "SAVE_NAME=$SAVE_NAME" >> $GITHUB_ENV | ||
SAVE_TIMESTAMP="$(date -u +'%Y%m%d_%H%M%S')" # Timestamps are in UTC time | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's remove this if and fail right in above case default branch when RUNNER_TAG isn't PVC_PERF or BMG_PERF. Is there a case when RUNNER_TAG may be different than them?