[CI, enhancement] add pytorch+gpu testing ci #2494
Open
icfaust wants to merge 65 commits into uxlfoundation:main from icfaust:dev/pytorch_testing_CI
Description
This PR introduces a public GPU CI job to sklearnex. It is not fully featured, but it does provide the first public GPU testing. Due to issues with `n_jobs` support (which are being addressed in #2364), run times are extremely long but viable. The GPU is currently used only in the sklearn conformance steps, not in sklearnex/onedal testing, because this job tests without `dpctl` installed for GPU offloading. In the future it will extract queues from data in combination with PyTorch, which has had Intel GPU capabilities since PyTorch 2.4 (https://docs.pytorch.org/docs/stable/notes/get_start_xpu.html). This will allow GPU testing in the other steps.

This CI is important for at least three reasons. sklearn tests array_api using the CuPy, PyTorch, and array_api_strict frameworks, and PyTorch is the only non-`__sycl_usm_array_interface__` GPU data framework which is expected to work for both sklearn and sklearnex. Therefore: 1) it provides an array_api-only GPU testing framework to validate against sklearn conformance; 2) it is likely the first entry point for users who wish to use Intel GPU data natively (due to the user base size); 3) it validates that sklearnex can function properly without dpctl installed for GPU use, removing limitations on Python versions and dependency stability issues. Note that PyTorch DOES NOT FOLLOW THE ARRAY_API STANDARD; sklearn uses array_api_compat to shoehorn in PyTorch support. PyTorch has quirks that should be tested by sklearnex. This impacts how we design our estimators, as checking for `__array_namespace__` is insufficient if we wish to support PyTorch.

Unlike other public runners, this CI splits the build and test steps into separate jobs. The test step runs on a CPU-only runner and on a GPU runner at the same time. For simplicity it does not use a virtual environment such as conda or venv, but it can reuse all of the previously written infrastructure.
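The `__array_namespace__` point can be illustrated with a minimal sketch (the names here are hypothetical, not sklearnex's actual dispatch code): PyTorch tensors do not implement the attribute natively, so sklearn shims them through array_api_compat, and a naive attribute check would miss them.

```python
# Illustrative only: hypothetical helper, not sklearnex's dispatch logic.
def advertises_array_api(x):
    """True only if ``x`` natively implements the array API entry point."""
    return hasattr(x, "__array_namespace__")


class StrictArray:
    """Stand-in for an array-API-compliant container."""

    def __array_namespace__(self, api_version=None):
        # A real implementation returns its namespace module here.
        return None


print(advertises_array_api(StrictArray()))  # True
print(advertises_array_api([1.0, 2.0]))     # False: plain lists (and native
                                            # torch tensors) lack the attribute
```

This is why estimator dispatch cannot rely on `__array_namespace__` alone if PyTorch inputs are expected.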
It uses Python 3.12 and sklearn 1.4 for simplicity (i.e. to duplicate other GPU testing systems). This will be updated in a follow-up PR as the job sees further use (likely requiring different deselections).
When successful, a large increase in code coverage should be observed in codecov, as coverage data is also collected from this job. This will be important for validating the array_api changes coming to the codebase soon, which would otherwise be obscured by `dpctl`.

This required the following changes:
- Changes in `run_sklearn_tests.sh` were required to get the GPU deselections to work publicly.
- `assert_all_finite` would fail in combination with array_api dispatching; changes are made in daal4py to use DAAL only when the data is NumPy or a DataFrame.
- As PyTorch has a different use for the `size` attribute, changes needed to be made for it. The `__dlpack_device__` attribute is used instead, followed by `asarray` if `__array__` is available, or `from_dlpack` if the `__dlpack__` attribute is available. This required exposing some dlpack enums for verification.
- (See `array-api-compat` and enabling array api conformance tests #2079.)
- `test_learning_curve_some_failing_fits_warning[42]` is deselected because of an unknown issue with `_intercept_` and `SVC` on GPU (must be investigated).

This will require the following PRs afterwards (by theme):
- Changes to `onedal/tests/utils/_dataframes_support.py` and `onedal/tests/utils/_device_selection.py` to have public GPU testing in sklearnex.
- Updating `from_data` in `onedal/utils/_sycl_queue_manager.py` to extract queues from `__dlpack__` data (a special PyTorch interface is already in place in pybind11).
- Handling imports of `torch`, `dpnp`, and `dpctl.tensor` in a centralized way due to load times (likely following the strategy laid out in `array_api_compat`).
- Investigating `SVC` and the `_intercept_` attribute (for the `test_learning_curve_some_failing_fits_warning[42]` sklearn conformance test).

No performance benchmarks necessary.
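The conversion-path choice described in the changes above (prefer `asarray` when `__array__` exists, fall back to `from_dlpack` when `__dlpack__` is available) can be sketched roughly as follows. The helper names are illustrative, not the actual sklearnex functions; note that `torch.Tensor` redefines `size` as a method, which is why the logic keys off the protocol attributes rather than NumPy-style attributes.

```python
# Hedged sketch only: `dlpack_device` and `to_numpy` are hypothetical names,
# not sklearnex's actual helpers.
import numpy as np


def dlpack_device(x):
    # __dlpack_device__ returns (device_type, device_id); in the DLPack
    # enums, device_type 1 is CPU.
    return x.__dlpack_device__() if hasattr(x, "__dlpack_device__") else None


def to_numpy(x):
    if hasattr(x, "__array__"):
        # Preferred path when the object can describe itself as an array.
        return np.asarray(x)
    if hasattr(x, "__dlpack__"):
        # DLPack fallback (np.from_dlpack requires NumPy >= 1.23).
        return np.from_dlpack(x)
    raise TypeError(f"unsupported input type: {type(x)!r}")


arr = np.arange(3)
print(to_numpy(arr), dlpack_device(arr))
```

A real implementation would additionally consult the exposed dlpack device enums to decide whether a SYCL queue should be extracted from the data.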
PR should start as a draft, then move to the ready-for-review state after CI passes and all applicable checkboxes are closed.
This approach ensures that reviewers don't spend extra time asking for regular requirements.
You can remove a checkbox as not applicable only if it doesn't relate to this PR in any way.
For example, a PR with only docs updates doesn't require performance checkboxes, while a PR with any change to actual code should keep them and justify how the change is expected to affect performance (or the justification should be self-evident).
Checklist to comply with before moving PR from draft:
PR completeness and readability
Testing
Performance