-
Notifications
You must be signed in to change notification settings - Fork 182
ENH: Array API dispatching #2096
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Array API dispatching #2096
Conversation
added array-api-compat to test env
|
sklearnex/utils/_array_api.py
Outdated
# Stricter NumPy-based Array API implementation. The | ||
# array_api_strict.Array instances always have a dummy "device" attribute. | ||
"array_api_strict", | ||
"dpctl.tensor", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be reverted, since stock scikit-learn testing not well designed for having input data being located on single target device.
sklearnex/utils/_array_api.py
Outdated
yield array_namespace | ||
|
||
|
||
def yield_namespace_device_dtype_combinations(include_numpy_namespaces=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just for testing were added. Will be removed due to https://github.com/intel/scikit-learn-intelex/pull/2096/files#r1792608548
added versioning for the get_nnamespace
…m/samir-nasibli/scikit-learn-intelex into enh/array_api_dispatching_testing
Co-authored-by: david-cortes-intel <david.cortes@intel.com>
/intelci: run |
Co-authored-by: david-cortes-intel <david.cortes@intel.com>
/intelci: run |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@icfaust Just a few comments. Please ping me if you need explicit approval.
onedal_array_api: bool = False | ||
|
||
|
||
class oneDALEstimator: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great idea!
@@ -296,7 +296,7 @@ def test_memory_leaks(estimator, dataframe, queue, order, data_shape): | |||
|
|||
|
|||
@pytest.mark.skipif( | |||
os.getenv("ZES_ENABLE_SYSMAN") is None or not is_dpctl_device_available("gpu"), | |||
os.getenv("ZES_ENABLE_SYSMAN") is None or not is_dpctl_device_available(["gpu"]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh what a fail, glad you caught this.
-> #2438
Co-authored-by: Andreas Huber <9201869+ahuber21@users.noreply.github.com> Co-authored-by: david-cortes-intel <david.cortes@intel.com>
/intelci: run |
/intelci: run |
NOTE TO REVIEWERS: I have patched in a CI change in order to get CIs to run, #2440 must be brought in first, and the changes to the CI .yml must be removed |
def _asarray(data, xp, *args, **kwargs): | ||
"""Converted input object to array format of xp namespace provided.""" | ||
if hasattr(data, "__array_namespace__"): | ||
if hasattr(data, "__array_namespace__") or _supports_buffer_protocol(data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should it be an or
or an and
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs to be an or
in this case (meaning it will work for numpy or other array api formats)
CI is now green and no objections from my side |
/intelci: run |
Description
This PR enables the interface for scikit-learn-intelex estimators to dispatch Array API inputs to oneDAL. This removes the need to move data to host and back in the case of on-device data (see the need behind
use_raw_input
) in properly enabled estimators. This PR will enablearray_api_dispatching
for sklearn >=1.2 . As it is experimental in sklearn versions, the direct use of sklearn functions may fail (see Ensemble estimators'apply
). Scikit-learn 1.2 does not have functional array checking with array_api_dispatch, so sklearnex/onedal array_api support with zero-copy will begin in earnest with sklearn 1.3. Enablement of the zero-copy will be for sklearn versions >=1.3 going forward.To accomplish this, the following related changes had to be made
oneDALEstimator
is created inbase.py
which combines the capabilities ofPatchableEstimator
andExtensionEstimator
. All sklearnex estimators must inherit this before sklearn'sBaseEstimator
in the mro. This is enforced by testing intest_common.py
. This is needed in order to provide a tag for signifying an Array_API supported sklearnex estimator without cluttering the inheritance of the estimators. Note: as a consequence the HTML docs support is limited to the range of sklearn versions which are also supported (>=1.4), and is the logic is rewritten to minimize maintenance. This required changes to TSNE as well.get_tags
is introduced to check for sklearn's tags and to minimize maintenance. This duck-types the capability of recent (>=1.6) changes to the tag system to older sklearn versions, simplifying the code. This is needed to check for array_api support and onedal array_api support via the tag system with minimal maintenance for older versions.oneDALEstimator
sklearnex/_device_offload.py
is refactored. This new logic changes the cumbersome passing of strings from_get_backend
into a boolean flag. A second checkover of inputs was removed from that function.dispatch
is extensively refactored in order to check for array_api support. The logic there is of particular note due to the increase in branching. It attempts at all times to minimize the checking or movement of data unless necessary, therefore requiring checking array_api support first (as this is data-movement free).test_standard_estimator_patching
andtest_special_estimator_patching
now reference a combined_check_estimator_patching
function to reduce code duplication.test_standard_estimator_patching
now includes array_api dispatching for the array-api-strict data type. Changes coming in dpnp 0.17 and dpctl 0.19 will make array_api_dispatch a necessity, requiring that for dataframe support for non-numpy, non-pandas inputs going forward. This is a significant task and must be done separately and also meaning further modification to the test suite.array_api_strict
inputs are now done withconfig_context(array_api_dispatch=True)
which verifies proper operation of sklearnex patched methods. This required changes toKMeans.transform
,IncrementalEmpiricalCovariance.mahalanobis
,IncrementalEmpiricalCovariance.score
, andKMeans.fit_transform
beyond the changes to the standard dispatcher.test_fallback_to_host
is added due to the complexity ofdispatch
. This validates that the various scenarios with offloading + fallbacks place the data properly while still preserving the queueThe following bugs needed to be solved to get things working properly:
is_dpctl_device_available
calls from test_memory_usage need to be a list input, not a string.transform_output
used in _device_offloadsupport_input_format
andwrap_output_data
need to check fordefault
and notNone
, the array_api logic should only apply for non-numpy inputs, but should be able to convert numpy results. previously this logic was always skippedallow_fallback_to_host
is corrected for when used in conjunction withtarget_offload
_asarray
does not work for older numpy versions, as arrays do not have the__array_namespace__
attribute. Following the guideline from the array_api standard, a check for the buffer protocol is added in line to match: https://github.com/data-apis/array-api-strict/blob/25cc3d7ff0069b222d228d380e6d95bbf9a5dbcf/array_api_strict/_creation_functions.py#L50.dtype
-having objects which will then flatten, as array_api_strict arrays are very picky with the use of__iter__
. This will simplify the analysis of 2d-array types, which have otherwise split them unnecessarily into 1d arrays for individual analysis.ElasticNet
andLasso
Additional cleanup changes
test_patching
is removedLogisticRegression
, as cleanup to the inheritance is undertaken in this PR.get_tags
insklearnex.utils
makes the functionget_array_api_support_tag
unnecessary in sklearnex/_device_offload and is removed. It also attempts to match sklearn public API calls for simplicity in development.BaseEstimator
is removed, as this is inherited from the various sklearnSVM
algorithms in inheritanceuse_raw_input
indispatch
is inverted and placed at the beginning for simplicity in analysis at no cost in computational time.As none of the
onedal/
estimators are array_api enabled yet, the code is not fully tested, and additional PRs will be necessary to properly test and verify array_api support. These are not simple PRs and are to be included later before any estimator becomes array_api enabled. The initial testing should be sufficient for simple array_api support verification, and prevent non-functional changes for array_api support from entering the codebase.Follow up work must be made for support extracting queues from array_api inputs which comply with the new sycl_queue management system. As of this PR, it will not properly understand array_api devices. This will modify a separate file and must be a follow-on PR.
It is likely that this PR needs to be separated based on individual themes contained within that were necessary to enable the dispatching. Help from the reviewers is greatly desired (and can be based solely on the description).
NOTE: The expansion of testing for array_api to use
config_context(array_api_dispatch=True)
has uncovered some bugs which were heretofore unnoticed. 1st the array_api dispatching ofIncrementalEmpiricalCovariance.score
for sklearn <1.3 is not possible, and has been deselected. 2ndPCA.score_sample
andPCA.score
do not currently work as the return values from PCA.fit (._precision and ._mean) need to be the same type as the inputX
values. As of now, the return values and estimator attributes are not properly checked for type matching, and conversions to non-numpy types from oneDAL is inconsistently implemented. This is a significant undertaking and must be done as a follow-up with the individual array_api support in oneDAL while maintaining quality code. This will also require the completion of__dlpack__
support to oneDAL table types, such thatfrom_table
can return all array_api types.Checklist to comply with before moving PR from draft:
PR completeness and readability
Testing
Performance