
Commit 3337105

Merge pull request #5 from IntelPython/final-touches-for-scipy
Final touches to website content for SciPy 2024
2 parents 281617c + def97de commit 3337105

6 files changed, with 51 additions and 20 deletions.

content/en/_index.md

Lines changed: 3 additions & 3 deletions
@@ -18,10 +18,10 @@ title: Portable Data-Parallel Python Extensions with oneAPI
 <div class="lead text-center">
 <div class="mx-auto mb-5">
 <a class="btn btn-lg btn-secondary me-3 mb-4" href="https://IntelPython.github.io/portable-data-parallel-extensions-scipy-2024/docs/">
-First<i class="fa-solid fa-question ms-2 "></i>
+Get Started<i class="fa-solid fa-play ms-2"></i>
 </a>
-<a class="btn btn-lg btn-secondary me-3 mb-4" href="https://github.com/google/docsy-example">
-Demonstration<i class="fab fa-github ms-2 "></i>
+<a class="btn btn-lg btn-secondary me-3 mb-4" href="https://github.com/IntelPython/example-portable-data-parallel-extensions">
+Examples<i class="fab fa-github ms-2 "></i>
 </a>
 </div>
 </div>

content/en/docs/_index.md

Lines changed: 3 additions & 1 deletion
@@ -10,4 +10,6 @@ by [Nikita Grigorian](https://github.com/ndgrigorian) and [Oleksandr Pavlyk](htt

 This poster is intended to introduce writing portable data-parallel Python extensions using oneAPI.

-We present several examples, starting with the basics of initializing a USM (universal shared memory) array, then a KDE (kernel density estimation) with pure DPC++/Sycl, then a KDE Python extension, and finally how to write a portable Python extension which uses oneMKL.
+We present several examples, starting with the basics of initializing a USM (unified shared memory) array, then a KDE (kernel density estimation) with pure DPC++/Sycl, then a KDE Python extension, and finally how to write a portable Python extension which uses oneMKL.
+
+The examples can be found [here](https://github.com/IntelPython/example-portable-data-parallel-extensions).

content/en/docs/kde-cpp.md

Lines changed: 4 additions & 4 deletions
@@ -61,7 +61,7 @@ for further summation by another kernel operating in a similar fashion.
 ```

 Such an approach, known as tree reduction, is implemented in ``kernel_density_esimation_temps`` function found in
-``"steps/kernel_density_estimation_cpp/kde.hpp"``.
+[``"steps/kernel_density_estimation_cpp/kde.hpp"``](https://github.com/IntelPython/example-portable-data-parallel-extensions/blob/main/steps/kernel_density_estimation_cpp/kde.hpp).

 Use of temporary allocation can be avoided if each work-item atomically adds the value of the local sum to the
 appropriate zero-initialized location in the output array, as in implementation ``kernel_density_estimation_atomic_ref``
@@ -119,10 +119,10 @@ in the work-group without accessing the global memory. This could be done effici
 ```

 Complete implementation can be found in ``kernel_density_estimation_work_group_reduce_and_atomic_ref`` function
-in ``"steps/kernel_density_estimation_cpp/kde.hpp"``.
+in [``"steps/kernel_density_estimation_cpp/kde.hpp"``](https://github.com/IntelPython/example-portable-data-parallel-extensions/blob/main/steps/kernel_density_estimation_cpp/kde.hpp).

-These implementations are called from C++ application ``"steps/kernel_density_estimation_cpp/app.cpp"``, which
+These implementations are called from C++ application [``"steps/kernel_density_estimation_cpp/app.cpp"``](https://github.com/IntelPython/example-portable-data-parallel-extensions/blob/main/steps/kernel_density_estimation_cpp/app.cpp), which
 samples data uniformly distributed over unit cuboid, and estimates the density using Kernel Density Estimation
 and spherically symmetric multivariate Gaussian probability density function as the kernel.

-The application can be built using `CMake`, or `Meson`, please refer to [README](steps/kernel_density_estimation_cpp/README.md) document in that folder.
+The application can be built using `CMake`, or `Meson`, please refer to [README](https://github.com/IntelPython/example-portable-data-parallel-extensions/blob/main/steps/kernel_density_estimation_cpp/README.md) document in that folder.
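
For readers of this diff who want a concrete picture of what `kde-cpp.md` describes, the sketch below evaluates a kernel density estimate with a spherically symmetric Gaussian kernel over points sampled uniformly from the unit cube, and sums the per-sample contributions by pairwise (tree) reduction. It is a plain-Python illustration under stated assumptions, not the code in `kde.hpp`: the names `gaussian_kernel`, `tree_reduce`, and `kde`, the bandwidth `h = 0.2`, the dimension, and the sample count are all chosen here for illustration, and the reduction runs sequentially rather than across SYCL work-items.

```python
import math
import random


def gaussian_kernel(x, y, h):
    """Spherically symmetric Gaussian density with bandwidth h, centred at y."""
    dim = len(x)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    norm = (2.0 * math.pi * h * h) ** (-dim / 2.0)
    return norm * math.exp(-sq_dist / (2.0 * h * h))


def tree_reduce(values):
    """Sum values by repeatedly adding neighbouring pairs (tree reduction)."""
    vals = list(values)
    if not vals:
        return 0.0
    while len(vals) > 1:
        # Each pass halves the number of partial sums, like one kernel launch
        # writing into a smaller temporary array.
        paired = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:
            paired.append(vals[-1])  # carry the unpaired element forward
        vals = paired
    return vals[0]


def kde(points, samples, h):
    """Estimate the density at each point as the mean kernel value over samples."""
    n = len(samples)
    return [
        tree_reduce(gaussian_kernel(x, s, h) for s in samples) / n
        for x in points
    ]


# Illustrative parameters (assumptions, not values from the repository).
random.seed(0)
n_dim = 3
samples = [[random.random() for _ in range(n_dim)] for _ in range(2000)]
estimate = kde([[0.5] * n_dim], samples, h=0.2)[0]
print(f"estimated density at the cube centre: {estimate:.3f}")
```

The true density inside the unit cube is 1, so the estimate at the centre should land near that value, slightly lowered by kernel mass that falls outside the cube.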

content/en/docs/kde-python.md

Lines changed: 9 additions & 7 deletions
@@ -5,7 +5,7 @@ date: 2024-07-02
 weight: 3
 ---

-Since SYCL builds on C++, we are going to use `pybind11` project to generate Python extension.
+Since SYCL builds on C++, we are going to use the `pybind11` project to generate a Python extension.
 We also need Python objects to carry USM allocations of input and output data, such as `dpctl` ([Data Parallel Control](https://github.com/IntelPython/dpctl.git) Python package). The `dpctl` package also provides Python objects corresponding to DPC++ runtime objects:

 | Python object | SYCL C++ object |
@@ -15,9 +15,9 @@ We also need Python objects to carry USM allocations of input and output data, s
 | ``dpctl.SyclContext`` | ``sycl::context`` |
 | ``dpctl.SyclEvent`` | ``sycl::event`` |

-`dpctl` provides integration with `pybind11` supporting castings between `dpctl` Python objects and corresponding C++ SYCL classes listed in the table above. Furthermore, the integration provides C++ class ``dpctl::tensor::usm_ndarray`` which derives from ``pybind11::object``.
-It stores `dpctl.tensor.usm_ndarray` object and provides methods to query its attributes, such as data pointer, dimensionality, shape, strides
-and elemental type information.
+`dpctl` provides integration with `pybind11` supporting castings between `dpctl` Python objects and corresponding C++ SYCL classes listed in the table above. Furthermore, the integration provides the C++ class ``dpctl::tensor::usm_ndarray`` which derives from ``pybind11::object``.
+It stores the `dpctl.tensor.usm_ndarray` object and provides methods to query its attributes, such as data pointer, dimensionality, shape, strides
+and elemental type information. Underlying `dpctl.tensor.usm_ndarray` is a SYCL unified shared memory (USM) allocation. See the [SYCL standard](https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:usm) or [dpctl.memory documentation](https://intelpython.github.io/dpctl/latest/api_reference/dpctl/memory.html#dpctl-memory-pyapi) for more details.

 For illustration purpose, here is a sample extension source code:

@@ -29,7 +29,9 @@ For illustration purpose, here is a sample extension source code:
 #include <vector>

 sycl::event
-py_foo(dpctl::tensor::usm_ndarray inp, dpctl::tensor::usm_ndarray out, const std::vector<sycl::event> &deps) {
+py_foo(dpctl::tensor::usm_ndarray inp,
+dpctl::tensor::usm_ndarray out,
+const std::vector<sycl::event> &deps) {
 // validation steps skipped

 // Execution queue is the queue associated with input arrays
@@ -98,12 +100,12 @@ of the host task a chance at execution.
 Of course, if USM memory is not managed by Python, it may be possible to avoid using GIL altogether.

 An example of Python extension `"kde_sycl_ext"` that exposes kernel density estimation code from previous
-section can be found in `"steps/sycl_python_extension"` folder (see [README](steps/sycl_python_extension/README.md)).
+section can be found in [`"steps/sycl_python_extension"`](https://github.com/IntelPython/example-portable-data-parallel-extensions/tree/main/steps/sycl_python_extension) folder (see [README](https://github.com/IntelPython/example-portable-data-parallel-extensions/blob/main/steps/sycl_python_extension/README.md)).

 The folder contains comparison between `dpctl`-based implementation of the KDE implementation following the NumPy
 implementation [above](#kde_numpy) and the dedicated C++ code:

-```
+```bash
 KDE for n_sample = 1000000, n_est = 17, n_dim = 7, h = 0.05
 Result agreed.
 kde_dpctl took 0.3404452269896865 seconds

content/en/docs/oneMKL.md

Lines changed: 32 additions & 4 deletions
@@ -5,9 +5,18 @@ date: 2024-07-02
 weight: 4
 ---

-Since `dpctl.tensor.usm_ndarray` is a Python object carrying a USM allocation, it is possible to write extensions which wrap `oneAPI Math Kernel Library Interfaces` ([oneMKL Interfaces](https://github.com/oneapi-src/oneMKL)) routines and then call them on the USM data underlying the `usm_ndarray` container from Python.
+Given a matrix \\(A\\), the QR decomposition of \\(A\\) is defined as the decomposition of \\(A\\) into the product of matrices \\(Q\\) and \\(R\\) such that \\(Q\\) is orthonormal and \\(R\\) is upper-triangular.
+
+QR factorization is a common routine in more optimized LAPACK libraries, so rather than write and implement an algorithm ourselves, it would be preferable to find a suitable library routine.
+
+Since `dpctl.tensor.usm_ndarray` is a Python object with an underlying USM allocation, it is possible to write extensions which wrap `oneAPI Math Kernel Library Interfaces` ([oneMKL Interfaces](https://github.com/oneapi-src/oneMKL)) USM routines and then call them on the `dpctl.tensor.usm_ndarray` from Python. These low-level routines can greatly improve the performance of an extension.
+
+`oneMKL Interfaces` can be built to dispatch to a variety of backends including `cuBLAS` and `rocBLAS` (see [oneMKL Interfaces README](https://github.com/oneapi-src/oneMKL?tab=readme-ov-file#oneapi-math-kernel-library-onemkl-interfaces)). The [`portBLAS`](https://github.com/codeplaysoftware/portBLAS) backend is also notable as it is open-source and written in pure SYCL.
+
+`oneMKL` routines are essentially wrappers for the same routine in an underlying backend library, depending on the targeted device. This means that the same code can be used for NVidia, AMD, and Intel devices, making it highly portable.
+
+Looking to the `oneMKL` documentation on [`geqrf`](https://spec.oneapi.io/versions/latest/elements/oneMKL/source/domains/lapack/geqrf.html#geqrf-usm-version):

-For an example routine from the `oneMKL` documentation, take [`geqrf`](https://spec.oneapi.io/versions/latest/elements/oneMKL/source/domains/lapack/geqrf.html#geqrf-usm-version):
 ```cpp
 namespace oneapi::mkl::lapack {
 cl::sycl::event geqrf(cl::sycl::queue &queue,
@@ -22,6 +31,25 @@ namespace oneapi::mkl::lapack {
 }
 ```

-The `pybind11` castings discussed in the previous section enable us to write a simple wrapper function for this routine with `dpctl::tensor::usm_ndarray` inputs and outputs, so long as we take the same precautions to avoid deadlocks. As a result, we can write the extension in much the same way as the `kde_sycl_ext` extension in the previous chapter.
+This general format (``sycl::queue``, arguments, and a vector of ``sycl::event``s) is more or less the same throughout the `oneMKL` USM routines.
+
+The `pybind11` castings discussed in the previous section enable us to write a simple wrapper function for this routine with ``dpctl::tensor::usm_ndarray`` inputs and outputs, so long as we take the same precautions to avoid deadlocks. As a result, we can write the extension in much the same way as the `"kde_sycl_ext"` extension in the previous chapter.

-An example of a Python extension "mkl_interface_ext" that uses `oneMKL` calls to implement a QR decomposition can be found in "steps/mkl_interface" folder (see [README](steps/mkl_interface/README.md)).
+An example of a Python extension `"mkl_interface_ext"` that uses `oneMKL` calls to implement a QR decomposition can be found in [`"steps/mkl_interface"`](https://github.com/IntelPython/example-portable-data-parallel-extensions/tree/main/steps/mkl_interface) folder (see [README](https://github.com/IntelPython/example-portable-data-parallel-extensions/blob/main/steps/mkl_interface/README.md)).
+
+The folder executes the tests found in [`"steps/mkl_interface/tests"`](https://github.com/IntelPython/example-portable-data-parallel-extensions/tree/main/steps/mkl_interface/tests) as well as running a larger benchmark which compares Numpy's `linalg.qr` (for reference) to the extension's implementation:
+
+```bash
+$ python run.py
+Using device NVIDIA GeForce GT 1030
+================================================= test session starts ==================================================
+collected 8 items
+
+tests/test_qr.py ........ [100%]
+
+================================================== 8 passed in 0.45s ===================================================
+QR decomposition for matrix of size = (3000, 3000)
+Result agreed.
+qr took 0.016026005148887634 seconds
+np.linalg.qr took 0.5165981948375702 seconds
+```
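
The QR definition added to `oneMKL.md` in this diff can be checked on a small matrix. The sketch below is a pure-Python illustration that assumes modified Gram-Schmidt as the factorization method. It is not the algorithm used by `geqrf` or by the `mkl_interface_ext` extension, and the helper names (`qr_gram_schmidt`, `matmul`, `transpose`, `max_abs_diff`) are hypothetical. It only shows the properties the text states: \\(A = QR\\), \\(Q\\) has orthonormal columns, and \\(R\\) is upper-triangular.

```python
import math


def qr_gram_schmidt(a):
    """Return (q, r) with a = q r, for a matrix with linearly independent columns."""
    m, n = len(a), len(a[0])
    cols = [[a[i][j] for i in range(m)] for j in range(n)]  # columns of a
    q_cols = []
    r = [[0.0] * n for _ in range(n)]  # entries below the diagonal stay zero
    for j in range(n):
        v = cols[j][:]
        # Remove the components along the already-built orthonormal columns.
        for k, qk in enumerate(q_cols):
            r[k][j] = sum(qi * vi for qi, vi in zip(qk, v))
            v = [vi - r[k][j] * qi for qi, vi in zip(qk, v)]
        r[j][j] = math.sqrt(sum(vi * vi for vi in v))
        q_cols.append([vi / r[j][j] for vi in v])
    q = [[q_cols[j][i] for j in range(n)] for i in range(m)]
    return q, r


def matmul(x, y):
    """Dense matrix product of two lists of rows."""
    return [
        [sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
        for i in range(len(x))
    ]


def transpose(x):
    return [list(row) for row in zip(*x)]


def max_abs_diff(x, y):
    return max(abs(p - s) for rx, ry in zip(x, y) for p, s in zip(rx, ry))


a = [[12.0, -51.0, 4.0], [6.0, 167.0, -68.0], [-4.0, 24.0, -41.0]]
q, r = qr_gram_schmidt(a)
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
print("reconstruction error:", max_abs_diff(matmul(q, r), a))
print("orthonormality error:", max_abs_diff(matmul(transpose(q), q), identity))
```

Both printed errors should be at the level of floating-point rounding. Library routines such as `geqrf` typically use Householder reflections instead, which behave better numerically on ill-conditioned inputs.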

layouts/404.html

Lines changed: 0 additions & 1 deletion
@@ -2,6 +2,5 @@
 <div class="td-content">
 <h1>Not found</h1>
 <p>Oops! This page doesn't exist. Try going back to the <a href="{{ "" | relURL }}">home page</a>.</p>
-<p>You can learn how to make a 404 page like this in <a href="https://gohugo.io/templates/404/">Custom 404 Pages</a>.</p>
 </div>
 {{- end }}
