Commit 281617c

Merge pull request #4 from IntelPython/fill-out-website
Fill in website
2 parents 087db17 + dcf959f commit 281617c

File tree

7 files changed: +363 / −124 lines


content/en/_index.md

Lines changed: 1 addition & 4 deletions

```diff
@@ -17,15 +17,12 @@ title: Portable Data-Parallel Python Extensions with oneAPI
 </div>
 <div class="lead text-center">
 <div class="mx-auto mb-5">
-<a class="btn btn-lg btn-secondary me-3 mb-4" href="https://github.com/google/docsy-example">
+<a class="btn btn-lg btn-secondary me-3 mb-4" href="https://IntelPython.github.io/portable-data-parallel-extensions-scipy-2024/docs/">
 First<i class="fa-solid fa-question ms-2 "></i>
 </a>
 <a class="btn btn-lg btn-secondary me-3 mb-4" href="https://github.com/google/docsy-example">
 Demonstration<i class="fab fa-github ms-2 "></i>
 </a>
-<a class="btn btn-lg btn-secondary me-3 mb-4" href="https://github.com/google/docsy-example">
-About<i class="fa-solid fa-address-card ms-2 "></i>
-</a>
 </div>
 </div>
 </div>
```

content/en/docs/first-app.md

Lines changed: 93 additions & 6 deletions
Original file line numberDiff line numberDiff line change
New page content (replacing the Docsy template placeholder):

---
title: First DPC++ app
description: A SYCL and DPC++ "Hello, World!" example.
date: 2024-07-02
weight: 2
---

For an in-depth introduction to SYCL and to programming accelerators, please refer to the open-access e-book "[Data Parallel C++](https://link.springer.com/book/10.1007/978-1-4842-9691-2)".

A SYCL application runs on a SYCL platform: a host connected to one or more heterogeneous devices. The application is structured in three scopes: application scope, command group scope, and kernel scope. The kernel scope specifies a single kernel function that is compiled by the device compiler and executed on the device. The command group scope specifies a unit of work, which includes the kernel function, the preparation of its arguments, and execution-ordering information. The application scope comprises all code outside of the command group scope. Execution of a SYCL application begins in the application scope.

```cpp
// Compile: icpx -fsycl first.cpp -o first
#include <iostream>
#include <sycl/sycl.hpp>

// Output the content of the array
void output_array(const int *data, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        std::cout << data[i] << " ";
    }
    std::cout << std::endl;
}

int main(void) {
    // queue to enqueue work to,
    // bound to the default-selected device
    sycl::queue q{sycl::default_selector_v};

    // device allocation
    size_t data_size = 256;
    int *data = sycl::malloc_device<int>(data_size, q);

    // submit a task to populate
    // the device allocation
    sycl::event e_fill =
        q.fill<int>(data, 42, data_size); // built-in kernel

    // submit a kernel to modify the device allocation
    sycl::event e_comp =
        q.submit([&](sycl::handler &cgh) { // command-group scope
            // order execution after
            // the fill task completes
            cgh.depends_on(e_fill);

            sycl::range<1> global_iter_range{data_size};
            cgh.parallel_for(
                global_iter_range,
                [=](sycl::item<1> it) { // kernel scope
                    int i = it.get_id(0);
                    data[i] += i;
                }
            );
        });

    // copy from device to host,
    // ordered after the modification task completes
    int *host_data = new int[data_size];

    q.copy<int>( // built-in kernel
        data, host_data, data_size, {e_comp}
    ).wait();
    sycl::free(data, q);

    // Output the content of the array
    output_array(host_data, data_size);
    delete[] host_data;

    return 0;
}
```

The device on which the kernel functions execute is controlled by a device selector function, in this example ``sycl::default_selector_v``. The default selector assigns a score to every device recognized by the runtime and selects the one with the highest score. A list of devices recognized by the DPC++ runtime can be obtained by running the ``sycl-ls`` command.
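
If relying on runtime scoring is undesirable, a queue can instead be bound to a specific device class using the standard SYCL 2020 selectors ``sycl::gpu_selector_v`` and ``sycl::cpu_selector_v``. A minimal sketch (not part of the example above):

```cpp
#include <iostream>
#include <sycl/sycl.hpp>

int main(void) {
    // request a GPU device explicitly;
    // construction throws if no GPU is available
    sycl::queue q{sycl::gpu_selector_v};

    std::cout << "Running on: "
              << q.get_device().get_info<sycl::info::device::name>()
              << std::endl;
    return 0;
}
```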

A user of a SYCL application compiled with DPC++ may restrict the set of devices discoverable by the runtime using the ``ONEAPI_DEVICE_SELECTOR`` environment variable. For example:

```bash
# execute on GPU
ONEAPI_DEVICE_SELECTOR=*:gpu ./first
# execute on CPU
ONEAPI_DEVICE_SELECTOR=*:cpu ./first
```

By default, the DPC++ compiler generates offload code for the [SPIR64](https://www.khronos.org/spir/) SYCL target, which is supported by Intel GPUs as well as by CPU devices of x86_64 architecture. An attempt to execute a SYCL program while selecting only devices that do not support the SPIR language results in an error.
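
For instance, on a hypothetical machine whose only offload device is an NVidia GPU exposed through the CUDA backend, restricting discovery to that backend leaves a SPIR64-only binary with no usable device:

```bash
# the binary built above contains only SPIR64 offload code,
# so limiting discovery to the CUDA backend results in an error
ONEAPI_DEVICE_SELECTOR=cuda:gpu ./first
```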

### Targeting other GPUs

DPC++ supports generating offload sections for multiple targets. For example, to compile for both the SPIR and NVPTX targets (oneAPI for NVidia(R) GPUs is assumed to be installed):

```bash
icpx -fsycl -fsycl-targets=nvptx64-nvidia-cuda,spir64-unknown-unknown first.cpp -o first.out
```

To compile for both the SPIR and AMD GCN targets (oneAPI for AMD GPUs is assumed to be installed):

```bash
icpx -fsycl -fsycl-targets=amdgcn-amd-amdhsa,spir64-unknown-unknown first.cpp -o first.out
```

It is possible to pass additional arguments to the backend of a specific SYCL target. For example, to target a specific architecture (a combined invocation is sketched below), use:

- ``-Xsycl-target-backend=amdgcn-amd-amdhsa --offload-arch=gfx1030`` for AMD GPUs
- ``-Xsycl-target-backend=nvptx64-nvidia-cuda --cuda-gpu-arch=sm_80`` for NVidia GPUs
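
Putting the two together, a combined invocation might look like the following sketch (assuming oneAPI for NVidia(R) GPUs is installed and an sm_80-class device is targeted):

```bash
icpx -fsycl -fsycl-targets=nvptx64-nvidia-cuda,spir64-unknown-unknown \
     -Xsycl-target-backend=nvptx64-nvidia-cuda --cuda-gpu-arch=sm_80 \
     first.cpp -o first.out
```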

content/en/docs/kde-cpp.md

Lines changed: 122 additions & 75 deletions
Original file line numberDiff line numberDiff line change
New page content (replacing the Docsy template placeholder):

---
title: KDE DPC++ example
description: KDE (kernel density estimation) example using SYCL and DPC++.
date: 2024-07-02
weight: 2
---

Given a sample of \\(n\\) observations \\(x_i\\) drawn from an unknown underlying continuous distribution \\(f(x)\\), the kernel density estimate of that density function is computed as follows, for some kernel smoothing parameter \\(h \in \mathbb{R}\\):

$$
\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h} K\left(\frac{x - x_i}{h}\right)
$$

An example of NumPy code performing the estimation, for a common choice of the kernel function as the standard \\(d\\)-dimensional Gaussian density:

<!-- See https://stackoverflow.com/questions/5319754/cross-reference-named-anchor-in-markdown //-->
<a id="kde_numpy" href=""></a>
```python
import numpy as np

def kde(poi: np.ndarray, sample: np.ndarray, h: float) -> np.ndarray:
    """Given a sample from an underlying continuous distribution and
    a smoothing parameter `h`, evaluate the density estimate at each
    point of interest `poi`.
    """
    assert sample.ndim == 2
    assert poi.ndim == 2
    m, d1 = poi.shape
    n, d2 = sample.shape
    assert d1 == d2
    assert h > 0
    dm = np.sum(np.square(poi[:, np.newaxis, ...] - sample[np.newaxis, ...]), axis=-1)
    return np.mean(np.exp(dm / (-2 * h * h)), axis=-1) / np.power(np.sqrt(2 * np.pi) * h, d1)
```
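
A quick illustration of calling the function above; the array shapes, the seed, and the value of `h` are arbitrary choices for this example:

```python
import numpy as np

rng = np.random.default_rng(1234)
sample = rng.standard_normal((1000, 3))    # n = 1000 observations in d = 3 dimensions
poi = rng.uniform(-1.0, 1.0, size=(5, 3))  # m = 5 points of interest
f_hat = kde(poi, sample, h=0.25)           # density estimates, shape (5,)
```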

The code above evaluates \\(\hat{f}\\) at the \\(m\\) points of interest \\(y_t\\):

$$
\hat{f}(y_t) = \frac{1}{n h^{d}} \sum_{i=1}^{n} K\left( \frac{\left\lVert y_t - x_i \right\rVert}{h} \right), \;\;\; \forall \, 0 \leq t < m
$$

Evaluating such an expression can be done in parallel: the evaluation is independent for each \\(t\\). Furthermore, the summation over \\(i\\) can be partitioned among work-items, each summing \\(n_{wi}\\) distinct terms. Such work partitioning generates \\(m \cdot \left\lceil {n}/{n_{wi}}\right\rceil\\) independent tasks. To avoid race conditions, each work-item can write its partial sum into a dedicated temporary memory location, for further summation by another kernel operating in a similar fashion.

```cpp
parallel_for(
    range<2>(m, ((n + n_wi - 1) / n_wi)),
    [=](sycl::item<2> it) {
        auto t = it.get_id(0);       // index of the point of interest
        auto i_block = it.get_id(1); // index of the block of n_wi summands

        // partial sum of the n_wi kernel terms assigned to this work-item
        T local_partial_sum = ...;

        partial_sums[t * ((n + n_wi - 1) / n_wi) + i_block] = local_partial_sum;
    }
);
```

Such an approach, known as tree reduction, is implemented in the ``kernel_density_estimation_temps`` function found in ``"steps/kernel_density_estimation_cpp/kde.hpp"``.

The use of a temporary allocation can be avoided altogether if each work-item atomically adds its local sum to the appropriate zero-initialized location in the output array, as in the implementation ``kernel_density_estimation_atomic_ref`` in the same header file:

```cpp
parallel_for(
    range<2>(m, ((n + n_wi - 1) / n_wi)),
    [=](sycl::item<2> it) {
        auto t = it.get_id(0);
        auto i_block = it.get_id(1);

        // partial sum of the n_wi kernel terms assigned to this work-item
        T local_partial_sum = ...;

        // atomically accumulate into the zero-initialized output array
        sycl::atomic_ref<...> f_aref(f[t]);
        f_aref += local_partial_sum;
    }
);
```
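
The template parameters of ``sycl::atomic_ref`` are elided as ``<...>`` in these sketches; a plausible instantiation, assuming ``T`` is a supported arithmetic type and ``f`` points to device USM memory, is:

```cpp
sycl::atomic_ref<T,
                 sycl::memory_order::relaxed,
                 sycl::memory_scope::device,
                 sycl::access::address_space::global_space>
    f_aref(f[t]);
```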

Multiple work-items concurrently updating the same location in global memory still produce the correct result thanks to ``sycl::atomic_ref``, but at the expense of an increased number of update attempts, a phenomenon known as atomic pressure. Atomic pressure leads to thread divergence and degrades performance.

To reduce the atomic pressure, work-items can be organized into work-groups. Every work-item in a work-group has access to local shared memory, a dedicated on-chip memory, which can be used to cooperatively combine values held by work-items in the work-group without accessing global memory. This can be done efficiently by calling the group function ``sycl::reduce_over_group``. To be able to call it, we must specify the iteration range using ``sycl::nd_range`` rather than ``sycl::range`` as we did earlier.

```cpp
auto wg = 256; // work-group size
auto n_data_per_wg = n_wi * wg;
auto n_groups = ((n + n_data_per_wg - 1) / n_data_per_wg);

range<2> gRange(m, n_groups * wg);
range<2> lRange(1, wg);

parallel_for(
    nd_range<2>(gRange, lRange),
    [=](sycl::nd_item<2> it) {
        auto t = it.get_global_id(0);

        // partial sum of the n_wi kernel terms assigned to this work-item
        T local_partial_sum = ...;

        // cooperatively combine the partial sums across the work-group
        auto work_group = it.get_group();
        T sum_over_wg = sycl::reduce_over_group(
            work_group, local_partial_sum, sycl::plus<>());

        // only the group leader touches global memory,
        // reducing atomic pressure by a factor of wg
        if (work_group.leader()) {
            sycl::atomic_ref<...> f_aref(f[t]);
            f_aref += sum_over_wg;
        }
    }
);
```

The complete implementation can be found in the ``kernel_density_estimation_work_group_reduce_and_atomic_ref`` function in ``"steps/kernel_density_estimation_cpp/kde.hpp"``.

These implementations are called from the C++ application ``"steps/kernel_density_estimation_cpp/app.cpp"``, which samples data uniformly distributed over the unit cuboid and estimates the density using kernel density estimation with a spherically symmetric multivariate Gaussian probability density function as the kernel.

The application can be built using `CMake` or `Meson`; please refer to the [README](steps/kernel_density_estimation_cpp/README.md) document in that folder. A sketch of a typical invocation follows.
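
A CMake-based build might look like this; the exact options and target names are defined by the project's own build files, so treat these commands as illustrative:

```bash
cd steps/kernel_density_estimation_cpp
# configure with the DPC++ compiler, then build
CXX=icpx cmake -B build .
cmake --build build
```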
