|
| 1 | +# SYCL_EXT_ONEAPI_MAX_WORK_GROUP_QUERY |
| 2 | + |
| 3 | +## Notice |
| 4 | + |
| 5 | +This document describes an **experimental** API that applications can use to try |
| 6 | +out a new feature. Future versions of this API may change in ways that are |
| 7 | +incompatible with this experimental version. |
| 8 | + |
| 9 | + |
| 10 | +## Introduction |
| 11 | + |
| 12 | +This extension adds two new kinds of device information descriptors. They let an application query a device for the maximum number of work-groups that can be submitted in each dimension, as well as globally (across all dimensions). |
| 13 | + |
| 14 | +OpenCL never offered such a query, which is probably why it is absent from SYCL. Now that SYCL supports back-ends where the maximum number of work-groups can differ in each dimension, the ability to query that limit is crucial for writing safe and portable code. |
| 15 | + |
| 16 | +## Feature test macro |
| 17 | + |
| 18 | +As encouraged by the SYCL specification, a feature-test macro, `SYCL_EXT_ONEAPI_MAX_WORK_GROUP_QUERY`, is provided to determine whether this extension is implemented. |
| 19 | + |
| 20 | +## New device descriptors |
| 21 | + |
| 22 | +| Device descriptors | Return type | Description | |
| 23 | +| ------------------------------------------------------ | ----------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | |
| 24 | +| info::device::ext_oneapi_max_work_groups_1d | id<1> | Returns the maximum number of work-groups that can be submitted in each dimension of the `globalSize` of an `nd_range<1>`. The minimum value is `(1)` if the device is not of type `info::device_type::custom`. | |
| 25 | +| info::device::ext_oneapi_max_work_groups_2d | id<2> | Returns the maximum number of work-groups that can be submitted in each dimension of the `globalSize` of an `nd_range<2>`. The minimum value is `(1, 1)` if the device is not of type `info::device_type::custom`. | |
| 26 | +| info::device::ext_oneapi_max_work_groups_3d | id<3> | Returns the maximum number of work-groups that can be submitted in each dimension of the `globalSize` of an `nd_range<3>`. The minimum value is `(1, 1, 1)` if the device is not of type `info::device_type::custom`. | |
| 27 | +| info::device::ext_oneapi_max_global_work_groups | size_t | Returns the maximum number of work-groups that can be submitted across all the dimensions. The minimum value is `1`. | |
| 28 | + |
| 29 | +### Note |
| 30 | + |
| 31 | +- The returned values have the same ordering as the `nd_range` arguments. |
| 32 | +- The implementation does not guarantee that all the per-dimension maxima returned by `ext_oneapi_max_work_groups_[1,2,3]d` can be used at the same time. The user should therefore also check that the total number of work-groups across all dimensions does not exceed the global maximum returned by `ext_oneapi_max_global_work_groups`. |
| 33 | + |
| 34 | +## Examples |
| 35 | + |
| 36 | +```c++ |
| 37 | +sycl::device gpu = sycl::device{sycl::gpu_selector{}}; |
| 38 | +std::cout << gpu.get_info<sycl::info::device::name>() << '\n'; |
| 39 | + |
| 40 | +#ifdef SYCL_EXT_ONEAPI_MAX_WORK_GROUP_QUERY |
| 41 | +sycl::id<3> groups = gpu.get_info<sycl::info::device::ext_oneapi_max_work_groups_3d>(); |
| 42 | +size_t global_groups = gpu.get_info<sycl::info::device::ext_oneapi_max_global_work_groups>(); |
| 43 | +std::cout << "Max number groups: x_max: " << groups[2] << " y_max: " << groups[1] << " z_max: " << groups[0] << '\n'; |
| 44 | +std::cout << "Max global number groups: " << global_groups << '\n'; |
| 45 | +#endif |
| 46 | +``` |
| 47 | + |
| 48 | +Outputs to the console: |
| 49 | + |
| 50 | +``` |
| 51 | +NVIDIA ... |
| 52 | +Max number groups: x_max: 2147483647 y_max: 65535 z_max: 65535 |
| 53 | +Max global number groups: 2147483647 |
| 54 | +``` |
| 55 | + |
| 56 | +See: [CUDA Toolkit Documentation](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities) |
| 57 | + |
| 58 | +The following assertions should then hold at kernel submission: |
| 59 | + |
| 60 | +```c++ |
| 61 | +sycl::nd_range<3> work_range(global_size, local_size); |
| 62 | + |
| 63 | +assert(global_size[2] <= groups[2] |
| 64 | +    && global_size[1] <= groups[1] |
| 65 | +    && global_size[0] <= groups[0]); |
| 66 | + |
| 67 | +assert(global_size[2] * global_size[1] * global_size[0] <= global_groups); // Take care that the multiplication does not overflow its integer type. |
| 68 | + |
| 69 | +gpu_queue.parallel_for(work_range, ...); |
| 70 | +``` |
| 71 | + |
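The product in the second assertion can overflow before the comparison is made. One way to avoid that, sketched here in plain C++ with no SYCL dependency, is to divide the global budget by each requested count instead of multiplying the counts together. The helper name `fits_work_group_limits` and its parameters are hypothetical and not part of the extension. It assumes the caller passes the number of work-groups requested in each dimension, together with the values obtained from `ext_oneapi_max_work_groups_3d` and `ext_oneapi_max_global_work_groups`.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper (not part of the extension): returns true when the
// requested work-group counts respect both the per-dimension limits and the
// global limit, without ever forming a product that could overflow.
bool fits_work_group_limits(const std::array<std::size_t, 3>& requested,
                            const std::array<std::size_t, 3>& max_per_dim,
                            std::size_t max_global) {
  std::size_t remaining = max_global;  // budget left for the other dimensions
  for (std::size_t i = 0; i < 3; ++i) {
    // A count of zero is treated as invalid in this sketch.
    if (requested[i] == 0 || requested[i] > max_per_dim[i]) return false;
    // For positive integers, the product of the counts is <= max_global
    // exactly when successive floor divisions of the budget stay >= 1.
    remaining /= requested[i];
    if (remaining == 0) return false;
  }
  return true;
}
```

With the limits printed above, requesting every per-dimension maximum at once is rejected by this check, while requesting only the largest single-dimension maximum is accepted.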
| 72 | +## Implementation |
| 73 | + |
| 74 | +### Templated queries |
| 75 | + |
| 76 | +DPC++ does not currently support templated device descriptors as defined in section 4.6.4.2 "Device information descriptors" of the SYCL specification. Once the implementation supports this syntax, `ext_oneapi_max_work_groups_[1,2,3]d` should be replaced by the templated form `ext_oneapi_max_work_groups<[1,2,3]>`. |
| 77 | +### Consistency with existing checks |
| 78 | + |
| 79 | +When enqueuing a kernel, the implementation already checks that the global and per-dimension work-group counts do not exceed `std::numeric_limits<int>::max()`. This check is implemented in `sycl/include/CL/sycl/handler.hpp`. For consistency, the values returned by the two device descriptors are bounded by this limit. |
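
As an illustration of that bound, the following minimal sketch clamps a limit reported by a back-end to the largest value accepted at submission. The function name `clamp_to_submission_limit` is hypothetical and this is not the DPC++ implementation. It only shows the arithmetic, assuming the back-end limit is available as a `std::size_t`.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>

// Hypothetical helper: clamps a back-end-reported limit to the largest value
// accepted at kernel submission, taken here to be the maximum value of int.
constexpr std::size_t clamp_to_submission_limit(std::size_t backend_limit) {
  constexpr std::size_t submission_limit =
      static_cast<std::size_t>(std::numeric_limits<int>::max());
  return std::min(backend_limit, submission_limit);
}
```

A value such as 65535 passes through unchanged, while any value above the submission limit is reduced to it.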
| 80 | + |
| 81 | +### Example of returned values |
| 82 | + |
| 83 | +- Host device or OpenCL back-end: no back-end limit applies, so the returned values are the maximum values accepted at kernel submission (see `sycl/include/CL/sycl/handler.hpp`), currently `std::numeric_limits<int>::max()`. |
| 84 | +- CUDA: Back-end query using `CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_[X,Y,Z]`. |