
Commit 3323da6

[SYCL] Improve range reduction performance on CPU (#6164)

The performance improvement is the result of two complementary changes:

- Using an alternative heuristic to select the work-group size on the CPU. Keeping work-groups small simplifies the combination of partial results and reduces the number of temporary variables.
- Adjusting the mapping of the range to an ND-range. Breaking the range into contiguous chunks that are assigned to each work-group results in streaming patterns that are better suited to prefetching hardware.

Signed-off-by: John Pennycook <john.pennycook@intel.com>
1 parent 991e3fa commit 3323da6

File tree

9 files changed: +366 −22 lines changed

sycl/doc/EnvironmentVariables.md
Lines changed: 27 additions & 0 deletions

```diff
@@ -24,6 +24,7 @@ compiler and runtime.
 | `SYCL_RT_WARNING_LEVEL` | Positive integer | The higher warning level is used the more warnings and performance hints the runtime library may print. Default value is '0', which means no warning/hint messages from the runtime library are allowed. The value '1' enables performance warnings from device runtime/codegen. The values greater than 1 are reserved for future use. |
 | `SYCL_USM_HOSTPTR_IMPORT` | Integer | Enable by specifying non-zero value. Buffers created with a host pointer will result in host data promotion to USM, improving data transfer performance. To use this feature, also set SYCL_HOST_UNIFIED_MEMORY=1. |
 | `SYCL_EAGER_INIT` | Integer | Enable by specifying non-zero value. Tells the SYCL runtime to do as much as possible initialization at objects construction as opposed to doing lazy initialization on the fly. This may mean doing some redundant work at warmup but ensures fastest possible execution on the following hot and reportable paths. It also instructs PI plugins to do the same. Default is "0". |
+| `SYCL_REDUCTION_PREFERRED_WORKGROUP_SIZE` | See [below](#sycl_reduction_preferred_workgroup_size) | Controls the preferred work-group size of reductions. |

 `(*) Note: Any means this environment variable is effective when set to any non-null value.`

@@ -60,6 +61,32 @@ Assuming a filter has all three elements of the triple, it selects only those de

 Note that all device selectors will throw an exception if the filtered list of devices does not include a device that satisfies the selector. For instance, `SYCL_DEVICE_FILTER=cpu,level_zero` will cause `host_selector()` to throw an exception. `SYCL_DEVICE_FILTER` also limits loading only specified plugins into the SYCL RT. In particular, `SYCL_DEVICE_FILTER=level_zero` will cause the `cpu_selector` to throw an exception since SYCL RT will only load the `level_zero` backend which does not support any CPU devices at this time. When multiple devices satisfy the filter (e.g., `SYCL_DEVICE_FILTER=gpu`), only one of them will be selected.

+## `SYCL_REDUCTION_PREFERRED_WORKGROUP_SIZE`
+
+This environment variable controls the preferred work-group size for reductions on specified device types. Setting this will affect all reductions without an explicitly specified work-group size on devices of types in the value of the environment variable.
+
+The value of this environment variable is a comma separated list of one or more configurations, where each configuration is a pair of the form "`device_type`:`size`" (without the quotes). Possible values of `device_type` are:
+- `cpu`
+- `gpu`
+- `acc`
+- `*`
+
+`size` is a positive integer larger than 0.
+
+For a configuration `device_type`:`size` the `device_type` element specifies the type of device the configuration applies to, that is `cpu` is for CPU devices, `gpu` is for GPU devices, and `acc` is for accelerator devices. If `device_type` is `*` the configuration applies to all applicable device types. `size` denotes the preferred work-group size to be used for devices of types specified by `device_type`.
+
+If `info::device::max_work_group_size` on a device on which a reduction is being enqueued is less than the value specified by a configuration in this environment variable, the value of `info::device::max_work_group_size` on that device is used instead.
+
+A `sycl::exception` with `sycl::errc::invalid` is thrown during submission of a reduction kernel in the following cases:
+- If the specified device type in any configuration is not one of the valid values.
+- If the specified preferred work-group size in any configuration is not a valid integer.
+- If the specified preferred work-group size in any configuration is not an integer value larger than 0.
+- If any configuration does not have the `:` delimiter.
+
+If this environment variable is not set, the preferred work-group size for reductions is implementation defined.
+
+Note that conflicting configuration tuples in the same list will favor the last entry. For example, a list `cpu:32,gpu:32,cpu:16` will set the preferred work-group size of reductions to 32 for GPUs and 16 for CPUs. This also applies to `*`, for example `cpu:32,*:16` sets the preferred work-group size of reductions on all devices to 16, while `*:16,cpu:32` sets the preferred work-group size of reductions to 32 on CPUs and to 16 on all other devices.

 ## Controlling DPC++ Level Zero Plugin

 | Environment variable | Values | Description |
```

sycl/include/sycl/ext/oneapi/reduction.hpp
Lines changed: 27 additions & 9 deletions

```diff
@@ -147,6 +147,8 @@ __SYCL_EXPORT size_t reduGetMaxWGSize(std::shared_ptr<queue_impl> Queue,
                                       size_t LocalMemBytesPerWorkItem);
 __SYCL_EXPORT size_t reduComputeWGSize(size_t NWorkItems, size_t MaxWGSize,
                                        size_t &NWorkGroups);
+__SYCL_EXPORT size_t reduGetPreferredWGSize(std::shared_ptr<queue_impl> &Queue,
+                                            size_t LocalMemBytesPerWorkItem);

 /// Class that is used to represent objects that are passed to user's lambda
 /// functions and representing users' reduction variable.
@@ -890,16 +892,28 @@ using __sycl_reduction_kernel =
     sycl::detail::auto_name, Namer<KernelName, Ts...>>;

 /// Called in device code. This function iterates through the index space
-/// \p Range using stride equal to the global range specified in \p NdId,
+/// \p Range by assigning contiguous chunks to each work-group, then iterating
+/// through each chunk using a stride equal to the work-group's local range,
 /// which gives much better performance than using stride equal to 1.
 /// For each of the index the given \p F function/functor is called and
 /// the reduction value hold in \p Reducer is accumulated in those calls.
 template <typename KernelFunc, int Dims, typename ReducerT>
-void reductionLoop(const range<Dims> &Range, ReducerT &Reducer,
-                   const nd_item<1> &NdId, KernelFunc &F) {
-  size_t Start = NdId.get_global_id(0);
-  size_t End = Range.size();
-  size_t Stride = NdId.get_global_range(0);
+void reductionLoop(const range<Dims> &Range, const size_t PerGroup,
+                   ReducerT &Reducer, const nd_item<1> &NdId, KernelFunc &F) {
+  // Divide into contiguous chunks and assign each chunk to a Group
+  // Rely on precomputed division to avoid repeating expensive operations
+  // TODO: Some devices may prefer alternative remainder handling
+  auto Group = NdId.get_group();
+  size_t GroupId = Group.get_group_linear_id();
+  size_t NumGroups = Group.get_group_linear_range();
+  bool LastGroup = (GroupId == NumGroups - 1);
+  size_t GroupStart = GroupId * PerGroup;
+  size_t GroupEnd = LastGroup ? Range.size() : (GroupStart + PerGroup);
+
+  // Loop over the contiguous chunk
+  size_t Start = GroupStart + NdId.get_local_id(0);
+  size_t End = GroupEnd;
+  size_t Stride = NdId.get_local_range(0);
   for (size_t I = Start; I < End; I += Stride)
     F(sycl::detail::getDelinearizedId(Range, I), Reducer);
 }
@@ -919,10 +933,12 @@ bool reduCGFuncForRangeFastAtomics(handler &CGH, KernelType KernelFunc,
   auto GroupSum = Reduction::getReadWriteLocalAcc(NElements, CGH);
   using Name = __sycl_reduction_kernel<reduction::main_krn::RangeFastAtomics,
                                        KernelName>;
+  size_t NWorkGroups = NDRange.get_group_range().size();
+  size_t PerGroup = Range.size() / NWorkGroups;
   CGH.parallel_for<Name>(NDRange, [=](nd_item<1> NDId) {
     // Call user's functions. Reducer.MValue gets initialized there.
     typename Reduction::reducer_type Reducer;
-    reductionLoop(Range, Reducer, NDId, KernelFunc);
+    reductionLoop(Range, PerGroup, Reducer, NDId, KernelFunc);

     // Work-group cooperates to initialize multiple reduction variables
     auto LID = NDId.get_local_id(0);
@@ -987,10 +1003,11 @@ bool reduCGFuncForRangeFastReduce(handler &CGH, KernelType KernelFunc,

   using Name =
       __sycl_reduction_kernel<reduction::main_krn::RangeFastReduce, KernelName>;
+  size_t PerGroup = Range.size() / NWorkGroups;
   CGH.parallel_for<Name>(NDRange, [=](nd_item<1> NDId) {
     // Call user's functions. Reducer.MValue gets initialized there.
     typename Reduction::reducer_type Reducer;
-    reductionLoop(Range, Reducer, NDId, KernelFunc);
+    reductionLoop(Range, PerGroup, Reducer, NDId, KernelFunc);

     typename Reduction::binary_operation BOp;
     auto Group = NDId.get_group();
@@ -1081,10 +1098,11 @@ bool reduCGFuncForRangeBasic(handler &CGH, KernelType KernelFunc,
   auto BOp = Redu.getBinaryOperation();
   using Name =
       __sycl_reduction_kernel<reduction::main_krn::RangeBasic, KernelName>;
+  size_t PerGroup = Range.size() / NWorkGroups;
   CGH.parallel_for<Name>(NDRange, [=](nd_item<1> NDId) {
     // Call user's functions. Reducer.MValue gets initialized there.
     typename Reduction::reducer_type Reducer(Identity, BOp);
-    reductionLoop(Range, Reducer, NDId, KernelFunc);
+    reductionLoop(Range, PerGroup, Reducer, NDId, KernelFunc);

     // If there are multiple values, reduce each separately
     // This prevents local memory from scaling with elements
```

sycl/include/sycl/handler.hpp
Lines changed: 7 additions & 4 deletions

```diff
@@ -303,6 +303,9 @@ reduGetMaxNumConcurrentWorkGroups(std::shared_ptr<queue_impl> Queue);
 __SYCL_EXPORT size_t reduGetMaxWGSize(std::shared_ptr<queue_impl> Queue,
                                       size_t LocalMemBytesPerWorkItem);

+__SYCL_EXPORT size_t reduGetPreferredWGSize(std::shared_ptr<queue_impl> &Queue,
+                                            size_t LocalMemBytesPerWorkItem);
+
 template <typename... ReductionT, size_t... Is>
 size_t reduGetMemPerWorkItem(std::tuple<ReductionT...> &ReduTuple,
                              std::index_sequence<Is...>);
@@ -1618,13 +1621,13 @@ class __SYCL_EXPORT handler {
 #else
         ext::oneapi::detail::reduGetMaxNumConcurrentWorkGroups(MQueue);
 #endif
-    // TODO: currently the maximal work group size is determined for the given
+    // TODO: currently the preferred work group size is determined for the given
     // queue/device, while it is safer to use queries to the kernel pre-compiled
     // for the device.
-    size_t MaxWGSize =
-        ext::oneapi::detail::reduGetMaxWGSize(MQueue, OneElemSize);
+    size_t PrefWGSize =
+        ext::oneapi::detail::reduGetPreferredWGSize(MQueue, OneElemSize);
     if (ext::oneapi::detail::reduCGFuncForRange<KernelName>(
-            *this, KernelFunc, Range, MaxWGSize, NumConcurrentWorkGroups,
+            *this, KernelFunc, Range, PrefWGSize, NumConcurrentWorkGroups,
             Redu)) {
       this->finalize();
       MLastEvent = withAuxHandler(QueueCopy, [&](handler &CopyHandler) {
```

sycl/source/detail/config.def
Lines changed: 1 addition & 0 deletions

```diff
@@ -38,3 +38,4 @@ CONFIG(INTEL_ENABLE_OFFLOAD_ANNOTATIONS, 1, __SYCL_INTEL_ENABLE_OFFLOAD_ANNOTATI
 CONFIG(SYCL_ENABLE_DEFAULT_CONTEXTS, 1, __SYCL_ENABLE_DEFAULT_CONTEXTS)
 CONFIG(SYCL_QUEUE_THREAD_POOL_SIZE, 4, __SYCL_QUEUE_THREAD_POOL_SIZE)
 CONFIG(SYCL_RT_WARNING_LEVEL, 4, __SYCL_RT_WARNING_LEVEL)
+CONFIG(SYCL_REDUCTION_PREFERRED_WORKGROUP_SIZE, 16, __SYCL_REDUCTION_PREFERRED_WORKGROUP_SIZE)
```

sycl/source/detail/config.hpp
Lines changed: 126 additions & 0 deletions

```diff
@@ -102,6 +102,11 @@ template <ConfigID Config> class SYCLConfigBase;
 #include "config.def"
 #undef CONFIG

+#define INVALID_CONFIG_EXCEPTION(BASE, MSG)                                    \
+  sycl::exception(sycl::make_error_code(sycl::errc::invalid),                  \
+                  "Invalid value for " + std::string{BASE::MConfigName} +      \
+                      " environment variable: " + MSG)
+
 template <ConfigID Config> class SYCLConfig {
   using BaseT = SYCLConfigBase<Config>;

@@ -467,6 +472,127 @@ template <> class SYCLConfig<SYCL_CACHE_DIR> {
   }
 };

+template <> class SYCLConfig<SYCL_REDUCTION_PREFERRED_WORKGROUP_SIZE> {
+  using BaseT = SYCLConfigBase<SYCL_REDUCTION_PREFERRED_WORKGROUP_SIZE>;
+
+  struct ParsedValue {
+    size_t CPU = 0;
+    size_t GPU = 0;
+    size_t Accelerator = 0;
+  };
+
+public:
+  static size_t get(info::device_type DeviceType) {
+    ParsedValue Value = getCachedValue();
+    return getRefByDeviceType(Value, DeviceType);
+  }
+
+  static void reset() { (void)getCachedValue(/*ResetCache=*/true); }
+
+  static const char *getName() { return BaseT::MConfigName; }
+
+private:
+  static size_t &getRefByDeviceType(ParsedValue &Value,
+                                    info::device_type DeviceType) {
+    switch (DeviceType) {
+    case info::device_type::cpu:
+      return Value.CPU;
+    case info::device_type::gpu:
+      return Value.GPU;
+    case info::device_type::accelerator:
+      return Value.Accelerator;
+    default:
+      // Expect to get here if user used wrong device type. Include wildcard
+      // in the message even though it's handled in the caller.
+      throw INVALID_CONFIG_EXCEPTION(
+          BaseT, "Device types must be \"cpu\", \"gpu\", \"acc\", or \"*\".");
+    }
+  }
+
+  static ParsedValue parseValue() {
+    const char *ValueRaw = BaseT::getRawValue();
+    ParsedValue Result{};
+
+    // Default to 0 to signify an unset value.
+    if (!ValueRaw)
+      return Result;
+
+    std::string ValueStr{ValueRaw};
+    auto DeviceTypeMap = getSyclDeviceTypeMap();
+
+    // Iterate over all configurations.
+    size_t Start = 0, End = 0;
+    do {
+      End = ValueStr.find(',', Start);
+      if (End == std::string::npos)
+        End = ValueStr.size();
+
+      // Get a substring of the current configuration pair.
+      std::string DeviceConfigStr = ValueStr.substr(Start, End - Start);
+
+      // Find the delimiter in the configuration pair.
+      size_t ConfigDelimLoc = DeviceConfigStr.find(':');
+      if (ConfigDelimLoc == std::string::npos)
+        throw INVALID_CONFIG_EXCEPTION(
+            BaseT, "Device-value pair \"" + DeviceConfigStr +
+                       "\" does not contain the ':' delimiter.");
+
+      // Split configuration pair into its constituents.
+      std::string DeviceConfigTypeStr =
+          DeviceConfigStr.substr(0, ConfigDelimLoc);
+      std::string DeviceConfigValueStr = DeviceConfigStr.substr(
+          ConfigDelimLoc + 1, DeviceConfigStr.size() - ConfigDelimLoc - 1);
+
+      // Find the device type in the "device type map".
+      auto DeviceTypeIter = std::find_if(
+          std::begin(DeviceTypeMap), std::end(DeviceTypeMap),
+          [&](auto Element) { return DeviceConfigTypeStr == Element.first; });
+      if (DeviceTypeIter == DeviceTypeMap.end())
+        throw INVALID_CONFIG_EXCEPTION(
+            BaseT,
+            "\"" + DeviceConfigTypeStr + "\" is not a recognized device type.");
+
+      // Parse the configuration value.
+      int DeviceConfigValue = 1;
+      try {
+        DeviceConfigValue = std::stoi(DeviceConfigValueStr);
+      } catch (...) {
+        throw INVALID_CONFIG_EXCEPTION(
+            BaseT, "Value \"" + DeviceConfigValueStr + "\" must be a number");
+      }
+
+      if (DeviceConfigValue < 1)
+        throw INVALID_CONFIG_EXCEPTION(BaseT,
+                                       "Value \"" + DeviceConfigValueStr +
+                                           "\" must be larger than zero");
+
+      if (DeviceTypeIter->second == info::device_type::all) {
+        // Set all configuration values if we got the device-type wildcard.
+        Result.GPU = DeviceConfigValue;
+        Result.CPU = DeviceConfigValue;
+        Result.Accelerator = DeviceConfigValue;
+      } else {
+        // Try setting the corresponding configuration.
+        getRefByDeviceType(Result, DeviceTypeIter->second) = DeviceConfigValue;
+      }
+
+      // Move to the start of the next configuration. If the start is outside
+      // the full value string we are done.
+      Start = End + 1;
+    } while (Start < ValueStr.size());
+    return Result;
+  }
+
+  static ParsedValue getCachedValue(bool ResetCache = false) {
+    static ParsedValue Val = parseValue();
+    if (ResetCache)
+      Val = parseValue();
+    return Val;
+  }
+};
+
+#undef INVALID_CONFIG_EXCEPTION
+
 } // namespace detail
 } // __SYCL_INLINE_VER_NAMESPACE(_V1)
 } // namespace sycl
```

sycl/source/detail/reduction.cpp
Lines changed: 44 additions & 9 deletions

```diff
@@ -6,6 +6,7 @@
 //
 //===----------------------------------------------------------------------===//

+#include <detail/config.hpp>
 #include <detail/queue_impl.hpp>
 #include <sycl/ext/oneapi/reduction.hpp>

@@ -67,6 +68,7 @@ reduGetMaxWGSize(std::shared_ptr<sycl::detail::queue_impl> Queue,
                  size_t LocalMemBytesPerWorkItem) {
   device Dev = Queue->get_device();
   size_t MaxWGSize = Dev.get_info<info::device::max_work_group_size>();
+
   size_t WGSizePerMem = MaxWGSize * 2;
   size_t WGSize = MaxWGSize;
   if (LocalMemBytesPerWorkItem != 0) {
@@ -93,21 +95,54 @@ reduGetMaxWGSize(std::shared_ptr<sycl::detail::queue_impl> Queue,
   // the local memory assigned to one work-group by code in another work-group.
   // It seems the only good solution for this work-group detection problem is
   // kernel precompilation and querying the kernel properties.
-  if (WGSize >= 4) {
+  if (WGSize >= 4 && WGSizePerMem < MaxWGSize * 2) {
     // Let's return a twice smaller number, but... do that only if the kernel
-    // is limited by memory, or the kernel uses opencl:cpu backend, which
-    // surprisingly uses lots of resources to run the kernels with reductions
-    // and often causes CL_OUT_OF_RESOURCES error even when reduction
-    // does not use local accessors.
-    if (WGSizePerMem < MaxWGSize * 2 ||
-        (Queue->get_device().is_cpu() &&
-         Queue->get_device().get_platform().get_backend() == backend::opencl))
-      WGSize /= 2;
+    // is limited by memory.
+    WGSize /= 2;
   }

   return WGSize;
 }

+__SYCL_EXPORT size_t reduGetPreferredWGSize(std::shared_ptr<queue_impl> &Queue,
+                                            size_t LocalMemBytesPerWorkItem) {
+  device Dev = Queue->get_device();
+
+  // The maximum WGSize returned by CPU devices is very large and does not
+  // help the reduction implementation: since all work associated with a
+  // work-group is typically assigned to one CPU thread, selecting a large
+  // work-group size unnecessarily increases the number of accumulators.
+  // The default of 16 was chosen based on empirical benchmarking results;
+  // an environment variable is provided to allow users to override this
+  // behavior.
+  using PrefWGConfig = sycl::detail::SYCLConfig<
+      sycl::detail::SYCL_REDUCTION_PREFERRED_WORKGROUP_SIZE>;
+  if (Dev.is_cpu()) {
+    size_t CPUMaxWGSize = PrefWGConfig::get(info::device_type::cpu);
+    if (CPUMaxWGSize == 0)
+      return 16;
+    size_t DevMaxWGSize = Dev.get_info<info::device::max_work_group_size>();
+    return std::min(CPUMaxWGSize, DevMaxWGSize);
+  }
+
+  // If the user has specified an explicit preferred work-group size we use
+  // that.
+  if (Dev.is_gpu() && PrefWGConfig::get(info::device_type::gpu)) {
+    size_t DevMaxWGSize = Dev.get_info<info::device::max_work_group_size>();
+    return std::min(PrefWGConfig::get(info::device_type::gpu), DevMaxWGSize);
+  }
+
+  if (Dev.is_accelerator() &&
+      PrefWGConfig::get(info::device_type::accelerator)) {
+    size_t DevMaxWGSize = Dev.get_info<info::device::max_work_group_size>();
+    return std::min(PrefWGConfig::get(info::device_type::accelerator),
+                    DevMaxWGSize);
+  }
+
+  // Use the maximum work-group size otherwise.
+  return reduGetMaxWGSize(Queue, LocalMemBytesPerWorkItem);
+}
+
 } // namespace detail
 } // namespace oneapi
 } // namespace ext
```

sycl/test/abi/sycl_symbols_linux.dump
Lines changed: 1 addition & 0 deletions

```diff
@@ -3666,6 +3666,7 @@ _ZN4sycl3_V13ext6oneapi15filter_selectorC1ERKNSt7__cxx1112basic_stringIcSt11char
 _ZN4sycl3_V13ext6oneapi15filter_selectorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
 _ZN4sycl3_V13ext6oneapi6detail16reduGetMaxWGSizeESt10shared_ptrINS0_6detail10queue_implEEm
 _ZN4sycl3_V13ext6oneapi6detail17reduComputeWGSizeEmmRm
+_ZN4sycl3_V13ext6oneapi6detail22reduGetPreferredWGSizeERSt10shared_ptrINS0_6detail10queue_implEEm
 _ZN4sycl3_V13ext6oneapi6detail33reduGetMaxNumConcurrentWorkGroupsESt10shared_ptrINS0_6detail10queue_implEE
 _ZN4sycl3_V14freeEPvRKNS0_5queueERKNS0_6detail13code_locationE
 _ZN4sycl3_V14freeEPvRKNS0_7contextERKNS0_6detail13code_locationE
```

sycl/unittests/config/CMakeLists.txt
Lines changed: 1 addition & 0 deletions

```diff
@@ -2,4 +2,5 @@ set(CMAKE_CXX_EXTENSIONS OFF)

 add_sycl_unittest(ConfigTests OBJECT
   ConfigTests.cpp
+  PreferredWGSizeConfigTests.cpp
 )
```
