Add vectorisation support (AVX, OpenMP SIMD) #827
Overview
This PR introduces improved vectorisation in performance-critical collision step routines (e.g., CalculateDensityAndMomentum, CalculateFeq). Two approaches were added: compiler-guided vectorisation via OpenMP SIMD directives, and explicit 256-bit AVX intrinsics.
Explicit 256-bit AVX vectorisation is controlled by the -DHEMELB_USE_AVX=ON/OFF build option. AVX is disabled by default. OpenMP SIMD is controlled by the -DHEMELB_USE_OPENMP_SIMD=ON/OFF build option. OpenMP SIMD is disabled by default.
Results
Across all systems and compilers tested, the AVX version consistently provides the best performance and scalability, outperforming the default SSE3 version. With GNU compilers on both ARCHER2 and Cirrus, the OpenMP SIMD version brings only modest gains over the non-vectorised version and performs worse than the explicit SSE3 version. With Cray compilers on ARCHER2, however, it matches the performance of the AVX version while offering better code maintainability and portability across platforms.
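For readers unfamiliar with the two approaches compared here, the following is a minimal sketch of how they differ in code. It is not taken from this PR: the function names DensityOmpSimd and DensityAvx are hypothetical, and the kernel (summing the distribution values of one lattice site to obtain its density) is a simplified stand-in for what a routine like CalculateDensityAndMomentum might do. The AVX path is guarded by the compiler-defined __AVX__ macro so the sketch still builds when AVX is not enabled.

```cpp
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical kernel (not HemeLB code): compiler-guided vectorisation.
// The pragma asks the compiler to vectorise the reduction; it is ignored
// when OpenMP SIMD support is not enabled at compile time.
double DensityOmpSimd(const double* f, std::size_t n) {
  double density = 0.0;
  #pragma omp simd reduction(+:density)
  for (std::size_t i = 0; i < n; ++i) {
    density += f[i];
  }
  return density;
}

// Hypothetical kernel (not HemeLB code): explicit 256-bit AVX intrinsics.
// Each __m256d register holds four doubles, so the main loop consumes
// four distribution values per iteration.
double DensityAvx(const double* f, std::size_t n) {
#if defined(__AVX__)
  __m256d acc = _mm256_setzero_pd();
  std::size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    // Unaligned load, since no alignment of f is assumed here.
    acc = _mm256_add_pd(acc, _mm256_loadu_pd(f + i));
  }
  // Horizontal sum of the four lanes.
  double lanes[4];
  _mm256_storeu_pd(lanes, acc);
  double density = lanes[0] + lanes[1] + lanes[2] + lanes[3];
  // Scalar tail for the remaining n % 4 values (e.g. 19 = 4 * 4 + 3).
  for (; i < n; ++i) {
    density += f[i];
  }
  return density;
#else
  // Fallback when the compiler was not asked to target AVX.
  return DensityOmpSimd(f, n);
#endif
}
```

With GCC, the pragma takes effect under -fopenmp-simd and the intrinsics path under -mavx. The sketch shows the trade-off described above: the OpenMP SIMD version stays close to the scalar loop and leaves the instruction choice to the compiler, while the AVX version fixes the vector width and needs an explicit tail loop.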
Note: Compiling HemeLB with the current Cray compilers (cce/16.0.1) on ARCHER2 required the following minor workarounds:
For the full performance comparison, see the plots below.
ARCHER2
Figure 1: Vectorisation: speedup for the retina dataset (40,000 time steps) on ARCHER2 using GNU compilers, 128 execution units per node.
Figure 2: Vectorisation: speedup for the retina dataset (40,000 time steps) on ARCHER2 using Cray compilers, 128 execution units per node.
Cirrus
Figure 3: Vectorisation: speedup for the retina dataset (40,000 time steps) on Cirrus using GNU compilers, 128 execution units per node.