@toretto-uk toretto-uk commented Aug 17, 2025

Overview

This PR introduces improved vectorisation in performance-critical collision step routines (e.g., CalculateDensityAndMomentum, CalculateFeq). Two approaches were added: compiler-guided vectorisation via OpenMP SIMD directives, and explicit 256-bit AVX intrinsics.
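As a rough illustration of the two approaches, here is a minimal sketch that accumulates a density and one momentum component. It is not the code from this PR: `Moments`, `MomentsOmpSimd`, `MomentsAvx`, and the flat `f`/`cx` array layout are hypothetical names and assumptions made for the example, and the AVX path is only compiled when the compiler targets AVX (for example with `-mavx`).

```cpp
#include <cstddef>

#if defined(__AVX__)
#include <immintrin.h>  // AVX intrinsics; only included when targeting AVX
#endif

// Hypothetical result type: density and the x-component of momentum.
struct Moments {
  double density;
  double momentum_x;
};

// Approach 1: compiler-guided vectorisation via an OpenMP SIMD directive.
// If OpenMP SIMD support is not enabled, the pragma is ignored and the
// loop runs as ordinary scalar code with the same result.
Moments MomentsOmpSimd(const double* f, const double* cx, std::size_t n) {
  double density = 0.0;
  double momentum_x = 0.0;
#pragma omp simd reduction(+ : density, momentum_x)
  for (std::size_t i = 0; i < n; ++i) {
    density += f[i];
    momentum_x += f[i] * cx[i];
  }
  return {density, momentum_x};
}

// Approach 2: explicit 256-bit AVX intrinsics (four doubles per register).
// Falls back to the scalar loop alone when AVX is not available.
Moments MomentsAvx(const double* f, const double* cx, std::size_t n) {
  double density = 0.0;
  double momentum_x = 0.0;
  std::size_t i = 0;
#if defined(__AVX__)
  __m256d v_density = _mm256_setzero_pd();
  __m256d v_momentum = _mm256_setzero_pd();
  for (; i + 4 <= n; i += 4) {
    const __m256d v_f = _mm256_loadu_pd(f + i);   // unaligned load of f[i..i+3]
    const __m256d v_c = _mm256_loadu_pd(cx + i);  // unaligned load of cx[i..i+3]
    v_density = _mm256_add_pd(v_density, v_f);
    v_momentum = _mm256_add_pd(v_momentum, _mm256_mul_pd(v_f, v_c));
  }
  // Horizontal sum: spill the four lanes and add them up.
  double lanes[4];
  _mm256_storeu_pd(lanes, v_density);
  density = lanes[0] + lanes[1] + lanes[2] + lanes[3];
  _mm256_storeu_pd(lanes, v_momentum);
  momentum_x = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
  // Scalar tail: handles the remaining elements (or all of them without AVX).
  for (; i < n; ++i) {
    density += f[i];
    momentum_x += f[i] * cx[i];
  }
  return {density, momentum_x};
}
```

The directive version leaves instruction selection to the compiler, which is why its benefit can vary between toolchains, whereas the intrinsics version fixes the vector width in the source. The other momentum components would follow the same pattern.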

Explicit 256-bit AVX vectorisation is controlled by the -DHEMELB_USE_AVX=ON/OFF build option, and OpenMP SIMD by the -DHEMELB_USE_OPENMP_SIMD=ON/OFF build option. Both are disabled by default.
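To show how such build options could select a code path at compile time, here is a hedged sketch. It assumes the build system forwards each option to the compiler as a preprocessor definition of the same name (`HEMELB_USE_AVX`, `HEMELB_USE_OPENMP_SIMD`); whether HemeLB does exactly this is an assumption. The helpers `ActiveVectorisation` and `SumDistributions` are hypothetical, and the fallback here is a plain loop rather than the existing SSE3 code.

```cpp
#include <cstddef>

// Assumption: the build options arrive as preprocessor definitions.
#if defined(HEMELB_USE_AVX) && defined(__AVX__)
#include <immintrin.h>
#define SKETCH_AVX_PATH 1
#else
#define SKETCH_AVX_PATH 0
#endif

// Hypothetical helper: reports which path this translation unit was built with.
inline const char* ActiveVectorisation() {
#if SKETCH_AVX_PATH
  return "avx";
#elif defined(HEMELB_USE_OPENMP_SIMD)
  return "openmp-simd";
#else
  return "default";
#endif
}

// Hypothetical kernel: sums n distribution values using the selected path.
double SumDistributions(const double* f, std::size_t n) {
  double total = 0.0;
  std::size_t i = 0;
#if SKETCH_AVX_PATH
  __m256d v_total = _mm256_setzero_pd();
  for (; i + 4 <= n; i += 4) {
    v_total = _mm256_add_pd(v_total, _mm256_loadu_pd(f + i));
  }
  double lanes[4];
  _mm256_storeu_pd(lanes, v_total);
  total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
#if !SKETCH_AVX_PATH && defined(HEMELB_USE_OPENMP_SIMD)
#pragma omp simd reduction(+ : total)
#endif
  for (; i < n; ++i) {
    total += f[i];  // scalar tail, or the whole loop on the non-AVX paths
  }
  return total;
}
```

With neither definition set, the function compiles to the plain loop, which mirrors both options being off by default.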

Results

Across all systems and compilers tested, the AVX version consistently delivers the best performance and scalability, outperforming the default SSE3 version. With GNU compilers on both ARCHER2 and Cirrus, the OpenMP SIMD version brings only modest gains over the non-vectorised version and performs worse than the explicit SSE3 version. With Cray compilers on ARCHER2, however, it matches the performance of the AVX version while offering better code maintainability and portability across platforms.

Note: Compiling HemeLB with the current Cray compilers (cce/16.0.1) on ARCHER2 required the following minor workarounds:

The full performance comparison is shown in the plots below.

ARCHER2

image

Figure 1: Vectorisation: speedup for the retina dataset (40,000 time steps) on ARCHER2 using GNU compilers, 128 execution units per node.

image

Figure 2: Vectorisation: speedup for the retina dataset (40,000 time steps) on ARCHER2 using Cray compilers, 128 execution units per node.

Cirrus

image

Figure 3: Vectorisation: speedup for the retina dataset (40,000 time steps) on Cirrus using GNU compilers, 128 execution units per node.

@toretto-uk toretto-uk marked this pull request as ready for review September 16, 2025 07:39