@toretto-uk toretto-uk commented Aug 17, 2025

Overview

This PR introduces hybrid parallelism by integrating OpenMP into the existing MPI-based code. The computationally intensive collision and streaming steps are parallelised with OpenMP loops to better exploit shared-memory parallelism within a node.
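For readers unfamiliar with the pattern, here is a minimal sketch of an OpenMP-parallelised collision loop. It is not the code in this PR: the three-velocity layout, the `Site` type, `kWeights` and `collide` are all hypothetical, the equilibrium assumes zero velocity, and streaming is left out. It only shows the idea of one independent loop iteration per lattice site.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical three-velocity layout, chosen only to keep the sketch short.
constexpr std::size_t Q = 3;
constexpr std::array<double, Q> kWeights = {2.0 / 3.0, 1.0 / 6.0, 1.0 / 6.0};

using Site = std::array<double, Q>;

// BGK-style relaxation towards a zero-velocity equilibrium.
// Each iteration reads and writes only its own site, so iterations are independent.
void collide(std::vector<Site>& f, double omega) {
    assert(omega > 0.0 && omega < 2.0);
    const std::ptrdiff_t n_sites = static_cast<std::ptrdiff_t>(f.size());

    // Ignored by the compiler when OpenMP is not enabled; the loop then runs serially.
    #pragma omp parallel for schedule(static)
    for (std::ptrdiff_t i = 0; i < n_sites; ++i) {
        Site& site = f[static_cast<std::size_t>(i)];

        // Local density is the sum of the populations at this site.
        double rho = 0.0;
        for (double fq : site) {
            rho += fq;
        }

        // Relax every population towards its equilibrium value.
        for (std::size_t q = 0; q < Q; ++q) {
            site[q] += omega * (kWeights[q] * rho - site[q]);
        }
    }
}
```

With few sites per rank the loop has few iterations to share between threads, which is one reason the per-loop OpenMP overhead can dominate.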

OpenMP support is controlled by the -DHEMELB_USE_OPENMP=ON/OFF build option and is disabled by default.

Results

The pure MPI reference implementation delivers the best overall performance and scalability across the tested configurations, compilers and platforms. At low node counts, however, the OpenMP version shows promising results and slightly outperforms pure MPI. This suggests that a larger input geometry, with more lattice sites per rank and therefore more iterations per OpenMP loop, could still benefit from OpenMP.

The plots below give the full performance comparison.

ARCHER2

image

Figure 1: Hybrid parallelism: speedup for the retina dataset (40,000 time steps) on ARCHER2 using GNU compilers, 128 execution units per node.

image

Figure 2: Hybrid parallelism: speedup for the retina dataset (40,000 time steps) on ARCHER2 using Cray compilers, 128 execution units per node.

image

Figure 3: Hybrid parallelism: simulation time on 4 nodes on ARCHER2 using GNU compilers, 128 execution units per node.

Cirrus

image

Figure 4: Hybrid parallelism: speedup for the retina dataset (40,000 time steps) on Cirrus using GNU compilers, 128 execution units per node.

@toretto-uk toretto-uk marked this pull request as ready for review September 16, 2025 07:39
Member

@rupertnash rupertnash left a comment

Thanks for the work! I will have to review your dissertation to get the full details of the change in performance, but there are a few minor code problems to fix before this can be considered for a merge. I notice that you haven't touched the initialisation code (in lb::InitialCondition), which can make a huge difference in performance when NUMA effects are in play (due to the typical first-touch page allocation strategy).
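To illustrate the first-touch point, here is a minimal sketch. It is not HemeLB code: `allocate_first_touch` and `relax_all` are hypothetical names. It assumes a Linux-style policy where a page is placed on the NUMA domain of the thread that first writes it, and that threads are pinned (for example via `OMP_PROC_BIND`). The idea is to initialise the data with the same static loop schedule as the compute loop, so each thread first touches the range it will later work on.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Allocate n doubles and initialise them in parallel.
// `new double[n]` default-initialises, so no element is written here;
// the parallel loop below performs the first write to each page.
std::unique_ptr<double[]> allocate_first_touch(std::ptrdiff_t n, double value) {
    assert(n >= 0);
    std::unique_ptr<double[]> data(new double[static_cast<std::size_t>(n)]);

    #pragma omp parallel for schedule(static)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        data[static_cast<std::size_t>(i)] = value;
    }
    return data;
}

// Compute loop using the same static schedule as the initialisation,
// so a thread typically revisits the index range it touched first.
void relax_all(double* f, std::ptrdiff_t n, double omega, double f_eq) {
    assert(n >= 0);

    #pragma omp parallel for schedule(static)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        f[i] += omega * (f_eq - f[i]);
    }
}
```

By contrast, `std::vector<double>(n)` value-initialises every element on the calling thread, so under first-touch all pages would tend to land on that thread's NUMA domain.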

This needs to use find_package(OpenMP) and then target_link_libraries(... OpenMP::OpenMP_CXX).

Unacceptable use of ifdef in new code. This should be refactored to minimise the code that differs when OpenMP is enabled. If different code is required, use if constexpr.
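One possible shape for this, as a sketch only: `for_each_site` and `scale_all` are hypothetical names, and the `UseOpenMP` flag is assumed to come from a single compile-time constant set by the build option. The kernel is written once, and the loop driver is the only place where the two variants differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Loop driver: the only code that differs between serial and OpenMP variants.
template <bool UseOpenMP, typename Body>
void for_each_site(std::ptrdiff_t n_sites, Body body) {
    assert(n_sites >= 0);
    if constexpr (UseOpenMP) {
        // Safe only if iterations are independent of each other.
        #pragma omp parallel for schedule(static)
        for (std::ptrdiff_t i = 0; i < n_sites; ++i) {
            body(i);
        }
    } else {
        for (std::ptrdiff_t i = 0; i < n_sites; ++i) {
            body(i);
        }
    }
}

// Kernel written once, independent of how the loop is driven.
template <bool UseOpenMP>
void scale_all(std::vector<double>& values, double factor) {
    double* data = values.data();
    for_each_site<UseOpenMP>(
        static_cast<std::ptrdiff_t>(values.size()),
        [data, factor](std::ptrdiff_t i) { data[i] *= factor; });
}
```

Both instantiations compile in every build, so the serial path cannot silently rot when OpenMP is enabled, and vice versa.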
