Add hybrid MPI+OpenMP support #826
Overview
This PR introduces hybrid parallelism by integrating OpenMP into the existing MPI-based code. The computationally intensive collision and streaming parts were parallelised with OpenMP loops, with the aim of better exploiting shared-memory parallelism within a node.
OpenMP support is controlled by the `-DHEMELB_USE_OPENMP=ON/OFF` build option and is disabled by default.
Results
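A typical configure-and-run sequence might look like the following. Only the `-DHEMELB_USE_OPENMP` flag is taken from this PR; the build directory layout and the use of `OMP_NUM_THREADS` to set the thread count per MPI rank are standard assumptions, not something specified here.

```shell
# Configure with OpenMP enabled (disabled by default).
cmake -DHEMELB_USE_OPENMP=ON ..
make

# At run time, choose the number of OpenMP threads per MPI rank
# via the standard environment variable.
export OMP_NUM_THREADS=16
```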
The pure MPI reference implementation consistently delivers the best performance and scalability across all tested configurations, compilers and platforms. However, at low node counts the OpenMP version shows promising results, slightly outperforming the pure MPI version. This suggests that on a larger input geometry with more lattice sites per rank (and therefore more iterations per OpenMP loop), using OpenMP could still be beneficial.
The full performance comparison is shown in the plots below.
ARCHER2
Figure 1: Hybrid parallelism: speedup for the retina dataset (40,000 time steps) on ARCHER2 using GNU compilers, 128 execution units per node.
Figure 2: Hybrid parallelism: speedup for the retina dataset (40,000 time steps) on ARCHER2 using Cray compilers, 128 execution units per node.
Figure 3: Hybrid parallelism: simulation time on 4 nodes on ARCHER2 using GNU compilers, 128 execution units per node.
Cirrus
Figure 4: Hybrid parallelism: speedup for the retina dataset (40,000 time steps) on Cirrus using GNU compilers, 128 execution units per node.