Log of the GPU Porting with problems (and solutions sometimes)
Dear All,
I'm currently working on the GPU porting of the code with OpenACC.
At the moment, FFTx, FFTy and DCTz are performed using the cuFFT library (machine=15 and openaccflag=1).
Since OpenACC is a directive-based language, on all the other machines the compiler will simply ignore the directives and compile as usual.
However, for memory-management reasons, I had to modify fft*_fwd.f90, fft*_bwd.f90 and dctz.f90.
In particular, I modified how the arrays are passed between the subroutines (from physical to spectral to the FFTs, and from spectral to physical and back). These subroutines are now also included inside a module, so as to avoid passing the array sizes between subroutines. This modification is in line with the modern Fortran 90 style of passing arrays (see
https://w3.pppl.gov/~hammett/comp/f90_arrays.txt).
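As a minimal sketch of what this looks like (the subroutine and array names below are illustrative, not the actual FLOW36 routines): putting the transform routines in a module gives them an explicit interface, so assumed-shape dummy arguments carry their own sizes and no extra size arguments have to be passed around.

```fortran
module fft_transforms
  implicit none
contains
  ! Forward transform along x: the sizes are taken from the assumed-shape
  ! dummy arguments, so the caller does not pass nx/ny/nz explicitly.
  subroutine fftx_fwd(u_phys, u_spec)
    real(8),    intent(in)  :: u_phys(:,:,:)
    complex(8), intent(out) :: u_spec(:,:,:)
    integer :: nx, ny, nz
    nx = size(u_phys, 1); ny = size(u_phys, 2); nz = size(u_phys, 3)
    ! ... call the FFT backend (FFTW or cuFFT) here ...
  end subroutine fftx_fwd
end module fft_transforms
```

A caller simply does use fft_transforms and passes the arrays; the module is what provides the explicit interface that assumed-shape passing requires.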
Both gfortran and nvfortran (ex PGI) seem to have no problems with this modification.
However, as I cannot test all the possible flow configurations and machines, please report any problem or possible issue in this thread (or via e-mail to me).
Some remarks on the cuFFT implementation of the FFTx, FFTy and DCTz:
- The cuFFT library has plans similar to FFTW; everything works fine without major issues for the x and y directions. No problems with the creation of plans: all transforms can be computed in one batch (also because the transform directions are the first and the last and comply with the advanced data layout).
- DCT (real-to-real transform): this type of transform is not implemented in the cuFFT library. A possible solution (for simplicity, consider a single complex vector of length nz) is the following: 1) make the vector even-symmetric (the new length is 2*(nz-1)); 2) compute the FFT (real-to-complex) of the real part and of the imaginary part separately; 3) each result is a complex vector of length nz; 4) the final result is obtained by combining the real parts of the transforms of the real and imaginary inputs (see the sketch after this list).
- The DCT is performed along the column direction; as with FFTW, it is not possible to compute all the DCTz in one shot while keeping the standard matrix configuration. It can only be done by rows (as before) or by slices.
Both approaches are very slow (lots of synchronization; slices are generally faster but still not enough). The solution currently employed is to transpose the matrix and then do the DCT (in one shot) along the first direction. With this solution the main bottleneck is the transposition (90% of the time). A 1D DCT for vectors has also been implemented (required in sim_check); the wfd one in statistics.f90 is currently deactivated.
For info on this point check: https://stackoverflow.com/questions/26918101/1d-ffts-of-columns-and-rows-of-a-3d-matrix-in-cuda
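To make the even-extension trick concrete, here is a small self-contained sketch that uses a naive O(nz^2) DFT in place of cuFFT, purely to show the identity; a real input vector is used, and for a complex vector the same operation is simply applied to the real and imaginary parts separately (names and sizes are illustrative).

```fortran
program dct_via_even_extension
  implicit none
  integer,  parameter :: dp = kind(1.0d0)
  integer,  parameter :: nz = 8, next = 2*(nz-1)
  real(dp), parameter :: pi = acos(-1.0_dp)
  real(dp)    :: x(nz), xe(next), dct(nz)
  complex(dp) :: acc
  integer     :: j, k

  call random_number(x)

  ! 1) even-symmetric extension: xe = [x(1), ..., x(nz), x(nz-1), ..., x(2)]
  xe(1:nz) = x
  do j = 1, nz - 2
     xe(nz + j) = x(nz - j)
  end do

  ! 2) DFT of the extended real vector; with cuFFT this would be a single
  !    real-to-complex transform of length 2*(nz-1)
  do k = 0, nz - 1
     acc = (0.0_dp, 0.0_dp)
     do j = 0, next - 1
        acc = acc + xe(j + 1)*exp(cmplx(0.0_dp, -2.0_dp*pi*j*k/next, dp))
     end do
     ! 3) by symmetry the imaginary part vanishes: the real part is the DCT-I
     dct(k + 1) = real(acc, dp)
  end do

  print *, dct
end program dct_via_even_extension
```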
M100 (27/05/2022):
Do not use in the makefile: LIBS = -I$(CUDA_INC) -L$(CUDA_LIB) -lcufft
But instead use: LIBS = -cudalib=cufft
With the first way of linking, there can be a mismatch between the CUDA and cuFFT versions (with HPC-SDK 2021) that leads to erroneous results.
The second way is simpler and there is no need to load the cuda module (just hpc-sdk).
Compiling options for nvfortran:
-Mconcur: Instructs the compiler to enable auto-concurrentization of loops
Marginally faster on Intel (Tersicore) but slower on IBM (M100)
M100 (29/07/2022):
Solver and Transforms entirely on the GPUs (version 8).
Most of the time is lost in convective_ns.f90 (especially in the MPI transpositions).
The situation has largely improved after porting some further sections to the GPU as well.
Only MPI communications pass through CPU.
M100 (03/08/2022):
CUDA-aware MPI has been implemented, with a nice speed-up of the code.
The current bottleneck is the MPI communications (computation is only 20% of the time).
MPI Libraries.
- OpenMPI: just load hpc-sdk (the built-in openMPI is CUDA-aware).
Single-node performance is not so good, but it works fine on multiple nodes (scalability is ok).
4 MPI Tasks on a single node: 0.28s/time step (256^3, single-phase solver)
8 MPI Tasks on two nodes: 0.20s/time step (256^3, single-phase solver)
- Spectrum MPI: load first hpc-sdk/2021--binary and then spectrum_mpi/10.4.0--binary (otherwise mpipgifort does not work).
Single-node performance is very good, but there are problems when running on multiple nodes.
4 MPI Tasks on a single node: 0.11s/time step (256^3, single-phase solver)
8 MPI Tasks on two nodes: 0.40s/time step (256^3, single-phase solver) with large fluctuations (20 seconds?)
M100 (08/09/2022):
Note on MPI Spectrum.
GPU-awareness does not seem to work well with more than one node, performance-wise (very high elapsed time per time step).
With Spectrum MPI, all MPI calls will use GPU buffers, independently of the !$acc host_data directive.
With openMPI, instead, !$acc host_data is required to enable the use of GPU buffers.
On M100, MPI Spectrum (without -gpu flag) is faster than openMPI.
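As a hedged sketch of the openMPI case (the buffer names, datatype and exchange pattern are illustrative): !$acc host_data use_device hands the device addresses of the buffers to MPI, which is what enables the CUDA-aware path; with Spectrum MPI the same call reportedly uses GPU buffers even without the directive.

```fortran
! Illustrative sketch: a CUDA-aware exchange of one send/receive buffer pair.
subroutine gpu_exchange(bufs, bufr, n, dest, source, comm)
  use mpi
  implicit none
  integer, intent(in)    :: n, dest, source, comm
  real(8), intent(inout) :: bufs(n), bufr(n)
  integer :: ierr
  ! bufs/bufr are assumed to be already resident on the device (managed
  ! memory or an enclosing data region); host_data exposes their device
  ! addresses so MPI can move the data GPU-to-GPU without host staging.
  !$acc host_data use_device(bufs, bufr)
  call MPI_Sendrecv(bufs, n, MPI_DOUBLE_PRECISION, dest,   0, &
                    bufr, n, MPI_DOUBLE_PRECISION, source, 0, &
                    comm, MPI_STATUS_IGNORE, ierr)
  !$acc end host_data
end subroutine gpu_exchange
```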
M100 (08/09/2022)
MPI improvement:
In the subroutine yz2xy.f90, there is a possible solution for the use of non-blocking MPI communications (Isend+Irecv+waitall).
This should be the best solution (everything is async).
The performance improvement on a small number of nodes is marginal (-5%).
Must be tested with a large number of nodes and fully validated.
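A minimal sketch of that non-blocking pattern (ranks, tags and the buffer layout are illustrative, not the actual yz2xy.f90 code): all receives and sends are posted first and a single waitall closes the exchange, so nothing blocks in between.

```fortran
! Illustrative sketch: exchange of transpose chunks with all communications
! posted up front (Irecv + Isend + Waitall).
subroutine transpose_exchange(bufs, bufr, chunk, nprocs, comm)
  use mpi
  implicit none
  integer, intent(in)  :: chunk, nprocs, comm
  real(8), intent(in)  :: bufs(chunk, nprocs)
  real(8), intent(out) :: bufr(chunk, nprocs)
  integer :: req(2*nprocs), ierr, ip
  do ip = 1, nprocs
     call MPI_Irecv(bufr(:, ip), chunk, MPI_DOUBLE_PRECISION, ip - 1, 0, &
                    comm, req(ip), ierr)
  end do
  do ip = 1, nprocs
     call MPI_Isend(bufs(:, ip), chunk, MPI_DOUBLE_PRECISION, ip - 1, 0, &
                    comm, req(nprocs + ip), ierr)
  end do
  ! a single wait for everything: the exchange itself is fully asynchronous
  call MPI_Waitall(2*nprocs, req, MPI_STATUSES_IGNORE, ierr)
end subroutine transpose_exchange
```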
M100+ (15/09/2022)
A long/short note on MPI Spectrum.
MPI Spectrum works well only when memory is not managed (pinned memory).
However, the use of pinned memory (higher bandwidth) requires manual data management, which is rather complex with the current code (the subroutines are very stratified and it is difficult to keep track of the memory usage and optimize it).
Simple copyin/copyout around the kernels makes the code much slower (4x slower), and allocations/deallocations must be carefully tracked.
Also, the structured !$acc data + !$acc end data approach is not really flexible, and the use of unstructured directives (enter data + exit data, possibly with async) is not very easy.
At the moment, I am giving up on pinned memory; managed memory is a bit simpler to implement, especially if modifications of the code are required (with managed memory this is very easy and not so badly optimized).
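For reference, a minimal sketch of the two styles (the array names and sizes are illustrative): the structured region is tied to one lexical block, while the unstructured directives must be paired manually across the code, which is exactly what becomes hard to track in a deeply stratified call tree.

```fortran
program data_regions
  implicit none
  integer, parameter :: n = 1024
  integer :: i
  real(8) :: u(n), v(n)
  u = 1.0d0

  ! Structured data region: device lifetime bounded by this block
  !$acc data copyin(u) copyout(v)
  !$acc parallel loop
  do i = 1, n
     v(i) = 2.0d0*u(i)
  end do
  !$acc end data

  ! Unstructured directives: the lifetime can span subroutines, but every
  ! enter data needs a matching exit data somewhere else in the code
  !$acc enter data copyin(u) create(v)
  !$acc parallel loop present(u, v)
  do i = 1, n
     v(i) = 3.0d0*u(i)
  end do
  !$acc exit data copyout(v) delete(u)

  print *, v(1)
end program data_regions
```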
M100+ (27/09/2022)
The bufs and bufr arrays can be defined as pinned (even if gpu=managed is enabled). The -cuda flag must also be enabled in this case.
In Nsight Systems, the P2P communication (GPU to GPU) is clearly visible (no more page faults).
Using this trick, the -gpu flag can be used with MPI Spectrum, and the problem discussed before seems gone.
There is another problem: allocation and deallocation of pinned memory is very expensive.
Bufr and bufs must be allocated during initialization; doing so (first test, with manual sizing of the arrays), scalability improves and the allocation overhead disappears. Some work is however required to do this.
In general, performance is very good, as is scalability.
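A hedged sketch of the trick (the module, array names and initialization routine are illustrative): the pinned attribute is CUDA Fortran, hence the extra -cuda flag, and the allocation is done once at start-up because pinned allocation is expensive.

```fortran
module comm_buffers
  implicit none
  ! Pinned (page-locked) host buffers: the CUDA Fortran pinned attribute
  ! keeps them out of managed memory even when compiling with -gpu=managed,
  ! which is what allows the direct GPU-to-GPU path without page faults.
  ! Requires -cuda in addition to -acc.
  real(8), allocatable, pinned :: bufs(:), bufr(:)
contains
  subroutine init_buffers(nbuf)
    integer, intent(in) :: nbuf
    ! allocate once during initialization: repeated allocation and
    ! deallocation of pinned memory is very expensive
    allocate(bufs(nbuf), bufr(nbuf))
  end subroutine init_buffers
end module comm_buffers
```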
M100+ (06/06/2023).
Trying to move the code from mpi to mpi_f08, I found out something very interesting: even using the openMPI library included in hpc-sdk 2023 (with just this package loaded), I can match the performance of the MPI Spectrum library.
This simplifies code development a lot. This is version 11a (preallocated buffers), so maybe that is the reason?
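For reference, a minimal sketch of what the switch changes at a call site (illustrative, not FLOW36 code): mpi_f08 replaces the integer handles of use mpi with derived types and makes the ierror argument optional.

```fortran
subroutine irecv_f08(buf, n)
  use mpi_f08
  implicit none
  integer, intent(in)    :: n
  real(8), intent(inout) :: buf(n)
  type(MPI_Request) :: req    ! an integer handle with "use mpi"
  type(MPI_Status)  :: stat
  call MPI_Irecv(buf, n, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, req)
  call MPI_Wait(req, stat)    ! ierror can be omitted with mpi_f08
end subroutine irecv_f08
```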
Leonardo (25/08/2023).
Last release of FLOW36 ported on Leonardo with success (NVHPC 23.1 + openmpi).
The compiler is mpif90 (nvfortran + openMPI); I don't know why it's not included.
Performance looks very good on the classic benchmark (256^3): 85 ms without any code modifications, and 40 ms with the pinned-memory trick on bufs and bufr.
This is impressive: it is almost a 2.2x speed-up compared to M100.
mpi_f08 and use mpi do not seem to make a big difference (as expected).
Replies: 1 comment
23/07/2024 Main issues: The computational cost of the LPT is small.