Description
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
Open MPI v4.1.2
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Compiled from a release source tarball, with UCX support
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status
N/A (not building from a git clone).
Please describe the system on which you are running
- Operating system/version: Rocky Linux 8
- Computer hardware: Gadi supercomputer, 48 cores per node. See here for more details: https://nci.org.au/our-systems/hpc-systems
- Network type: InfiniBand (IB)
Details of the problem
Problem I'm trying to solve: halo exchange for an ocean model running on an unstructured mesh.
Solution 1: I was using RMA; each partition process would do:
MPI_Win_fence(...)
MPI_Get(...)
MPI_Win_fence(...)
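In C this looked roughly like the sketch below (hypothetical names; the window and neighbour lists are assumed to be set up once at start-up):

```c
#include <mpi.h>

/* Minimal sketch, hypothetical names: 'win' exposes each rank's field,
 * 'nbr' holds the neighbouring partition ranks, and 'count'/'disp'
 * describe where each neighbour's halo data sits in its window. */
static void halo_exchange_rma(MPI_Win win, double *recv_buf,
                              const int *nbr, int nnbr,
                              const int *count, const MPI_Aint *disp)
{
    MPI_Win_fence(0, win);                          /* open access epoch       */
    for (int n = 0; n < nnbr; n++) {
        MPI_Get(recv_buf, count[n], MPI_DOUBLE,     /* pull halo cells from    */
                nbr[n], disp[n],                    /* neighbour n's window    */
                count[n], MPI_DOUBLE, win);
        recv_buf += count[n];
    }
    MPI_Win_fence(0, win);                          /* close epoch: data ready */
}
```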
Scaling performance was terrible: as the number of partitions increased, the total number of exchanges did not decrease much, while the exchanges between partitions got smaller. I tried all sorts of synchronisation methods; eventually the advice from the HPC folks was to not use RMA.
Solution 2: Switched to MPI_Sendrecv(), and performance (i.e. the total time for the comms) improved 10-fold!
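The replacement pattern is essentially one MPI_Sendrecv per neighbour (again a sketch with hypothetical names):

```c
#include <mpi.h>

/* Sketch, hypothetical names: for each neighbour, swap the boundary
 * cells this rank owns for the halo cells the neighbour owns. */
static void halo_exchange_sendrecv(MPI_Comm comm, const int *nbr, int nnbr,
                                   double **send_buf, const int *send_count,
                                   double **recv_buf, const int *recv_count)
{
    for (int n = 0; n < nnbr; n++) {
        MPI_Sendrecv(send_buf[n], send_count[n], MPI_DOUBLE, nbr[n], 0,
                     recv_buf[n], recv_count[n], MPI_DOUBLE, nbr[n], 0,
                     comm, MPI_STATUS_IGNORE);
    }
}
```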
BUT, this was in my test program. While the comms time in my full application did improve "most of the time", every now and then the Sendrecv() would take many seconds to complete. The timing would look something like this:
fill_3d_w2w : 3.64980
fill_3d_w2w : 1.35756
fill_3d_w2w : 0.01945
fill_3d_w2w : 0.01938
fill_3d_w2w : 0.01928
fill_3d_w2w : 0.01969
fill_3d_w2w : 9.61830
fill_3d_w2w : 0.01991
fill_3d_w2w : 0.01956
fill_3d_w2w : 0.01946
fill_3d_w2w : 0.01933
fill_3d_w2w : 0.01984
fill_3d_w2w : 0.01907
fill_3d_w2w : 0.01945
fill_3d_w2w : 0.01974
fill_3d_w2w : 0.01916
fill_3d_w2w : 0.38533
fill_3d_w2w : 7.96889
fill_3d_w2w : 0.01937
fill_3d_w2w : 0.01916
fill_3d_w2w : 0.01936
fill_3d_w2w : 0.01932
fill_3d_w2w : 0.01008
fill_3d_w2w : 0.01040
fill_3d_w2w : 0.01947
fill_3d_w2w : 0.02222
fill_3d_w2w : 0.01956
fill_3d_w2w : 0.01943
fill_3d_w2w : 0.01919
fill_3d_w2w : 0.01946
fill_3d_w2w : 0.00461
fill_3d_w2w : 11.92141
fill_3d_w2w : 0.02010
There is an MPI_Barrier just before the comms, and I'm wrapping gettimeofday to get the timing. I have also tried MPI_Isend with MPI_Recv, and even created a Dist_graph communicator and used MPI_Neighbor_alltoallw, all with similar results.
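For reference, the neighbourhood-collective variant looked roughly like this (a sketch with hypothetical names; MPI_Dist_graph_create_adjacent is one way to build the graph communicator, not necessarily exactly what I used):

```c
#include <mpi.h>

/* Build the graph communicator once: each neighbour is both a source
 * and a destination for this rank. */
MPI_Comm build_halo_comm(MPI_Comm comm, const int *nbr, int nnbr)
{
    MPI_Comm gcomm;
    MPI_Dist_graph_create_adjacent(comm,
                                   nnbr, nbr, MPI_UNWEIGHTED,   /* in-edges  */
                                   nnbr, nbr, MPI_UNWEIGHTED,   /* out-edges */
                                   MPI_INFO_NULL, 0, &gcomm);
    return gcomm;
}

/* One halo exchange: counts, byte displacements and datatypes are per
 * neighbour, in the same order as 'nbr' above. */
void halo_exchange_neigh(MPI_Comm gcomm,
                         const void *sbuf, const int *scounts,
                         const MPI_Aint *sdispls, const MPI_Datatype *stypes,
                         void *rbuf, const int *rcounts,
                         const MPI_Aint *rdispls, const MPI_Datatype *rtypes)
{
    MPI_Neighbor_alltoallw(sbuf, scounts, sdispls, stypes,
                           rbuf, rcounts, rdispls, rtypes, gcomm);
}
```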
While MPI_Get was slower, it was always consistent/stable, i.e. the timing for each iteration was very similar. Of course, load imbalance can always be an issue, but here I'm just doing a tic/toc around the comms, with a Barrier just before.
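This is roughly how each fill_3d_w2w number above is produced (sketch; do_halo_exchange stands in for whichever comms variant is being tested):

```c
#include <mpi.h>
#include <stdio.h>
#include <sys/time.h>

void do_halo_exchange(void);   /* placeholder for the actual comms call */

/* Wall-clock seconds via gettimeofday (what I'm wrapping). */
static double wtime(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

/* Per iteration: barrier first so all ranks start together, then time
 * only the communication itself. */
void timed_halo_exchange(void)
{
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = wtime();
    do_halo_exchange();
    printf("fill_3d_w2w : %.5f\n", wtime() - t0);
}
```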
Note also, this is all on the same node.
Any ideas?