Description
Background information
My application relies on many calls to MPI_Get (a few hundred per synchronization, roughly 200-600) with small messages (roughly 64 bytes to 9 KB).
I observe a very strong performance decrease when going from one node to multiple nodes.
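For reference, here is a minimal sketch of the access pattern (passive-target synchronization is assumed; the function name, buffers, targets and offsets are illustrative, not the actual application code):

```c
#include <mpi.h>

/* Sketch of the communication pattern: a few hundred small MPI_Get calls
 * (64 B - 9 KB each) issued inside a single synchronization epoch.
 * The buffers and displacements are illustrative placeholders. */
void gather_remote_pieces(MPI_Win win, int nget, int target,
                          char *recv_buf, const MPI_Aint *remote_disp,
                          const int *nbytes)
{
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    for (int i = 0; i < nget; ++i) {          /* nget is typically 200-600 */
        MPI_Get(recv_buf + (MPI_Aint)i * 9216, nbytes[i], MPI_BYTE,
                target, remote_disp[i], nbytes[i], MPI_BYTE, win);
    }
    MPI_Win_unlock(target, win);              /* one sync per batch of gets */
}
```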
This issue relates to the comment by @bosilca here
There seems to be a performance issue with the one-sided support in UCX. I used the OSU get_bw benchmark, with all types of synchronizations (fence, flush, lock/unlock) and while there is some variability the performance is consistently at a fraction of the point-to-point performance (about 1/100). Even switching the RMA support over TCP is about 20 times faster (
mpirun -np 2 --map-by node --mca pml ob1 --mca osc rdma --mca btl_tcp_if_include ib0 ../pt2pt/osu_bw
).
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
Open MPI 4.1.4 + UCX 1.12.1, but the issue is similar with Open MPI 4.1.2 with uGNI.
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
- Open MPI 4.1.4: installed with EasyBuild, with UCX 1.12.1 and OFI 1.14.0
- Open MPI 4.1.2: installed by the Cori support team
Please describe the system on which you are running
- Open MPI 4.1.4 runs on InfiniBand HDR 200 Gb/s, with large nodes (128 cores/node)
- Open MPI 4.1.2 runs on a Cray network
Details of the problem
application issue
EDIT: The issues on the IB cluster are now solved thanks to the support team.
On the Cray cluster, using a weak-scaling approach (5.3M unknowns per rank), the time spent in MPI_Get goes from 0.7308 sec on 1 node to 17.6264 sec on 8 nodes (for the same part of the code).
Similar results are observed on the IB cluster (10M unknowns per rank), where on a single node the average measured bandwidth is 260-275 Mb/s while on 8 nodes it drops to 210-220 Mb/s (the theoretical bandwidth is 200 Gb/s).
From a timing perspective, the MPI_Get calls experience a more "normal" increase of the time, from 1.0665 sec to 1.2820 sec.
Those numbers have been obtained using MPI_Win_allocate and MPI_Type_create_hvector datatypes. In a previous version of the code using MPI_Win_create, the one-node case used to be as slow as the 8-node one.
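For clarity, this is roughly how the window and datatype are set up; a sketch with made-up counts, block lengths and strides, not the actual application code:

```c
#include <mpi.h>

/* Sketch: expose memory with MPI_Win_allocate and read strided remote data
 * with an hvector datatype. All sizes here are illustrative. */
void setup_and_get(int target, MPI_Comm comm)
{
    MPI_Aint win_bytes = 1 << 20;
    double *base;
    MPI_Win win;
    MPI_Win_allocate(win_bytes, sizeof(double), MPI_INFO_NULL, comm,
                     &base, &win);

    MPI_Datatype strided;
    MPI_Type_create_hvector(/*count=*/64, /*blocklength=*/16,
                            /*stride=*/128 * sizeof(double),
                            MPI_DOUBLE, &strided);
    MPI_Type_commit(&strided);

    double local[64 * 16];
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    /* one strided get instead of many contiguous ones */
    MPI_Get(local, 64 * 16, MPI_DOUBLE, target, 0, 1, strided, win);
    MPI_Win_unlock(target, win);

    MPI_Type_free(&strided);
    MPI_Win_free(&win);
}
```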
osu benchmarks - IB network
Following previous comments, I have also run the OSU benchmark osu_get_bw for several numbers of calls per synchronization and different memory allocations (see below). I compare the bandwidth measured between 2 ranks on the same node and on different nodes. Both cases barely reach 25 Gb/s while the network is supposed to deliver 200 Gb/s.
questions
- on the Cray network: how can I reduce the performance loss?
- on the IB network: while the performance seems reasonable, I am confused by the measured bandwidth (both OSU and the real-life application). Is there any good reason for the measured bandwidth to be so low?
other related questions:
- what is the expected influence of MPI_Alloc_mem on performance for IB networks? Is the gain specific to RMA, or does it benefit every MPI call? (see the sketch after this list)
- what is the influence of export OMPI_MCA_pml_ucx_multi_send_nb=1? It is set to 0 by default in my configuration.
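To make the first question concrete, here is a hedged sketch of the change I have in mind: passing memory obtained from MPI_Alloc_mem to MPI_Win_create instead of a plain malloc'ed buffer (the size is illustrative):

```c
#include <mpi.h>

/* Sketch for the MPI_Alloc_mem question: register memory obtained from
 * MPI_Alloc_mem with MPI_Win_create rather than a malloc'ed buffer. */
void make_window(MPI_Comm comm, MPI_Win *win)
{
    MPI_Aint bytes = 1 << 20;                      /* illustrative size */
    double *base;
    MPI_Alloc_mem(bytes, MPI_INFO_NULL, &base);    /* may return registered memory */
    MPI_Win_create(base, bytes, sizeof(double), MPI_INFO_NULL, comm, win);
    /* ... use the window ...; later: MPI_Win_free(win); MPI_Free_mem(base); */
}
```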
At this stage it's not clear to me whether there is indeed a performance issue or whether this is the best the implementation can do.
Also, maybe the configuration is not appropriate for the way we use MPI-RMA.
I will be happy to try any suggestion you might have.
Thanks for your help!