MPI-RMA - performance issues with MPI_Get #10573

Open
@thomasgillis

Description

Background information

My application relies on many calls to MPI_Get (a few hundred per synchronization, roughly 200-600) with small messages (roughly 64 bytes to 9 kB).
I observe a very strong performance drop when going from one node to multiple nodes.
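
For context, here is a minimal sketch of that access pattern. The synchronization actually used by the application is not stated (it could be fence- or lock-based), so this sketch assumes passive-target MPI_Win_lock_all/MPI_Win_flush_all; the window, targets, offsets, and counts are illustrative placeholders rather than the real code:

```c
#include <mpi.h>
#include <stddef.h>

/* Sketch only: one synchronization epoch issuing many small MPI_Get calls
 * (a few hundred, 64 B to ~9 kB each). All arguments are placeholders. */
void sync_phase(MPI_Win win, int n_gets, const int *targets,
                const MPI_Aint *offsets, const int *counts, char *recv_buf)
{
    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);

    size_t pos = 0;
    for (int i = 0; i < n_gets; ++i) {
        MPI_Get(recv_buf + pos, counts[i], MPI_BYTE,
                targets[i], offsets[i], counts[i], MPI_BYTE, win);
        pos += (size_t)counts[i];
    }

    MPI_Win_flush_all(win);   /* complete all outstanding gets */
    MPI_Win_unlock_all(win);
}
```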

This issue relates to @bosilca's comment here:

There seems to be a performance issue with the one-sided support in UCX. I used the OSU get_bw benchmark, with all types of synchronizations (fence, flush, lock/unlock) and while there is some variability the performance is consistently at a fraction of the point-to-point performance (about 1/100). Even switching the RMA support over TCP is about 20 times faster (mpirun -np 2 --map-by node --mca pml ob1 --mca osc rdma --mca btl_tcp_if_include ib0 ../pt2pt/osu_bw).

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

Open MPI 4.1.4 + UCX 1.12.1; the issue is similar with Open MPI 4.1.2 + ugni.

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

  • Open MPI 4.1.4 built with EasyBuild, with UCX 1.12.1 and OFI 1.14.0
  • Open MPI 4.1.2 installed by the Cori support team

Please describe the system on which you are running

  • Open MPI 4.1.4 runs on InfiniBand HDR (200 Gb/s), with large nodes (128 cores/node)
  • Open MPI 4.1.2 runs on a Cray network

Details of the problem

application issue

EDIT: the issues on the IB cluster are now solved thanks to the support team.

On the Cray cluster, using a weak-scaling approach (5.3M unknowns per rank), the MPI_Get time goes from 0.7308 s on 1 node to 17.6264 s on 8 nodes (for the same part of the code).

Similar behavior is observed on the IB cluster (10M unknowns per rank): on a single node the average measured bandwidth is 260-275 Mb/s, while on 8 nodes it drops to 210-220 Mb/s (the theoretical bandwidth is 200 Gb/s).
From a timing perspective, the MPI_Get calls show a more "normal" increase, from 1.0665 s to 1.2820 s.
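
As a rough sanity check (assuming, purely for illustration, a per-operation RMA latency on the order of 2 µs; this is an assumed figure, not a measurement): 64 B / 2 µs ≈ 32 MB/s and 9 kB / 2 µs ≈ 4.5 GB/s, so at these message sizes each individual MPI_Get is latency-bound and cannot approach the 200 Gb/s line rate unless many gets are in flight before the synchronization.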

The numbers above were obtained using MPI_Win_allocate and MPI_Type_create_hvector datatypes. In a previous version of the code using MPI_Win_create, the one-node case used to be as slow as the 8-node one.
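
For reference, this is the window-creation difference being described; a minimal sketch with placeholder sizes and strides (not the application's actual values). MPI_Win_allocate lets the library allocate and register the window memory itself, whereas MPI_Win_create exposes an existing user buffer:

```c
#include <mpi.h>

/* Current version: let MPI allocate (and typically register) the window memory. */
MPI_Win make_win_allocate(MPI_Aint bytes, MPI_Comm comm, void **base)
{
    MPI_Win win;
    MPI_Win_allocate(bytes, 1 /* disp_unit */, MPI_INFO_NULL, comm, base, &win);
    return win;
}

/* Previous version: expose an already-allocated user buffer. */
MPI_Win make_win_create(void *buf, MPI_Aint bytes, MPI_Comm comm)
{
    MPI_Win win;
    MPI_Win_create(buf, bytes, 1 /* disp_unit */, MPI_INFO_NULL, comm, &win);
    return win;
}

/* Strided datatype of the kind mentioned above (illustrative block/stride). */
MPI_Datatype make_strided_type(int count, int blocklen, MPI_Aint stride_bytes)
{
    MPI_Datatype t;
    MPI_Type_create_hvector(count, blocklen, stride_bytes, MPI_BYTE, &t);
    MPI_Type_commit(&t);
    return t;
}
```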

osu benchmarks - IB network

Following the previous comments, I have also run the OSU benchmark osu_get_bw for several numbers of calls per synchronization and different memory allocation strategies (see below). I compare the bandwidth measured between 2 ranks on the same node and on different nodes. Both cases barely reach 25 Gb/s, while the network is supposed to deliver 200 Gb/s.

questions

  • on the Cray network: how can I reduce the performance loss?
  • on the IB network: while the performance seems reasonable, I am confused by the measured bandwidth (both in the OSU benchmark and in the real application). Is there any good reason for the measured bandwidth to be so low?

other related questions:

  • what is the expected influence of MPI_Alloc_mem on performance for IB networks (see the sketch after this list)? Is the gain specific to RMA, or does it benefit all MPI calls?
  • what is the influence of export OMPI_MCA_pml_ucx_multi_send_nb=1? It is set to 0 by default in my configuration.
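
Regarding the MPI_Alloc_mem question, this is the usage being asked about; a minimal sketch (buffer size and lifetime are placeholders). The idea is that MPI_Alloc_mem may return memory that is already registered with the NIC, which can matter for RDMA:

```c
#include <mpi.h>

/* Sketch: allocate a communication buffer through MPI instead of malloc. */
void *alloc_comm_buffer(MPI_Aint bytes)
{
    void *buf = NULL;
    MPI_Alloc_mem(bytes, MPI_INFO_NULL, &buf);
    return buf;          /* released later with MPI_Free_mem(buf) */
}
```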

At this stage it is not clear to me whether there is indeed a performance issue, or whether this is the best the implementation can do.
It may also be that the configuration is not appropriate for the way we use MPI-RMA.

I will be happy to try any suggestion you might have.
Thanks for your help!
