Description
Background information
My application relies on many calls to MPI_Get (a few hundred per synchronization, roughly 200-600) with small messages (roughly 64 bytes to 9 KB).
I observe a very strong performance decrease when going from one node to multiple nodes.
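For reference, here is a minimal sketch of the access pattern (passive-target synchronization is assumed; the function name, buffers, targets and offsets are illustrative, not the actual application code):

```c
#include <mpi.h>

/* Sketch of the communication pattern: a few hundred small MPI_Get calls
 * (64 B - 9 KB each) issued inside a single synchronization epoch.
 * The buffers and displacements are illustrative placeholders. */
void gather_remote_pieces(MPI_Win win, int nget, int target,
                          char *recv_buf, const MPI_Aint *remote_disp,
                          const int *nbytes)
{
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    for (int i = 0; i < nget; ++i) {          /* nget is typically 200-600 */
        MPI_Get(recv_buf + (MPI_Aint)i * 9216, nbytes[i], MPI_BYTE,
                target, remote_disp[i], nbytes[i], MPI_BYTE, win);
    }
    MPI_Win_unlock(target, win);              /* one sync per batch of gets */
}
```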
This issue relates to the comment by @bosilca here
There seems to be a performance issue with the one-sided support in UCX. I used the OSU get_bw benchmark, with all types of synchronizations (fence, flush, lock/unlock) and while there is some variability the performance is consistently at a fraction of the point-to-point performance (about 1/100). Even switching the RMA support over TCP is about 20 times faster (
mpirun -np 2 --map-by node --mca pml ob1 --mca osc rdma --mca btl_tcp_if_include ib0 ../pt2pt/osu_bw
).
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
Open MPI 4.1.4 + UCX 1.12.1, but the issue is similar with Open MPI 4.1.2 with uGNI.
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
- Open MPI 4.1.4: installed with EasyBuild, with UCX 1.12.1 and OFI 1.14.0
- Open MPI 4.1.2: installed by the Cori support team
Please describe the system on which you are running
- Open MPI 4.1.4 runs on InfiniBand HDR 200 Gb/s, with large nodes (128 cores/node)
- Open MPI 4.1.2 runs on a Cray network
Details of the problem
application issue
EDIT: The issues on the IB cluster are now solved thanks to the support team.
On the Cray cluster, using a weak-scaling approach (5.3M unknowns per rank), the time spent in MPI_Get goes from 0.7308 sec on 1 node to 17.6264 sec on 8 nodes (for the same part of the code).
Similar results are observed on the IB cluster (10M unknowns per rank), where on a single node the average measured bandwidth is 260-275 Mb/s while on 8 nodes it drops to 210-220 Mb/s (the theoretical bandwidth is 200 Gb/s).
From a timing perspective, the MPI_Get calls experience a more "normal" increase of the time, from 1.0665 sec to 1.2820 sec.
Those numbers have been obtained using MPI_Win_allocate and MPI_Type_create_hvector datatypes. In a previous version of the code using MPI_Win_create, the one-node case used to be as slow as the 8-node one.
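For clarity, this is roughly how the window and datatype are set up; a sketch with made-up counts, block lengths and strides, not the actual application code:

```c
#include <mpi.h>

/* Sketch: expose memory with MPI_Win_allocate and read strided remote data
 * with an hvector datatype. All sizes here are illustrative. */
void setup_and_get(int target, MPI_Comm comm)
{
    MPI_Aint win_bytes = 1 << 20;
    double *base;
    MPI_Win win;
    MPI_Win_allocate(win_bytes, sizeof(double), MPI_INFO_NULL, comm,
                     &base, &win);

    MPI_Datatype strided;
    MPI_Type_create_hvector(/*count=*/64, /*blocklength=*/16,
                            /*stride=*/128 * sizeof(double),
                            MPI_DOUBLE, &strided);
    MPI_Type_commit(&strided);

    double local[64 * 16];
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    /* one strided get instead of many contiguous ones */
    MPI_Get(local, 64 * 16, MPI_DOUBLE, target, 0, 1, strided, win);
    MPI_Win_unlock(target, win);

    MPI_Type_free(&strided);
    MPI_Win_free(&win);
}
```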
osu benchmarks - IB network
Following previous comments, I have also run the OSU benchmark osu_get_bw for several numbers of calls per synchronization and different memory allocations (see below). I compare the bandwidth measured between 2 ranks on the same node and on different nodes. Both cases barely reach 25 Gb/s while the network is supposed to deliver 200 Gb/s.
questions
- on the Cray network: how can I reduce the performance loss?
- on the IB network: while the performance seems reasonable, I am confused by the measured bandwidth (both OSU and the real-life application). Is there any good reason for the measured bandwidth to be so low?
other related questions:
- what is the expected influence of MPI_Alloc_mem on performance for IB networks? Is the gain specific to RMA, or does it benefit every MPI call? (see the sketch after this list)
- what is the influence of export OMPI_MCA_pml_ucx_multi_send_nb=1? It is set to 0 by default in my configuration.
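To make the first question concrete, here is a hedged sketch of the change I have in mind: passing memory obtained from MPI_Alloc_mem to MPI_Win_create instead of a plain malloc'ed buffer (the size is illustrative):

```c
#include <mpi.h>

/* Sketch for the MPI_Alloc_mem question: register memory obtained from
 * MPI_Alloc_mem with MPI_Win_create rather than a malloc'ed buffer. */
void make_window(MPI_Comm comm, MPI_Win *win)
{
    MPI_Aint bytes = 1 << 20;                      /* illustrative size */
    double *base;
    MPI_Alloc_mem(bytes, MPI_INFO_NULL, &base);    /* may return registered memory */
    MPI_Win_create(base, bytes, sizeof(double), MPI_INFO_NULL, comm, win);
    /* ... use the window ...; later: MPI_Win_free(win); MPI_Free_mem(base); */
}
```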
At this stage it's not clear to me whether there is indeed a performance issue or whether this is the best the implementation can do.
Also, maybe the configuration is not appropriate for the way we use MPI-RMA.
I will be happy to try any suggestion you might have.
Thanks for your help!