Description
Greetings,
I initially interpreted the issue below as a problem within the Intel MPI implementation, but after posting on their community forums, Intel confirmed that this is allowable behaviour under their interpretation. Either I'm misinterpreting the specification (and thus doing something undefined), Intel is wrong, or the specification is ambiguous.
Problem
Somewhat recently, I was flummoxed by a deadlock in MPI code that used passive-target synchronization. A local process would spin-wait on a variable in a shared-memory window (using local loads and MPI_Win_sync), and a remote process would (eventually) update that variable with MPI_Fetch_and_op.
The expected result was… well, progress. In fact, the Intel MPI implementation would reliably deadlock when the RMA operation involved communication over an interconnect (i.e. something more than a local shared memory access). Sample code is as follows:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char ** argv) {
    MPI_Init(&argc, &argv); // Initialize MPI
    int rank, nproc;
    // Get MPI rank and world size
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    int * rma_memory; // RMA memory (to be allocated)
    MPI_Win rma_window;
    MPI_Win_allocate(sizeof(int), 1, MPI_INFO_NULL, MPI_COMM_WORLD, &rma_memory, &rma_window);

    // Get and display memory model for window
    int *memory_model, flag;
    MPI_Win_get_attr(rma_window, MPI_WIN_MODEL, &memory_model, &flag);
    if (*memory_model == MPI_WIN_UNIFIED) {
        printf("Rank %d created RMA window with the unified memory model\n", rank);
    } else if (*memory_model == MPI_WIN_SEPARATE) {
        printf("Rank %d created RMA window with the separate memory model\n", rank);
    } else {
        printf("Rank %d created RMA window with an unknown memory model(???)\n", rank);
    }

    *rma_memory = 0; // Initialize to zero

    // All processes will lock the window
    MPI_Win_lock_all(MPI_MODE_NOCHECK, rma_window);

    if (rank == 0) {
        // Rank 0: wait for rank 1 to enter its spinlock, then use MPI_Fetch_and_op to increment
        // *rma_memory at rank 1

        // Receive zero-byte message indicating that rank 1 is ready to enter its spinlock
        MPI_Recv(0, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        // Wait a further 0.1s so that rank 1 should have assuredly completed any progress-making
        // MPI calls
        double tic = MPI_Wtime();
        while (MPI_Wtime() - tic < 0.1);
        tic = MPI_Wtime(); // Reset tic value to account for delay

        // Perform fetch-and-op
        int one = 1;
        int result = -1;
        MPI_Fetch_and_op(&one, &result, MPI_INT, 1, 0, MPI_SUM, rma_window);

        // Flush the window to ensure completion
        MPI_Win_flush_all(rma_window);
        printf("Rank 0: sent %d, received %d (should be 0) in %.2fms\n", one, result, (MPI_Wtime() - tic)*1e3);
    } else if (rank == 1) {
        // Rank 1: Send a message to rank 0 indicating readiness for Fetch_and_op
        MPI_Send(0, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        double tic = MPI_Wtime();

        // Spinlock waiting for '1' to be written to the RMA window
        while (*rma_memory != 1 && MPI_Wtime() - tic < 5) {
            // Separate memory model: synchronize remote and local copies of window
            // Unified memory model: memory barrier
            MPI_Win_sync(rma_window);
        }
        int old_value = *rma_memory;
        printf("Rank 1: Memory value %d (should be 1) in %.2fms\n", old_value, 1e3*(MPI_Wtime()-tic-0.1));

        // Demonstrate forced progress
        MPI_Win_flush(1, rma_window); // Should be a no-op, since there are no pending RMA ops from this rank
        MPI_Win_sync(rma_window);
        if (old_value != *rma_memory) {
            printf("Rank 1: After flush, memory value is now %d (should be 1)\n", *rma_memory);
        }
    }

    MPI_Win_unlock_all(rma_window);
    MPI_Win_free(&rma_window);
    MPI_Finalize();
    return 0;
}
The problem is visible even on a single node of a cluster when the shared-memory interconnect is disabled:
$ mpirun -genv 'I_MPI_SHM=off' -np 2 ./a.out
Rank 0 created RMA window with the unified memory model
Rank 1 created RMA window with the unified memory model
Rank 1: Memory value 0 (should be 1) in 4900.00ms
Rank 1: After flush, memory value is now 1 (should be 1)
Rank 0: sent 1, received 0 (should be 0) in 4900.14ms
The root problem appears to be that rank 0 is waiting on the assistance of rank 1 to complete the Fetch part of Fetch_and_op, but the MPI_Win_sync inside the spinlock on rank 1 does not engage the MPI progress engine.
I think this behaviour is surprising, if not spec-noncompliant. Per §12.7 of the MPI 4.1 draft (atomicity isn't the problem here):
U2. Accessing a location in the window that is also the target of a remote update is valid (not erroneous) but the precise result will depend on the behavior of the implementation. Updates from an origin process will appear in the memory of the target, but there are no atomicity or ordering guarantees if more than one byte is updated. Updates are stable in the sense that once data appears in memory of the target, the data remains until replaced by another update. This permits polling on a location for a change from zero to nonzero or for a particular value, but not polling and comparing the relative magnitude of values.
Replacing the Fetch_and_op call on rank 0 with separate MPI_Get and MPI_Put calls does work without deadlock, even though its correctness is ambiguous (I'm not sure about combining a Get and a Put to the same RMA target) and it is outright erroneous in the general case of multiple writers.
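For concreteness, the replacement on rank 0 would look roughly like the sketch below. This reuses the window and handshake from the sample above and relies on rank 0 being the only writer; the exact ordering (flush between the Get and the Put) is my choice for the sketch, not something the standard mandates.

// Sketch: stands in for the MPI_Fetch_and_op call on rank 0 in the sample above.
// Only plausible here because rank 0 is the sole writer; it loses the atomicity
// of MPI_Fetch_and_op and is erroneous with multiple concurrent updaters.
int one = 1;
int result = -1;
MPI_Get(&result, 1, MPI_INT, 1, 0, 1, MPI_INT, rma_window); // read the old value at rank 1
MPI_Win_flush(1, rma_window);                               // complete the Get before using 'result'
MPI_Put(&one, 1, MPI_INT, 1, 0, 1, MPI_INT, rma_window);    // write the new value
MPI_Win_flush(1, rma_window);                               // complete the Put at the target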
Proposal
Prior to the MPI 4.1 draft, I would have asked that the call to Win_sync engage the progress engine even in the unified memory model, but that is now explicitly not required (p. 608). I'm not sure what the required change now is, if this deadlock is not in fact an implementation bug.
The trouble seems to be twofold:
- It's not obvious that the target of some RMA memory operations like Fetch_and_op must actively participate (via progress), even if the window uses passive-target synchronization, and
- There's no obvious call to ensure such progress.
  - Win_flush works in Intel MPI (as sketched after this list), but it's an unnatural fit because the target rank in general may have no idea which other rank is executing the atomic update. Additionally, the flush calls might implicitly cause communication even when there's nothing to do, which is unnecessary here because the target rank certainly 'knows' whether it needs to do something with respect to an atomic RMA operation.
  - In the Intel MPI implementation, other calls that might seemingly force progress don't. For example, a call to MPI_Test with MPI_REQUEST_NULL doesn't force progress, and rank 1 also can't force progress by sending a zero-length message to itself.
  - OpenMPI gave me odd results on my testing system, but I can't guarantee that it was configured correctly.
  - In general, progress is guaranteed "while blocked on an MPI call," but with passive-target synchronization there's no obviously correct, minimal way to become blocked on an MPI call.
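For reference, the flush-based workaround mentioned above amounts to folding rank 1's "forced progress" sequence into the spin loop itself, roughly as sketched below. The flush is semantically a no-op (rank 1 has no pending RMA operations), and using flush_all rather than a specific target rank is my choice to sidestep the "which rank is updating me?" question.

// Sketch of the workaround on rank 1: drive the progress engine on every iteration.
while (*rma_memory != 1) {
    MPI_Win_flush_all(rma_window); // no pending local RMA ops, but it engages progress in Intel MPI
    MPI_Win_sync(rma_window);      // memory barrier so the local load can observe the update
}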
Changes to the Text
Again presuming this behaviour is intended or allowable:
- Document which calls might require the target's active participation, even in a passive-target epoch, and
- (More controversially?) Add MPI_Engage_progress() to the API as the minimal call that tells the MPI implementation to make progress on any outstanding requests, with no further semantic intent.
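MPI_Engage_progress is of course hypothetical; the name and no-argument signature are placeholders. The intended usage in rank 1's spin loop would be something like:

// Hypothetical API: MPI_Engage_progress() does not exist in any standard or implementation;
// it stands in for a minimal "make progress on outstanding requests" call.
while (*rma_memory != 1) {
    MPI_Engage_progress();    // let the implementation service the incoming Fetch_and_op
    MPI_Win_sync(rma_window); // memory barrier, as already required today
}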
Impact on Implementations
Implementations will likely complain about any specification-guaranteed background progress. In the Intel forum thread linked above, the Intel representative closed the issue (then presented as a question about progress during Win_sync) on the grounds that Win_sync is simply a memory barrier.
Impact on Users
At minimum, black-and-white documentation about participation requirements would have saved me, a user, a few headaches trying to distill the unexpected deadlock into a minimal example. (The original case involved a Fetch_and_op to one location, followed by an MPI_Put of a status flag to a second; the latter was spin-waited upon, blocking completion of the original Fetch_and_op.)
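For concreteness, the original (pre-minimized) pattern was shaped roughly like the sketch below; the variable names, offsets, and 'target' rank are illustrative, not the actual code.

// Rough sketch of the original pattern (illustrative names and offsets).
// Origin rank: atomically update a counter, then publish a status flag.
int one = 1, old_count = -1, ready = 1;
MPI_Fetch_and_op(&one, &old_count, MPI_INT, target, 0 /* counter */, MPI_SUM, rma_window);
MPI_Put(&ready, 1, MPI_INT, target, 1 /* flag */, 1, MPI_INT, rma_window);
MPI_Win_flush(target, rma_window); // completing the Fetch_and_op needs the target's progress

// Target rank: spin-wait on the flag with MPI_Win_sync only, which (as above)
// never drives the progress needed to complete the origin's Fetch_and_op.
while (rma_memory[1] != 1) {
    MPI_Win_sync(rma_window);
}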