
MPI 3.1/4 – how can progress be ensured with passive-target synchronization? #28

@csubich

Description

Greetings,

I initially interpreted the issue below as a problem within the Intel MPI implementation, but after I posted on their community forums [1], Intel confirmed that this behaviour is allowable per their interpretation. Either I'm misinterpreting the specification (and thus doing something undefined), Intel is wrong, or the specification is ambiguous.

Problem

Somewhat recently, I was flummoxed by a deadlock in MPI code that used passive-target synchronization. A local process would spin-wait on a variable in a shared-memory window (using local loads and MPI_Win_sync), and a remote process would (eventually) update that variable with MPI_Fetch_and_op.

The expected result was… well, progress. Instead, the Intel MPI implementation would reliably deadlock whenever the RMA operation involved communication over an interconnect (i.e. anything more than a local shared-memory access). Sample code is as follows:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char ** argv) {
    MPI_Init(&argc, &argv); // Initialize MPI
    int rank, nproc;
    // Get MPI rank and world size
    MPI_Comm_rank(MPI_COMM_WORLD,&rank);
    MPI_Comm_size(MPI_COMM_WORLD,&nproc);

    int * rma_memory; // RMA memory (to be allocated)
    MPI_Win rma_window;
    MPI_Win_allocate(sizeof(int),1,MPI_INFO_NULL,MPI_COMM_WORLD,&rma_memory,&rma_window);

    // Get and display memory model for window
    int *memory_model, flag;
    MPI_Win_get_attr(rma_window, MPI_WIN_MODEL, &memory_model, &flag);
    if (*memory_model == MPI_WIN_UNIFIED) {
        printf("Rank %d created RMA window with the unified memory model\n",rank);
    } else if (*memory_model == MPI_WIN_SEPARATE) {
        printf("Rank %d created RMA window with the separate memory model\n",rank);
    } else {
        printf("Rank %d created RMA window with an unknown memory model(???)\n",rank);
    }

    *rma_memory = 0; // Initialize to zero

    // All processes will lock the window
    MPI_Win_lock_all(MPI_MODE_NOCHECK,rma_window);

    if (rank == 0) { 
        // Rank 0: wait for rank 1 to enter its spinlock, then use MPI_Fetch_and_op to increment
        // *rma_memory at rank 1

        // Receive zero-byte message indicating that rank 1 is ready to enter its spinlock
        MPI_Recv(0,0,MPI_BYTE,1,0,MPI_COMM_WORLD,MPI_STATUS_IGNORE);

        // Wait a further 0.1s so that rank 1 should have assuredly completed any progress-making
        // MPI calls
        double tic = MPI_Wtime(); 
        while (MPI_Wtime() - tic < 0.1); 

        tic = MPI_Wtime(); // Reset tic value to account for delay

        // Perform fetch-and-op
        int one = 1; 
        int result = -1;
        MPI_Fetch_and_op(&one, &result, MPI_INT, 1, 0, MPI_SUM, rma_window);

        // Flush the window to ensure completion
        MPI_Win_flush_all(rma_window); 

        printf("Rank 0: sent %d, received %d (should be 0) in %.2fms\n",one, result, (MPI_Wtime() - tic)*1e3);
    } else if (rank == 1) {
        // Rank 1: Send a message to rank 0 indicating readiness for Fetch_and_op
        MPI_Send(0,0,MPI_BYTE,0,0,MPI_COMM_WORLD);

        double tic = MPI_Wtime();

        // Spinlock waiting for '1' to be written to the RMA_Window
        while (*rma_memory != 1 && MPI_Wtime() - tic < 5) {
            // Separate memory model: synchronize remote and local copies of window
            // Unified memory model: memory barrier
            MPI_Win_sync(rma_window);
        }
        int old_value = *rma_memory;
        printf("Rank 1: Memory value %d (should be 1) in %.2fms\n",old_value,1e3*(MPI_Wtime()-tic-0.1));

        // Demonstrate forced progress
        MPI_Win_flush(1,rma_window); // Should be a no-op, since there are no pending RMA ops from this rank
        MPI_Win_sync(rma_window);
        if (old_value != *rma_memory) {
            printf("Rank 1: After flush, memory value is now %d (should be 1)\n",*rma_memory);
        }
    }

    MPI_Win_unlock_all(rma_window);
    MPI_Win_free(&rma_window);
    MPI_Finalize();
    return 0;
}

The problem is visible even on a single node of a cluster when the shared-memory interconnect is disabled:

$ mpirun -genv 'I_MPI_SHM=off' -np 2 ./a.out
Rank 0 created RMA window with the unified memory model
Rank 1 created RMA window with the unified memory model
Rank 1: Memory value 0 (should be 1) in 4900.00ms
Rank 1: After flush, memory value is now 1 (should be 1)
Rank 0: sent 1, received 0 (should be 0) in 4900.14ms

The root problem appears to be that rank 0 is waiting on the assistance of rank 1 to complete the Fetch part of Fetch_and_op, but the MPI_Win_sync inside the spinlock on rank 1 does not engage the MPI progress engine.
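
For illustration, here is a sketch of rank 1's spin loop with a flush folded in. The flush target is the caller's own rank, mirroring the forced-progress step at the end of the sample program; this works under Intel MPI but is not obviously guaranteed to by the specification:

// Illustrative workaround only: spin-wait that also flushes the window, giving
// the implementation a chance to service the pending Fetch_and_op.
double tic = MPI_Wtime();
while (*rma_memory != 1 && MPI_Wtime() - tic < 5) {
    MPI_Win_flush(1, rma_window); // no local RMA ops pending, but drives progress in Intel MPI
    MPI_Win_sync(rma_window);     // memory barrier before re-reading the flag
}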

I think this behaviour is surprising, if not spec-noncompliant. Per §12.7 of the MPI 4.1 draft (atomicity isn't the problem here):

U2. Accessing a location in the window that is also the target of a remote update is valid (not erroneous) but the precise result will depend on the behavior of the implementation. Updates from an origin process will appear in the memory of the target, but there are no atomicity or ordering guarantees if more than one byte is updated. Updates are stable in the sense that once data appears in memory of the target, the data remains until replaced by another update. This permits polling on a location for a change from zero to nonzero or for a particular value, but not polling and comparing the relative magnitude of values.

Replacing the Fetch_and_op call on rank 0 with separate MPI_Get and MPI_Put calls does function properly, without deadlock, even though its correctness is ambiguous (I'm not sure about combining a Get and a Put on the same RMA target) and it is absolutely erroneous in the general case of multiple writers.
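
Roughly, that replacement on rank 0 looks like the sketch below (the displacement and flush placement are illustrative, not copied from my original code); note again that this non-atomic read-modify-write is only tolerable with a single writer:

// Non-atomic replacement for MPI_Fetch_and_op on rank 0 (single-writer case only)
int result = -1;
MPI_Get(&result, 1, MPI_INT, 1, 0, 1, MPI_INT, rma_window);
MPI_Win_flush(1, rma_window); // complete the Get before using 'result'

int updated = result + 1;
MPI_Put(&updated, 1, MPI_INT, 1, 0, 1, MPI_INT, rma_window);
MPI_Win_flush(1, rma_window); // complete the Put at the target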

Proposal

Prior to the MPI 4.1 draft, I would have asked that Win_sync engage the progress engine even in the unified memory model, but that is now explicitly not required (p. 608). If this deadlock is not in fact an implementation bug, I'm not sure what change is now required.

The trouble seems to be twofold:

  • It's not obvious that the target of some RMA memory operations like Fetch_and_op must actively participate (via progress), even if the window has passive-target synchronization, and
  • There's no obvious call to ensure such progress. Win_flush works in Intel MPI, but it's an unnatural fit because the target rank in general may have no idea what other rank is executing the atomic update. Additionally, the flush calls might implicitly cause communication even if there's nothing to do, and that's unnecessary in this case where the target rank certainly 'knows' if it needs to do something with respect to an atomic RMA operation.
    • In the Intel MPI implementation, other calls that might seemingly force progress don't (see the sketch after this list). For example, a call to MPI_Test with MPI_REQUEST_NULL doesn't force progress, and rank 1 also can't force progress by sending a zero-length message to itself.
    • OpenMPI gave me odd results on my testing system, but I can't guarantee that it was configured correctly.
    • In general, progress is guaranteed "while blocked on an MPI call," but with passive-target synchronization there's no obviously correct, minimal way to become blocked on an MPI call.
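
For concreteness, here is a sketch of the progress-forcing attempts mentioned above (the zero-length self-message is shown as a combined send/receive so the send itself cannot block; the exact form is illustrative):

// Calls one might expect to drive the progress engine from inside rank 1's
// spin loop, but which did not help under Intel MPI:
int flag;
MPI_Request null_req = MPI_REQUEST_NULL;
MPI_Test(&null_req, &flag, MPI_STATUS_IGNORE); // returns immediately; no progress forced

// Zero-length message to self
MPI_Sendrecv(NULL, 0, MPI_BYTE, rank, 99, NULL, 0, MPI_BYTE, rank, 99,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);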

Changes to the Text

Again presuming this behaviour is intended or allowable:

  • Document which calls might require the target's active participation, even in a passive-target epoch
  • (More controversially?) Add MPI_Engage_progress() to the API as the minimal call that tells the MPI implementation to make progress on any outstanding requests, with no further semantic intent (a usage sketch follows below).
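
Purely for illustration, the proposed call (which does not exist in any MPI standard or implementation today) might slot into rank 1's spin loop like this:

// Hypothetical: MPI_Engage_progress() is the proposed call, not an existing API.
int MPI_Engage_progress(void); // proposed prototype, shown only for illustration

while (*rma_memory != 1) {
    MPI_Engage_progress();    // let the implementation progress outstanding work
    MPI_Win_sync(rma_window); // memory barrier before re-reading the flag
}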

Impact on Implementations

Implementations will likely complain about any specification-guaranteed background progress. In the Intel forum thread linked above, the Intel representative closed the issue (presented there as a question about progress during Win_sync) on the grounds that Win_sync is simply a memory barrier.

Impact on Users

At minimum, black-and-white documentation about participation requirements would have saved me, a user, a few headaches trying to distill the unexpected deadlock into a minimal example. (The original case involved a Fetch_and_op to one location, then an MPI_Put of a status flag to a second; the latter was spin-waited upon, blocking completion of the original Fetch_and_op.)
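
For context, a condensed sketch of that original pattern (names, displacements, and counts are illustrative, not taken from the original code):

// Origin rank (illustrative): atomic update at displacement 0, then a Put that
// publishes a status flag at displacement 1 in the same window.
int one = 1, old = -1, done = 1;
MPI_Fetch_and_op(&one, &old, MPI_INT, target, 0, MPI_SUM, win);
MPI_Put(&done, 1, MPI_INT, target, 1, 1, MPI_INT, win);
MPI_Win_flush(target, win);

// Target rank (illustrative): spin-wait on the status flag with no other MPI
// activity, which starves the pending Fetch_and_op of the progress it needs.
while (win_base[1] != 1) {
    MPI_Win_sync(win);
}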

References and pull requests:

Footnotes

  1. https://community.intel.com/t5/Intel-oneAPI-HPC-Toolkit/Intel-MPI-fails-to-ensure-progress-for-one-sided-operations/m-p/1519581
