Description
Greetings,
I initially interpreted the issue below as a problem within the Intel MPI implementation, but after posting on their community forums, Intel confirmed that this is allowable behaviour under their interpretation. Either I'm misinterpreting the specification (and thus doing something undefined), Intel is wrong, or the specification is ambiguous.
Problem
Somewhat recently, I was flummoxed by a deadlock in MPI code that used passive-target synchronization. A local process would spin-wait on a variable in a shared-memory window (using local loads and MPI_Win_sync), and a remote process would (eventually) update that variable with MPI_Fetch_and_op.
The expected result was… well, progress. In fact, the Intel MPI implementation would reliably deadlock when the RMA operation involved communication over an interconnect (i.e. something more than a local shared memory access). Sample code is as follows:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char ** argv) {
    MPI_Init(&argc, &argv); // Initialize MPI
    int rank, nproc;
    // Get MPI rank and world size
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    int * rma_memory; // RMA memory (to be allocated)
    MPI_Win rma_window;
    MPI_Win_allocate(sizeof(int), 1, MPI_INFO_NULL, MPI_COMM_WORLD, &rma_memory, &rma_window);

    // Get and display memory model for window
    int *memory_model, flag;
    MPI_Win_get_attr(rma_window, MPI_WIN_MODEL, &memory_model, &flag);
    if (*memory_model == MPI_WIN_UNIFIED) {
        printf("Rank %d created RMA window with the unified memory model\n", rank);
    } else if (*memory_model == MPI_WIN_SEPARATE) {
        printf("Rank %d created RMA window with the separate memory model\n", rank);
    } else {
        printf("Rank %d created RMA window with an unknown memory model(???)\n", rank);
    }

    *rma_memory = 0; // Initialize to zero

    // All processes will lock the window
    MPI_Win_lock_all(MPI_MODE_NOCHECK, rma_window);

    if (rank == 0) {
        // Rank 0: wait for rank 1 to enter its spinlock, then use MPI_Fetch_and_op to increment
        // *rma_memory at rank 1

        // Receive zero-byte message indicating that rank 1 is ready to enter its spinlock
        MPI_Recv(0, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        // Wait a further 0.1s so that rank 1 should have assuredly completed any progress-making
        // MPI calls
        double tic = MPI_Wtime();
        while (MPI_Wtime() - tic < 0.1);
        tic = MPI_Wtime(); // Reset tic value to account for delay

        // Perform fetch-and-op
        int one = 1;
        int result = -1;
        MPI_Fetch_and_op(&one, &result, MPI_INT, 1, 0, MPI_SUM, rma_window);

        // Flush the window to ensure completion
        MPI_Win_flush_all(rma_window);
        printf("Rank 0: sent %d, received %d (should be 0) in %.2fms\n", one, result, (MPI_Wtime() - tic)*1e3);
    } else if (rank == 1) {
        // Rank 1: Send a message to rank 0 indicating readiness for Fetch_and_op
        MPI_Send(0, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        double tic = MPI_Wtime();

        // Spinlock waiting for '1' to be written to the RMA window
        while (*rma_memory != 1 && MPI_Wtime() - tic < 5) {
            // Separate memory model: synchronize remote and local copies of window
            // Unified memory model: memory barrier
            MPI_Win_sync(rma_window);
        }
        int old_value = *rma_memory;
        printf("Rank 1: Memory value %d (should be 1) in %.2fms\n", old_value, 1e3*(MPI_Wtime()-tic-0.1));

        // Demonstrate forced progress
        MPI_Win_flush(1, rma_window); // Should be a no-op, since there are no pending RMA ops from this rank
        MPI_Win_sync(rma_window);
        if (old_value != *rma_memory) {
            printf("Rank 1: After flush, memory value is now %d (should be 1)\n", *rma_memory);
        }
    }

    MPI_Win_unlock_all(rma_window);
    MPI_Win_free(&rma_window);
    MPI_Finalize();
    return 0;
}
The problem is visible even on a single node of a cluster when the shared-memory interconnect is disabled:
$ mpirun -genv 'I_MPI_SHM=off' -np 2 ./a.out
Rank 0 created RMA window with the unified memory model
Rank 1 created RMA window with the unified memory model
Rank 1: Memory value 0 (should be 1) in 4900.00ms
Rank 1: After flush, memory value is now 1 (should be 1)
Rank 0: sent 1, received 0 (should be 0) in 4900.14ms
The root problem appears to be that rank 0 is waiting on the assistance of rank 1 to complete the Fetch part of Fetch_and_op, but the MPI_Win_sync inside the spinlock on rank 1 does not engage the MPI progress engine.
I think this behaviour is surprising, if not spec-noncompliant. Per §12.7 of the MPI 4.1 draft (atomicity isn't the problem here):
U2. Accessing a location in the window that is also the target of a remote update is valid (not erroneous) but the precise result will depend on the behavior of the implementation. Updates from an origin process will appear in the memory of the target, but there are no atomicity or ordering guarantees if more than one byte is updated. Updates are stable in the sense that once data appears in memory of the target, the data remains until replaced by another update. This permits polling on a location for a change from zero to nonzero or for a particular value, but not polling and comparing the relative magnitude of values.
Replacing the Fetch_and_op call on rank 0 with separate MPI_Get and MPI_Put calls does work without deadlock, even though its correctness is ambiguous (I'm not sure about combining a Get and a Put to the same RMA target) and it is outright erroneous in the general case of multiple writers.
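For concreteness, the replacement on rank 0 would look roughly like the sketch below. This reuses the window and handshake from the sample above and relies on rank 0 being the only writer; the exact ordering (flush between the Get and the Put) is my choice for the sketch, not something the standard mandates.

// Sketch: stands in for the MPI_Fetch_and_op call on rank 0 in the sample above.
// Only plausible here because rank 0 is the sole writer; it loses the atomicity
// of MPI_Fetch_and_op and is erroneous with multiple concurrent updaters.
int one = 1;
int result = -1;
MPI_Get(&result, 1, MPI_INT, 1, 0, 1, MPI_INT, rma_window); // read the old value at rank 1
MPI_Win_flush(1, rma_window);                               // complete the Get before using 'result'
MPI_Put(&one, 1, MPI_INT, 1, 0, 1, MPI_INT, rma_window);    // write the new value
MPI_Win_flush(1, rma_window);                               // complete the Put at the target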
Proposal
Prior to the MPI 4.1 draft, I would have asked that the call to Win_sync engage the progress engine even in the unified memory model, but that is now explicitly not required (p. 608). I'm not sure what the required change now is, if this deadlock is not in fact an implementation bug.
The trouble seems to be twofold:
- It's not obvious that the target of some RMA memory operations like Fetch_and_op must actively participate (via progress), even if the window uses passive-target synchronization, and
- There's no obvious call to ensure such progress.
  - Win_flush works in Intel MPI (as sketched after this list), but it's an unnatural fit because the target rank in general may have no idea which other rank is executing the atomic update. Additionally, the flush calls might implicitly cause communication even when there's nothing to do, which is unnecessary here because the target rank certainly 'knows' whether it needs to do something with respect to an atomic RMA operation.
  - In the Intel MPI implementation, other calls that might seemingly force progress don't. For example, a call to MPI_Test with MPI_REQUEST_NULL doesn't force progress, and rank 1 also can't force progress by sending a zero-length message to itself.
  - OpenMPI gave me odd results on my testing system, but I can't guarantee that it was configured correctly.
  - In general, progress is guaranteed "while blocked on an MPI call," but with passive-target synchronization there's no obviously correct, minimal way to become blocked on an MPI call.
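For reference, the flush-based workaround mentioned above amounts to folding rank 1's "forced progress" sequence into the spin loop itself, roughly as sketched below. The flush is semantically a no-op (rank 1 has no pending RMA operations), and using flush_all rather than a specific target rank is my choice to sidestep the "which rank is updating me?" question.

// Sketch of the workaround on rank 1: drive the progress engine on every iteration.
while (*rma_memory != 1) {
    MPI_Win_flush_all(rma_window); // no pending local RMA ops, but it engages progress in Intel MPI
    MPI_Win_sync(rma_window);      // memory barrier so the local load can observe the update
}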
Changes to the Text
Again presuming this behaviour is intended or allowable:
- Document which calls might require the target's active participation, even in a passive-target epoch, and
- (More controversially?) Add MPI_Engage_progress() to the API as the minimal call that tells the MPI implementation to make progress on any outstanding requests, with no further semantic intent.
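MPI_Engage_progress is of course hypothetical; the name and no-argument signature are placeholders. The intended usage in rank 1's spin loop would be something like:

// Hypothetical API: MPI_Engage_progress() does not exist in any standard or implementation;
// it stands in for a minimal "make progress on outstanding requests" call.
while (*rma_memory != 1) {
    MPI_Engage_progress();    // let the implementation service the incoming Fetch_and_op
    MPI_Win_sync(rma_window); // memory barrier, as already required today
}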
Impact on Implementations
Implementations will likely complain about any specification-guaranteed background progress. In the Intel forum thread linked above, the Intel representative closed the issue (then presented as a question about progress during Win_sync) on the grounds that Win_sync is simply a memory barrier.
Impact on Users
At minimum, black-and-white documentation about participation requirements would have saved me, a user, a few headaches trying to distill the unexpected deadlock into a minimal example. (The original case involved a Fetch_and_op to one location, followed by an MPI_Put of a status flag to a second; the latter was spin-waited upon, blocking completion of the original Fetch_and_op.)
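For concreteness, the original (pre-minimized) pattern was shaped roughly like the sketch below; the variable names, offsets, and 'target' rank are illustrative, not the actual code.

// Rough sketch of the original pattern (illustrative names and offsets).
// Origin rank: atomically update a counter, then publish a status flag.
int one = 1, old_count = -1, ready = 1;
MPI_Fetch_and_op(&one, &old_count, MPI_INT, target, 0 /* counter */, MPI_SUM, rma_window);
MPI_Put(&ready, 1, MPI_INT, target, 1 /* flag */, 1, MPI_INT, rma_window);
MPI_Win_flush(target, rma_window); // completing the Fetch_and_op needs the target's progress

// Target rank: spin-wait on the flag with MPI_Win_sync only, which (as above)
// never drives the progress needed to complete the origin's Fetch_and_op.
while (rma_memory[1] != 1) {
    MPI_Win_sync(rma_window);
}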