Commit da7e957

Prevent deadlock after an unmanaged error in MPI_SENDRECV
Issue #9160

Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
1 parent 0fd550d commit da7e957

1 file changed: +14 -5 lines changed

ompi/mpi/c/sendrecv.c

Lines changed: 14 additions & 5 deletions
@@ -93,12 +93,21 @@ int MPI_Sendrecv(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
         rc = MCA_PML_CALL(send(sendbuf, sendcount, sendtype, dest,
                                sendtag, MCA_PML_BASE_SEND_STANDARD, comm));
 #if OPAL_ENABLE_FT_MPI
-        /* If ULFM is enabled we need to wait for the posted receive to
-         * complete, hence we cannot return here */
-        rcs = rc;
-#else
+        if (OPAL_UNLIKELY(MPI_ERR_PROC_FAILED == rc)) {
+            /* If this is a recoverable error (e.g., ULFM error class),
+             * we need to wait for the posted receive to complete so that the
+             * receive buffer doesn't get updated after the completion of the call.
+             * Hence we cannot return immediately, we need to wait on the recv
+             * req first. */
+            rcs = rc;
+        }
+        else /* else intentionally spills outside ifdef */
+#endif
+        /* If the error semantic does not garantee the completion of the wait on
+         * the recv-req for that error class, we just invoke the errhandler asap
+         * to avoid hanging. Note that in this case we are returning the recv
+         * buffer in an undefined state and the application may not recover. */
         OMPI_ERRHANDLER_CHECK(rc, comm, rc, FUNC_NAME);
-#endif /* OPAL_ENABLE_FT_MPI */
     }
 
     if (source != MPI_PROC_NULL) { /* wait for recv */
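
The control flow in this hunk depends on an `else` that deliberately crosses the `#endif`. The following is a minimal standalone sketch of that idiom, not the Open MPI code. Every name in it (`SKETCH_ENABLE_FT`, the `SKETCH_*` error codes, `sketch_after_send`) is a hypothetical stand-in for illustration, and the early return on success is an assumption about the surrounding context that the diff does not show.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for OPAL_ENABLE_FT_MPI; not an Open MPI macro. */
#ifndef SKETCH_ENABLE_FT
#define SKETCH_ENABLE_FT 1
#endif

/* Hypothetical error codes standing in for MPI error classes. */
enum {
    SKETCH_SUCCESS = 0,
    SKETCH_ERR_PROC_FAILED = 1,
    SKETCH_ERR_OTHER = 2
};

/* Decide what to do with the return code of the send half.
 * Returns 1 if the error should be reported immediately (the receive
 * is not waited on), 0 if the caller should go on to wait for the
 * receive. A deferred error, if any, is stored in *deferred so it can
 * be reported once the receive has completed. */
int sketch_after_send(int rc, int *deferred)
{
    int report_now = 0;

    assert(deferred != NULL);
    *deferred = SKETCH_SUCCESS;

    if (SKETCH_SUCCESS == rc) {
        return 0; /* nothing to report, proceed to the receive */
    }

#if SKETCH_ENABLE_FT
    if (SKETCH_ERR_PROC_FAILED == rc) {
        /* Recoverable class: remember the error and wait on the
         * receive first, so the buffer is not written after return. */
        *deferred = rc;
    }
    else /* else intentionally spills outside the #if block */
#endif
    {
        /* Any other error (or every error when fault tolerance is
         * compiled out): report right away to avoid hanging. */
        report_now = 1;
    }

    return report_now;
}
```

With `SKETCH_ENABLE_FT` set to 1, only the process-failure code is deferred and every other error is reported at once. With it set to 0, the `if`/`else` disappears and the braced block runs unconditionally, which mirrors how the statement after `#endif` in the diff serves both builds.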
