Commit 767135c

MTL OFI: Fix Deadlock in fi_cancel given completion during cancel
- If a message for a recv that is being cancelled completes after the call to fi_cancel, the OFI MTL deadlocks: it waits for ofi_req->super.ompi_req->req_status._cancelled to be set, which never happens because the recv finished successfully.
- To resolve this, the OFI MTL now checks ofi_req->req_started inside the loop that waits for the cancel event. If the request is being completed, the loop is broken and the cancel path exits with ofi_req->super.ompi_req->req_status._cancelled = false.

Signed-off-by: Spruit, Neil R <neil.r.spruit@intel.com>
1 parent f3db153 commit 767135c

1 file changed: 3 additions, 0 deletions


ompi/mca/mtl/ofi/mtl_ofi.h

Lines changed: 3 additions & 0 deletions

@@ -1013,8 +1013,11 @@ ompi_mtl_ofi_cancel(struct mca_mtl_base_module_t *mtl,
              */
             while (!ofi_req->super.ompi_req->req_status._cancelled) {
                 opal_progress();
+                if (ofi_req->req_started)
+                    goto ofi_cancel_not_possible;
             }
         } else {
+ofi_cancel_not_possible:
             /**
              * Could not cancel the request.
              */
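
The control flow of this fix can be exercised outside Open MPI with a small mock. The sketch below is not the MTL code: `mock_req`, `progress_fn`, `mock_cancel`, and the two progress helpers are hypothetical names, and it assumes that `req_started` is set once the receive has begun completing, as the commit message describes. It only illustrates why the wait loop needs a second exit condition.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the OFI request: only the two flags the loop reads. */
struct mock_req {
    bool cancelled;   /* models ofi_req->super.ompi_req->req_status._cancelled */
    bool req_started; /* models ofi_req->req_started */
};

/* Hypothetical callback modelling one pass of opal_progress(). */
typedef void (*progress_fn)(struct mock_req *req);

/* Progress pass that delivers the cancel event. */
static void progress_cancel_event(struct mock_req *req)
{
    req->cancelled = true;
}

/* Progress pass that delivers a normal receive completion instead. */
static void progress_completion(struct mock_req *req)
{
    req->req_started = true;
}

/* Returns true if the request was cancelled, false if it completed first. */
static bool mock_cancel(struct mock_req *req, progress_fn progress)
{
    while (!req->cancelled) {
        progress(req);
        if (req->req_started) {
            /* The receive is completing, so the cancel event will never
             * arrive. Without this check the loop would spin forever. */
            req->cancelled = false;
            return false;
        }
    }
    return true;
}
```

Calling `mock_cancel` with `progress_cancel_event` returns `true`, while calling it with `progress_completion` returns `false` and leaves `cancelled` unset. The mock returns early where the actual patch jumps to the `ofi_cancel_not_possible` label with `goto`.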
