Commit 415dddb

mtl/ofi: Do not fail if error CQ is empty
In multi-threaded scenarios, any thread that attempts to read a CQ when there's a pending error CQ entry gets an -FI_EAVAIL. Without any serialization here (which is okay, since libfabric will protect access to critical CQ objects), all threads proceed to read from the error CQ, but only one thread fetches the entry while others get -FI_EAGAIN indicating an empty queue, which is not erroneous.

Signed-off-by: Raghu Raja <craghun@amazon.com>
1 parent 992e8f9 commit 415dddb

1 file changed, +11 -0 lines changed

ompi/mca/mtl/ofi/mtl_ofi.h

Lines changed: 11 additions & 0 deletions
@@ -137,6 +137,17 @@ ompi_mtl_ofi_context_progress(int ctxt_id)
                                 &error,
                                 0);
             if (0 > ret) {
+                /*
+                 * In multi-threaded scenarios, any thread that attempts to read
+                 * a CQ when there's a pending error CQ entry gets an
+                 * -FI_EAVAIL. Without any serialization here (which is okay,
+                 * since libfabric will protect access to critical CQ objects),
+                 * all threads proceed to read from the error CQ, but only one
+                 * thread fetches the entry while others get -FI_EAGAIN
+                 * indicating an empty queue, which is not erroneous.
+                 */
+                if (ret == -FI_EAGAIN)
+                    return count;
                 opal_output(0, "%s:%d: Error returned from fi_cq_readerr: %s(%zd).\n"
                                "*** The Open MPI OFI MTL is aborting the MPI job (via exit(3)).\n",
                                __FILE__, __LINE__, fi_strerror(-ret), ret);
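
The control flow in this hunk can be illustrated in isolation with a small mock. This is only a sketch, not the MTL code: `mock_cq`, `mock_cq_readerr`, `handle_eavail`, `MOCK_FI_EAGAIN` and the `progress_result` enum are hypothetical names invented for the example, and the mock only imitates the one property of `fi_cq_readerr` that matters here (a negative `-FI_EAGAIN` return when the error queue is empty). The real code returns `count` or aborts the job instead of returning a status.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for FI_EAGAIN, used only by this mock. */
#define MOCK_FI_EAGAIN EAGAIN

/* Hypothetical mock of an error CQ: a counter of pending error entries. */
struct mock_cq {
    int pending_errors; /* error entries not yet drained */
    int forced_error;   /* if non-zero, every read fails with this errno */
};

/*
 * Imitates the relevant behaviour of fi_cq_readerr (assumption: simplified).
 * Returns 1 when an entry was drained, -MOCK_FI_EAGAIN when the queue is
 * empty, or another negative errno when a failure is forced.
 */
static long mock_cq_readerr(struct mock_cq *cq)
{
    assert(cq != NULL);
    if (cq->forced_error != 0)
        return -(long)cq->forced_error;
    if (cq->pending_errors == 0)
        return -(long)MOCK_FI_EAGAIN;
    cq->pending_errors--;
    return 1;
}

enum progress_result { PROGRESS_OK, PROGRESS_FATAL };

/*
 * Called by a thread whose CQ read reported that an error entry is available.
 * Several threads may get here for the same entry; only one drains it.
 */
static enum progress_result handle_eavail(struct mock_cq *cq, int *handled)
{
    long ret = mock_cq_readerr(cq);

    if (ret < 0) {
        /* Another thread already fetched the entry: not an error. */
        if (ret == -(long)MOCK_FI_EAGAIN)
            return PROGRESS_OK;
        /* Any other failure stays fatal, as in the original code path. */
        return PROGRESS_FATAL;
    }

    *handled += 1; /* this thread owns the error entry and processes it */
    return PROGRESS_OK;
}
```

With one pending entry, the first call to `handle_eavail` drains it and increments `handled`; a second call on the same queue sees an empty queue and returns `PROGRESS_OK` without touching `handled`, which mirrors the late threads described in the commit message.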