Commit d536071

btl/tcp: Skip printing error message in racy cleanup path

Avoid printing an error message about ENOTCONN return codes from getpeername() when handling an incoming connection request. At this point in the receive state machine, the remote process has been verified to be a valid OMPI instance.

In all-to-all startup at 4k rank scale, we're seeing this error message when the remote side drops the connection because it realizes it's the "loser" in the connection race.

We were already doing all the right things, other than printing a scary error message. So skip the error message and call it good.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>

1 parent cf49957 commit d536071

1 file changed, +7 -5 lines changed

opal/mca/btl/tcp/btl_tcp_component.c

Lines changed: 7 additions & 5 deletions
@@ -1515,11 +1515,13 @@ static void mca_btl_tcp_component_recv_handler(int sd, short flags, void* user)
 
     /* lookup peer address */
     if(getpeername(sd, (struct sockaddr*)&addr, &addr_len) != 0) {
-        opal_show_help("help-mpi-btl-tcp.txt",
-                       "server getpeername failed",
-                       true, opal_process_info.nodename,
-                       getpid(),
-                       strerror(opal_socket_errno), opal_socket_errno);
+        if (ENOTCONN != opal_socket_errno) {
+            opal_show_help("help-mpi-btl-tcp.txt",
+                           "server getpeername failed",
+                           true, opal_process_info.nodename,
+                           getpid(),
+                           strerror(opal_socket_errno), opal_socket_errno);
+        }
         CLOSE_THE_SOCKET(sd);
         return;
     }
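
The pattern in this change can be tried in isolation. What follows is a minimal sketch, not code from Open MPI: the helper names `peer_lookup_errno`, `should_report_peer_error`, and `unconnected_socket_errno` are hypothetical, and plain `errno` stands in for `opal_socket_errno`. It also assumes that a never-connected TCP socket is a fair stand-in for a socket whose peer has already dropped the connection, since POSIX specifies ENOTCONN for getpeername() on a socket that is not connected.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 if getpeername() succeeds,
 * otherwise the errno value it set. */
static int peer_lookup_errno(int sd)
{
    struct sockaddr_storage addr;
    socklen_t addr_len = sizeof(addr);

    if (getpeername(sd, (struct sockaddr *)&addr, &addr_len) != 0) {
        return errno;
    }
    return 0;
}

/* Hypothetical helper mirroring the check added by the commit:
 * ENOTCONN is treated as an expected outcome of the connection race,
 * so only other failures are worth reporting to the user. */
static bool should_report_peer_error(int err)
{
    return err != 0 && err != ENOTCONN;
}

/* Assumption: a TCP socket that was never connected behaves like one
 * whose peer already went away, as far as getpeername() is concerned. */
static int unconnected_socket_errno(void)
{
    int sd = socket(AF_INET, SOCK_STREAM, 0);
    assert(sd >= 0);

    int err = peer_lookup_errno(sd);
    close(sd); /* the socket is closed on every path, as in the diff */
    return err;
}
```

On a POSIX system, `unconnected_socket_errno()` is expected to return `ENOTCONN`, for which `should_report_peer_error()` returns false. An invalid descriptor, by contrast, should produce `EBADF`, which would still be reported. In both cases the socket is cleaned up, which matches the commit's point that only the message is skipped.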
