Commit ba3ab3c

Alexander Aring authored and teigland committed
fs: dlm: change handling of reconnects
This patch changes the handling of reconnects. First, we now close only the connection related to the communication failure. Second, if we get a new connection for an already existing connection, we close the existing connection and take the new one.

This patch significantly improves the stability of TCP connections while running "tcpkill -9 -i $IFACE port 21064" and generating a lot of dlm messages, e.g. on a gfs2 mount with many files. In my test setup a deadlock is now less likely. Before this patch I always hit a deadlock within 5 seconds. After this patch the connection is more likely to survive 5 seconds and longer, but a deadlock still occurs after a certain time.

My guess is that there are still "segments" inside the TCP write queue or retransmit queue which get dropped when receiving a TCP reset [1]. This is hard to reproduce because the right message needs to be inside these queues, which might even happen within the first 5 seconds with this patch.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/ipv4/tcp_input.c?h=v5.8-rc6#n4122

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
1 parent 0ea47e4 commit ba3ab3c

1 file changed

+10
-15
lines changed

fs/dlm/lowcomms.c

Lines changed: 10 additions & 15 deletions
@@ -713,7 +713,7 @@ static int receive_from_sock(struct connection *con)
 out_close:
 	mutex_unlock(&con->sock_mutex);
 	if (ret != -EAGAIN) {
-		close_connection(con, true, true, false);
+		close_connection(con, false, true, false);
 		/* Reconnect when there is something to send */
 	}
 	/* Don't return success if we really got EOF */
@@ -804,21 +804,16 @@ static int accept_from_sock(struct connection *con)
 			INIT_WORK(&othercon->swork, process_send_sockets);
 			INIT_WORK(&othercon->rwork, process_recv_sockets);
 			set_bit(CF_IS_OTHERCON, &othercon->flags);
+		} else {
+			/* close other sock con if we have something new */
+			close_connection(othercon, false, true, false);
 		}
+
 		mutex_lock_nested(&othercon->sock_mutex, 2);
-		if (!othercon->sock) {
-			newcon->othercon = othercon;
-			add_sock(newsock, othercon);
-			addcon = othercon;
-			mutex_unlock(&othercon->sock_mutex);
-		}
-		else {
-			printk("Extra connection from node %d attempted\n", nodeid);
-			result = -EAGAIN;
-			mutex_unlock(&othercon->sock_mutex);
-			mutex_unlock(&newcon->sock_mutex);
-			goto accept_err;
-		}
+		newcon->othercon = othercon;
+		add_sock(newsock, othercon);
+		addcon = othercon;
+		mutex_unlock(&othercon->sock_mutex);
 	}
 	else {
 		newcon->rx_action = receive_from_sock;
@@ -1415,7 +1410,7 @@ static void send_to_sock(struct connection *con)

 send_error:
 	mutex_unlock(&con->sock_mutex);
-	close_connection(con, true, false, true);
+	close_connection(con, false, false, true);
 	/* Requeue the send work. When the work daemon runs again, it will try
 	   a new connection, then call this function again. */
 	queue_work(send_workqueue, &con->swork);
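
The two behavioural changes in this diff can be modelled in user space. The following is only a sketch, not kernel code: struct toy_con, toy_close(), toy_accept_old() and toy_accept_new() are hypothetical names, sockets are plain integers, and locking is left out. It also assumes that the second argument of close_connection() selects whether the paired "other" connection is closed as well, which is how the commit message describes the change.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the kernel's struct connection (hypothetical). */
struct toy_con {
	int sock;                 /* -1 means "no socket attached" */
	struct toy_con *othercon; /* paired connection, may be NULL */
};

/* Close con; close the paired connection too only if and_other is set. */
static void toy_close(struct toy_con *con, bool and_other)
{
	con->sock = -1;
	if (and_other && con->othercon)
		toy_close(con->othercon, false);
}

/* Old policy: refuse a second incoming socket while one is attached. */
static int toy_accept_old(struct toy_con *othercon, int newsock)
{
	if (othercon->sock != -1)
		return -1; /* models the -EAGAIN "Extra connection" path */
	othercon->sock = newsock;
	return 0;
}

/* New policy: drop the stale socket and adopt the new one. */
static int toy_accept_new(struct toy_con *othercon, int newsock)
{
	if (othercon->sock != -1)
		toy_close(othercon, false);
	othercon->sock = newsock;
	return 0;
}
```

In this model, calling toy_close(&con, false) after an error leaves con.othercon untouched, and toy_accept_new() always ends with the newest socket attached, whereas toy_accept_old() keeps a possibly dead socket and rejects the peer's reconnect attempt.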
