Skip to content

[BUG] TCP keep alive sockets not working #16447

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
1 task done
linguini1 opened this issue May 26, 2025 · 9 comments · Fixed by #16456
Closed
1 task done

[BUG] TCP keep alive sockets not working #16447

linguini1 opened this issue May 26, 2025 · 9 comments · Fixed by #16456
Labels
Arch: arm Issues related to ARM (32-bit) architecture Area: Networking Effects networking subsystem OS: Linux Issues related to Linux (building system, etc) Type: Bug Something isn't working

Comments

@linguini1
Copy link
Contributor

Description / Steps to reproduce the issue

I am having some issues with TCP keep-alive on the W5500-EVB pico. Namely, all the keep-alive socket options can be set nominally (no error from setsockopt), but the connection does not reset after the keep-alive time. I am testing this by setting up a connection successfully between two W5500-EVB Picos, and then turning off the client EVB Pico and waiting for the timeout to occur. The timeout never happens (blocking recv should return ECONNRESET as it does on Linux, or at least some error code).

For testing, I am using 2 probes with an interval of 5 seconds.

This is the code I am using to setup the TCP socket:

static int controller_init(controller_t *controller, uint16_t port) {
    int err;

    /* Initialize the socket connection. */
    controller->sock = socket(AF_INET, SOCK_STREAM, 0);
    if (controller->sock < 0) return errno;

    int opt = 1;
    err = setsockopt(controller->sock, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt));
    if (err < 0) {
        herr("Failed to set option SO_REUSEADDR: %d\n", errno);
        return errno;
    }

    /* Create address */
    controller->addr.sin_family = AF_INET;
    controller->addr.sin_addr.s_addr = INADDR_ANY;
    controller->addr.sin_port = htons(port);

    /* Make sure client socket fd is marked as invalid until it gets a connection */
    controller->client = -1;

    if (bind(controller->sock, (struct sockaddr *)&controller->addr, sizeof(controller->addr)) < 0) {
        herr("Failed to bind\n");
        return errno;
    }

    /* Set up keep-alive options */

    int keepalive = 1;
    err = setsockopt(controller->sock, SOL_SOCKET, SO_KEEPALIVE, (void *)&keepalive, sizeof(keepalive));
    if (err < 0) {
        herr("Failed to set socket as keep-alive: %d.\n", errno);
        return errno;
    }

    /* Each interval between ACK probes is `int_secs` long */

    int int_secs = KEEPALIVE_INTERVAL_SECS;
    err = setsockopt(controller->sock, IPPROTO_TCP, TCP_KEEPINTVL, &int_secs, sizeof(int));
    if (err < 0) {
        herr("Failed to set keep-alive interval to %d: %d.\n", int_secs, errno);
        return errno;
    }

    hinfo("Set keep-alive interval %d s\n", int_secs);

    int count = KEEPALIVE_N_PROBES; /* Gives 10 probes (10 * `int_secs` seconds) to regain connection */
    err = setsockopt(controller->sock, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(int));
    if (err < 0) {
        herr("Failed to set keep-alive probe count to %d: %d.\n", count, errno);
        return errno;
    }

    hinfo("Set number of keep-alive probes: %d\n", count);

    return 0;
}

And here is the code where I expect to see the timeout:

// controller->client is the `accept`-ed client connection fd
bread = recv(controller->client, (char *)&hdr + total_read, sizeof(hdr) - total_read, 0);
if (bread == -1) {
    herr("Error reading message header: %s\n", strerror(errno));

    if (errno == ECONNRESET) {
        // TODO: this should trigger an abort because it happens when TCP keep-alive is done
        herr("Lost connection! ABORT!\n");
    }

    break;
} else if (bread == 0) {
    herr("Control box disconnected.\n");
    break;
}
total_read += bread;

I have also used Wireshark to listen to the network traffic being sent and I am unable to see any of the probes being sent at all.

On which OS does this issue occur?

[OS: Linux]

What is the version of your OS?

Linux 6.14.7-arch2-1 SMP PREEMPT_DYNAMIC Thu, 22 May 2025 05:37:49 +0000 x86_64 GNU/Linux

NuttX Version

master

Issue Architecture

[Arch: arm]

Issue Area

[Area: Networking]

Host information

No response

Verification

  • I have verified before submitting the report.
@linguini1 linguini1 added the Type: Bug Something isn't working label May 26, 2025
@github-actions github-actions bot added Arch: arm Issues related to ARM (32-bit) architecture Area: Networking Effects networking subsystem OS: Linux Issues related to Linux (building system, etc) labels May 26, 2025
@linguini1
Copy link
Contributor Author

linguini1 commented May 26, 2025

Noticed that after the probes receive no reply, the TCP stack calls the abort mechanism indefinitely:

[CPU0] [ 2] tcp_timer: aborted
[CPU0] [ 2] tcp_timer: aborting
[CPU0] [ 2] tcp_timer: aborted
[CPU0] [ 2] tcp_timer: aborting
[CPU0] [ 2] tcp_timer: aborted
[CPU0] [ 2] tcp_timer: aborting
[CPU0] [ 2] tcp_timer: aborted
[CPU0] [ 2] tcp_timer: aborting
[CPU0] [ 2] tcp_timer: aborted
[CPU0] [ 2] tcp_timer: aborting
[CPU0] [ 2] tcp_timer: aborted
[CPU0] [ 2] tcp_timer: aborting
[CPU0] [ 2] tcp_timer: aborted
[CPU0] [ 2] tcp_timer: aborting

This is net/tcp/tcp_timer.c:704

It appears the mechanism is working, but the networking logic does not remove the connection properly so tcp_timer keeps getting called, and the user thread blocking on recv is never notified that the connection is reset.

@linguini1
Copy link
Contributor Author

linguini1 commented May 26, 2025

@a-lunev Do you have any ideas what this issue might be? I saw you've made some changes a long time ago to the networking logic, and thought you might now better than me.

@gregory-nutt I see that most of the TCP implementation was written by you in 2017. Do you have any idea what may be wrong here, or could you advise me about how to use your netloop example application to help debug this further?

@linguini1
Copy link
Contributor Author

@wangyingdong I saw you wrote the zero-probe timer support and tcp_xmit_probe(). Do you have any advice which may help with this issue?

@bskdany
Copy link
Contributor

bskdany commented May 26, 2025

@zhhyu7 I saw you removed the tcp_callback when accepting conn timeout, do you happen to know if that change might have interfered with TCP Keepalive?

@EliasJRH
Copy link

EliasJRH commented May 27, 2025

(Sorry for excessive mentions) @wengzhe you were mentioned on a what seems to be a similar issue here #13493. Do you have any advice?

@zhhyu7
Copy link
Contributor

zhhyu7 commented May 27, 2025

@linguini1 could you test the following patch?

tcp_timer:
                  /* Yes.. Has the retry count expired? */

                  if (conn->keepretries >= conn->keepcnt)
                    {
                      /* Yes... stop the network monitor, closing the
                       * connection and all sockets associated with the
                       * connection.
                       */

+                     devif_conn_event(conn->dev, TCP_ABORT, conn->sconn.list);
                      tcp_stop_monitor(conn, TCP_ABORT);
                    }

@linguini1
Copy link
Contributor Author

@linguini1 could you test the following patch?

tcp_timer:
                  /* Yes.. Has the retry count expired? */

                  if (conn->keepretries >= conn->keepcnt)
                    {
                      /* Yes... stop the network monitor, closing the
                       * connection and all sockets associated with the
                       * connection.
                       */

+                     devif_conn_event(conn->dev, TCP_ABORT, conn->sconn.list);
                      tcp_stop_monitor(conn, TCP_ABORT);
                    }

Thank you @zhhyu7, this resolved the issue! I am very appreciative of your help. Would you be able to either submit the patch as a PR or would you like me to submit this patch?

@zhhyu7
Copy link
Contributor

zhhyu7 commented May 28, 2025

@linguini1 could you test the following patch?

tcp_timer:
                  /* Yes.. Has the retry count expired? */

                  if (conn->keepretries >= conn->keepcnt)
                    {
                      /* Yes... stop the network monitor, closing the
                       * connection and all sockets associated with the
                       * connection.
                       */

+                     devif_conn_event(conn->dev, TCP_ABORT, conn->sconn.list);
                      tcp_stop_monitor(conn, TCP_ABORT);
                    }

Thank you @zhhyu7, this resolved the issue! I am very appreciative of your help. Would you be able to either submit the patch as a PR or would you like me to submit this patch?

@linguini1 You can submit the PR directly. Thank you very much.

@linguini1
Copy link
Contributor Author

You can submit the PR directly. Thank you very much.

Thank you!

linguini1 added a commit to linguini1/nuttx that referenced this issue May 28, 2025
The TCP keep-alive implementation would send the probes and abort the
connection correctly, but without waking up the affected socket and
marking the connection as aborted. This patch from zhhyu7 in apache#16447
resolves the issue.

Signed-off-by: Matteo Golin <matteo.golin@gmail.com>
acassis pushed a commit that referenced this issue May 28, 2025
The TCP keep-alive implementation would send the probes and abort the
connection correctly, but without waking up the affected socket and
marking the connection as aborted. This patch from zhhyu7 in #16447
resolves the issue.

Signed-off-by: Matteo Golin <matteo.golin@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arch: arm Issues related to ARM (32-bit) architecture Area: Networking Effects networking subsystem OS: Linux Issues related to Linux (building system, etc) Type: Bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants