Skip to content

Issue with commands not timing out when the server is unresponsive and rate of outgoing traffic is high #1225

@rahulKrishnaM

Description

@rahulKrishnaM

Hi,

We are using hiredis-cluster library to interact with a redis cluster. Recently, we came across an issue where hiredis library was taking a very long time (more than 4mins) to detect that the redis server was down, when there was an abrupt disconnection of tcp connection (no FIN packet sent from server), and continued to send packets to the server IP that went down. Looking more into the hiredis code we were thinking if there could be an issue with the command timeout refresh logic.

  1. When the socket is writeable, the timeout gets refreshed here.

    hiredis/async.c

    Line 750 in 869f3d0

    _EL_ADD_WRITE(ac);

  2. When application tries to queue data to hiredis, the timeout gets refreshed here.

    hiredis/async.c

    Line 961 in 869f3d0

    _EL_ADD_WRITE(ac);

Let's say the server is nonresponsive. The socket might not become writeable any time soon, so code path 1 will never get hit. However, application continues to push data to hiredis out buffer at the same rate since it is not aware of the server state. As part of this code flow 2 gets invoked, and refresh timeout gets updated. But this continues to for a long time, and the timeout keeps getting extended for a long period of time. I feel this is causing the issue here. Also, just to highlight, if the application is not pumping data at a regular/high rate, we might see the timeout is happening as expected, as code path 2 won't get hit within the command_timeout period, and hiredis won't refresh the timeout.

The question I have is why should the timeout be refreshed as part of point 2 ? When a socket read/write event happens refreshing the timeout makes sense. However, when application is pushing commands to hiredis, isn't it enough to just add a write event without refreshing the timeout. This would inturn result in a non-responsive server to timeout early and give back a "timeout" error response, which would result in client detecting the faulty server early and sending disconnect to it.

Please share your thoughts.

Metadata

Metadata

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions