Issue with commands not timing out when the server is unresponsive and rate of outgoing traffic is high

Hi,

We are using hiredis-cluster library to interact with a redis cluster. Recently, we came across an issue where hiredis library was taking a very long time (more than 4mins) to detect that the redis server was down, when there was an abrupt disconnection of tcp connection (no FIN packet sent from server), and continued to send packets to the server IP that went down. Looking more into the hiredis code we were thinking if there could be an issue with the command timeout refresh logic.

1. When the socket is writeable, the timeout gets refreshed here.
https://github.com/redis/hiredis/blob/869f3d0ef1513dd0258ad7190c9914df16dcc4a4/async.c#L750

2. When application tries to queue data to hiredis, the timeout gets refreshed here.
https://github.com/redis/hiredis/blob/869f3d0ef1513dd0258ad7190c9914df16dcc4a4/async.c#L961

Let's say the server is nonresponsive. The socket might not become writeable any time soon, so code path 1 will never get hit. However, application continues to push data to hiredis out buffer at the same rate since it is not aware of the server state. As part of this code flow 2 gets invoked, and refresh timeout gets updated. But this continues to for a long time, and the timeout keeps getting extended for a long period of time. I feel this is causing the issue here. Also, just to highlight, if the application is not pumping data at a regular/high rate, we might see the timeout is happening as expected, as code path 2 won't get hit within the command_timeout period, and hiredis won't refresh the timeout.

The question I have is why should the timeout be refreshed as part of point 2 ? When a socket read/write event happens refreshing the timeout makes sense. However, when application is pushing commands to hiredis, isn't it enough to just add a write event without refreshing the timeout. This would inturn result in a non-responsive server to timeout early and give back a "timeout" error response, which would result in client detecting the faulty server early and sending disconnect to it.

Please share your thoughts. 

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Issue with commands not timing out when the server is unresponsive and rate of outgoing traffic is high #1225

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue with commands not timing out when the server is unresponsive and rate of outgoing traffic is high #1225

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions