ApiListener fails to reconnect on heavily loaded system #10376
I experimented with sending a 'HUP' signal to perform a daemon configuration reload on one of the instances that was sitting without a connection to the parent endpoint. This caused the daemon to shut down the stalled 'ApiListener' and start a new listener that then reconnected successfully to the parent.
I did not try '/usr/lib/icinga2/safe-reload' or 'systemctl reload icinga2.service', but they both ultimately cause the 'SIGHUP' to be delivered, so they will likely work too.
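For reference, delivering the signal directly is equivalent to what those wrappers end up doing. A minimal POSIX sketch (purely illustrative; 'RequestReload' is a made-up name and 'pid' is assumed to be the icinga2 main process ID, e.g. read from its PID file):

```cpp
#include <signal.h>     // kill(), SIGHUP
#include <sys/types.h>  // pid_t

// Deliver SIGHUP to the running icinga2 daemon to trigger a configuration
// reload; 'pid' is assumed to come from the daemon's PID file.
bool RequestReload(pid_t pid)
{
    return kill(pid, SIGHUP) == 0;
}
```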
I captured another occurrence overnight with 'debuglog' enabled. There are signs of heavy load and things being delayed at '05:27:30' with checks timing out due to the load.
At '05:27:32' 'JsonRpcConnection' reports that no messages have been received and that it is disconnecting the API client.
At '05:27:33' the 'ApiListener' reports that it thinks the connection is still present.
Over the next few seconds there are signs that the connection is being shut down. Interleaved with that are some informational messages that time has jumped forward (presumably due to the load).
At '05:27:43' a reconnection attempt is logged.
At '05:27:53' the reconnection timer reports that a connection attempt is in progress.
The connection attempt times out at '05:27:59'.
At '05:28:03' and every 10 seconds afterwards the reconnection timer reports that a connection attempt is still in progress and declines to start a new attempt.
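As an illustration of that behaviour (a standalone sketch with made-up names, not Icinga 2 source), the timer logic described here amounts to skipping any endpoint whose "connecting" flag is still set, so a flag that is never cleared blocks every future attempt:

```cpp
#include <iostream>
#include <string>
#include <vector>

// Hypothetical sketch of a reconnect timer, not Icinga 2 source. It shows why
// a "connecting" flag that is never reset blocks all future attempts.
struct Endpoint {
    std::string name;
    bool connected = false;
    bool connecting = false;
};

void ReconnectTimerHandler(std::vector<Endpoint>& endpoints)
{
    for (auto& ep : endpoints) {
        if (ep.connected)
            continue;

        if (ep.connecting) {
            // Corresponds to the repeating 05:28:03 log entries: an attempt
            // is believed to still be in progress, so none is started.
            std::cout << "skipping " << ep.name << ": attempt in progress\n";
            continue;
        }

        ep.connecting = true;
        std::cout << "starting reconnect to " << ep.name << '\n';
        // The asynchronous connection attempt would be expected to clear
        // ep.connecting on both its success and failure paths.
    }
}
```

If nothing ever clears the flag, each timer tick takes the "in progress" branch forever.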
The (TCP?) connection does eventually connect, at '05:25:14', but fails with a 'system_error' as previously described.
Based on this it seems to me that the system-error handling code is not cleaning up the active connection attempt sufficiently for the reconnection timer code to realise that a new connection attempt is required. The debug log below was filtered to remove extraneous data using the following command.
LOL! Wouldn't it make sense to find out why your system is so overloaded? If every system call takes literally minutes to process, don't expect Icinga 2 to behave normally.
That's because the previously opened socket is not yet fully closed and may still be waiting for your overloaded system to clean it up properly. Otherwise, if you receive a …
Oh yes. Finding the cause of the load is definitely something I am doing. But I am also doing all I can to work out what is going wrong with icinga2.
I do now have a theory about why reconnection attempts do not restart. 'ApiListener::AddConnection' has a try/catch block for 'std::exception', but neither the 'Cannot connect to host' log message nor the normal 'Finished reconnecting to endpoint' message is logged. Both of those code paths call 'endpoint->SetConnecting(false)'. I suspect that a catch of 'boost::system::system_error', as thrown by 'NewClientHandlerInternal', would allow a proper cleanup there. I am testing the following patch to see if this theory is correct.
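The patch itself is not included in this excerpt. As a purely illustrative alternative technique (a standalone sketch, not Icinga 2 code and not the author's patch), the underlying goal of guaranteeing that the connecting flag is cleared on every exit path can also be met with an RAII guard rather than a dedicated catch clause:

```cpp
#include <boost/asio/error.hpp>
#include <boost/system/system_error.hpp>
#include <atomic>
#include <iostream>

// Stand-in for Endpoint::SetConnecting(false); in Icinga 2 the real flag
// lives on the Endpoint object.
std::atomic<bool> g_Connecting{false};

// RAII guard: the destructor also runs during stack unwinding, so the flag
// is cleared no matter which exception type escapes the connection attempt.
struct ConnectingGuard {
    ~ConnectingGuard() { g_Connecting = false; }
};

void AttemptConnection()
{
    g_Connecting = true;
    ConnectingGuard guard;

    // Simulate the connect path failing on an overloaded system.
    throw boost::system::system_error(
        boost::asio::error::make_error_code(boost::asio::error::timed_out));
}

int main()
{
    try {
        AttemptConnection();
    } catch (const std::exception& ex) {
        std::cerr << "connection attempt failed: " << ex.what() << '\n';
    }
    std::cout << "connecting flag: " << g_Connecting << '\n';  // prints 0
}
```

One caveat: 'boost::system::system_error' derives from 'std::exception', so a plain 'catch (const std::exception&)' does catch it; a guard like this mainly helps if the throw can happen outside the try block.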
It's stale and, as it seems, so is everything else on your system. If the clock jumps forward by over 15 minutes, and not just once, it makes no sense for me to hunt for bugs elsewhere.
Describe the bug
On a very heavily loaded system the ApiListener may disconnect, presumably due to a timeout. When this happens a reconnect is started, but it times out before the connection is established. After this point no more attempts to reconnect are made until the icinga2 daemon is restarted.
To Reproduce
I cannot give a step-by-step guide to reproducing this, other than having a system that occasionally gets itself into an extremely high load situation.
Note that the remote endpoint is in a container on the same heavily loaded computer, which explains the slowness in responding to the connection request. Many other connections to that same remote endpoint continue to work, so it does suggest to me that the problem lies in the local endpoint and not the remote one.
Expected behavior
The endpoints should reconnect after the disruption. The following log is from another instance that did reconnect successfully after the same high load situation.
Your Environment
Include as many relevant details about the environment you experienced the problem in:
- Version used (icinga2 --version): Observed on multiple versions of icinga2: r2.14.5-1, r2.13.6-1, r2.12.3-1
- Operating System and version: linux amd64; Ubuntu 24.04 kernel with various Debian/Ubuntu versions running in LXC containers
- Enabled features (icinga2 feature list): api checker mainlog
- Config validation (icinga2 daemon -C):
- If you run multiple Icinga 2 instances, the zones.conf file (or icinga2 object list --type Endpoint and icinga2 object list --type Zone) from all affected nodes.
Additional context
This failure mode is observed on the same systems where I have observed #10355. I think that they are two separate issues, and the ApiListener problem occurs both with and without the fix for #10355 applied.
I am currently running a custom build of 2.14.5-1 and have added additional logging to 'lib/remote/apilistener.cpp' to trace the path that the failing instances take through 'ApiListener::NewClientHandlerInternal'. The execution continues as far as either 'SendMessage' or 'async_flush' for 'RoleClient'.
That causes a 'system_error' to be thrown, which gets caught at …
I did not log the 'systemError.code()' value.
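For next time, a minimal sketch of capturing that value (hypothetical; 'LogSystemError' is a made-up helper, not the existing logging code):

```cpp
#include <boost/system/system_error.hpp>
#include <iostream>

// Log both the numeric value and the category of the error code carried by a
// caught boost::system::system_error, for later diagnosis.
void LogSystemError(const boost::system::system_error& ex)
{
    const boost::system::error_code& ec = ex.code();
    std::cerr << "system_error: " << ex.what()
              << " (value " << ec.value()
              << ", category " << ec.category().name() << ")\n";
}
```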
I am happy to add more debug and/or test possible fixes.