ApiListener fails to reconnect on heavily loaded system #10376

Open
oldelvet opened this issue Mar 13, 2025 · 6 comments

Comments

@oldelvet
Contributor

Describe the bug

On a very heavily loaded system the ApiListener may disconnect, presumably due to a timeout. When this happens a reconnect is started, but it times out before the connection is established. After this point no further reconnect attempts are made until the icinga2 daemon is restarted.

[2025-03-13 00:26:31 +0000] information/Application: We jumped forward in time: 74.132 seconds
[2025-03-13 00:38:07 +0000] information/Application: We jumped forward in time: 713.587 seconds
[2025-03-13 00:39:50 +0000] warning/ApiListener: Removing API client for endpoint 'shirehall'. 0 API clients left.
[2025-03-13 00:39:51 +0000] warning/JsonRpcConnection: API client disconnected for identity 'shirehall'
[2025-03-13 00:40:02 +0000] information/WorkQueue: #6 (ApiListener, RelayQueue) items: 0, rate: 0.6/s (36/min 36/5min 36/15min);
[2025-03-13 00:40:02 +0000] information/WorkQueue: #7 (ApiListener, SyncQueue) items: 0, rate: 0/s (0/min 0/5min 0/15min);
[2025-03-13 00:40:03 +0000] information/ApiListener: Reconnecting to endpoint 'shirehall' via host 'shirehall' and port '5665'
[2025-03-13 00:40:44 +0000] critical/ApiListener: Timeout while reconnecting to endpoint 'shirehall' via host 'shirehall' and port '5665', cancelling attempt
[2025-03-13 00:41:04 +0000] information/ApiListener: New client connection for identity 'shirehall' to [xxxx:xxxx:xxxx:xxxx::xx]:5665
[2025-03-13 00:43:05 +0000] information/ConfigObject: Dumping program state to file '/var/lib/icinga2/icinga2.state'
[2025-03-13 00:45:02 +0000] information/WorkQueue: #6 (ApiListener, RelayQueue) items: 0, rate: 0.466667/s (28/min 65/5min 101/15min);
[2025-03-13 00:45:02 +0000] information/WorkQueue: #7 (ApiListener, SyncQueue) items: 0, rate: 0/s (0/min 0/5min 0/15min);

To Reproduce

I cannot give a step-by-step guide to reproducing this, other than having a system that occasionally gets itself into an extremely high load situation.

Note that the remote endpoint is in a container on the same heavily loaded computer, which explains the slowness in responding to the connection request. Many other connections to that same remote endpoint continue to work, which suggests to me that the problem lies in the local endpoint and not the remote one.

Expected behavior

The endpoints should reconnect after the disruption. The following log is from another instance that did reconnect successfully after the same high-load situation.

[2025-03-13 00:23:36 +0000] information/WorkQueue: #7 (ApiListener, SyncQueue) items: 0, rate: 0/s (0/min 0/5min 0/15min);
[2025-03-13 00:24:52 +0000] information/Application: We jumped forward in time: 20.5464 seconds
[2025-03-13 00:39:55 +0000] information/WorkQueue: #6 (ApiListener, RelayQueue) items: 0, rate: 0.966667/s (58/min 126/5min 290/15min);
[2025-03-13 00:39:55 +0000] information/ConfigObject: Dumping program state to file '/var/lib/icinga2/icinga2.state'
[2025-03-13 00:39:57 +0000] information/Application: We jumped forward in time: 904.955 seconds
[2025-03-13 00:40:00 +0000] information/JsonRpcConnection: No messages for identity 'shirehall' have been received in the last 60 seconds.
[2025-03-13 00:40:05 +0000] information/WorkQueue: #7 (ApiListener, SyncQueue) items: 0, rate: 0/s (0/min 0/5min 0/15min);
[2025-03-13 00:40:06 +0000] warning/ApiListener: Removing API client for endpoint 'shirehall'. 0 API clients left.
[2025-03-13 00:40:06 +0000] warning/JsonRpcConnection: API client disconnected for identity 'shirehall'
[2025-03-13 00:40:15 +0000] information/ApiListener: Reconnecting to endpoint 'shirehall' via host 'shirehall' and port '5665'
[2025-03-13 00:40:44 +0000] critical/ApiListener: Timeout while reconnecting to endpoint 'shirehall' via host 'shirehall' and port '5665', cancelling attempt
[2025-03-13 00:41:06 +0000] information/ApiListener: New client connection for identity 'shirehall' to [xxxx:xxxx:xxxx:xxxx::xx]:5665
[2025-03-13 00:41:06 +0000] information/JsonRpcConnection: Requesting new certificate for this Icinga instance from endpoint 'shirehall'.
[2025-03-13 00:41:06 +0000] information/ApiListener: Sending config updates for endpoint 'shirehall' in zone 'shirehall'.
[2025-03-13 00:41:06 +0000] information/ApiListener: Finished sending config file updates for endpoint 'shirehall' in zone 'shirehall'.
[2025-03-13 00:41:06 +0000] information/ApiListener: Syncing runtime objects to endpoint 'shirehall'.
[2025-03-13 00:41:06 +0000] information/ApiListener: Finished reconnecting to endpoint 'shirehall' via host 'shirehall' and port '5665'
[2025-03-13 00:41:25 +0000] information/ApiListener: Finished syncing runtime objects to endpoint 'shirehall'.
[2025-03-13 00:41:25 +0000] information/ApiListener: Finished sending runtime config updates for endpoint 'shirehall' in zone 'shirehall'.
[2025-03-13 00:41:25 +0000] information/ApiListener: Sending replay log for endpoint 'shirehall' in zone 'shirehall'.
[2025-03-13 00:41:26 +0000] information/ApiListener: Replayed 14 messages.
[2025-03-13 00:41:26 +0000] information/ApiListener: Finished sending replay log for endpoint 'shirehall' in zone 'shirehall'.
[2025-03-13 00:41:26 +0000] information/ApiListener: Finished syncing endpoint 'shirehall' in zone 'shirehall'.

Your Environment

Include as many relevant details about the environment you experienced the problem in

  • Version used (icinga2 --version):

Observed on multiple versions of icinga2

version: r2.14.5-1
version: r2.13.6-1
version: r2.12.3-1

  • Operating System and version:

Linux amd64. Ubuntu 24.04 host kernel with various Debian/Ubuntu versions running in LXC containers.

  • Enabled features (icinga2 feature list):

Enabled features: api checker mainlog

  • Config validation (icinga2 daemon -C):

  • If you run multiple Icinga 2 instances, the zones.conf file (or icinga2 object list --type Endpoint and icinga2 object list --type Zone) from all affected nodes.

root@alpha:~# cat /etc/icinga2/zones.conf
/*
 * Generated by Icinga 2 node setup commands
 * on 2020-10-20 20:25:28 +0000
 */

object Endpoint "shirehall" {
host = "shirehall"
port = "5665"
}

object Zone "shirehall" {
endpoints = [ "shirehall" ]
}

object Endpoint "alpha" {
}

object Zone "alpha" {
endpoints = [ "alpha" ]
parent = "shirehall"
}

object Zone "global-templates" {
global = true
}

object Zone "director-global" {
global = true
}

Additional context

This failure mode is observed on the same systems where I have observed #10355. I think they are two separate issues; the ApiListener problem occurs both with and without the fix for #10355 applied.

I am currently running a custom build of 2.14.5-1 and have added additional logging to 'lib/remote/apilistener.cpp' to trace the path that the failing instances take through 'ApiListener::NewClientHandlerInternal'. Execution continues as far as either 'SendMessage' or 'async_flush' for 'RoleClient'.

if (role == RoleClient) {
	JsonRpc::SendMessage(client, new Dictionary({
		{ "jsonrpc", "2.0" },
		{ "method", "icinga::Hello" },
		{ "params", new Dictionary({
			{ "version", (double)l_AppVersionInt },
			{ "capabilities", (double)l_MyCapabilities }
		}) }
	}), yc);
	client->async_flush(yc);

	ctype = ClientJsonRpc;

That causes a 'system_error' to be thrown, which is caught at

} catch (const boost::system::system_error& systemError) {
	if (systemError.code() == boost::asio::error::operation_aborted) {
		shutDownIfNeeded.Cancel();
	}

	throw;

I did not log the 'systemError.code()' value.
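
In case it helps, the kind of logging I plan to add next would look roughly like this; it is only a sketch of where the 'systemError.code()' value could be captured inside the existing catch block, not the code currently running on my custom build.

} catch (const boost::system::system_error& systemError) {
	// Illustrative extra logging (sketch only): record the error code and message
	// so the next occurrence shows exactly which Asio error aborted the handshake.
	Log(LogInformation, "ApiListener")
		<< "Debug client connection system error: " << systemError.code().value()
		<< " (" << systemError.code().message() << ")";

	if (systemError.code() == boost::asio::error::operation_aborted) {
		shutDownIfNeeded.Cancel();
	}

	throw;
}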

I am happy to add more debug and/or test possible fixes.

@oldelvet
Contributor Author

I experimented with sending a 'HUP' signal to perform a daemon configuration reload on one of the instances that was sitting without a connection to the parent endpoint. This caused the daemon to shut down the stalled 'ApiListener' and start a new listener, which then reconnected successfully to the parent.

root@workshop:~# kill -HUP 727
root@workshop:~#
[2025-03-13 12:22:37 +0000] information/Application: Received request to shut down.
[2025-03-13 12:22:38 +0000] information/Application: Shutting down...
[2025-03-13 12:22:38 +0000] information/CheckerComponent: 'checker' stopped.
[2025-03-13 12:22:38 +0000] information/ApiListener: 'api' stopped.
[2025-03-13 12:22:38 +0000] information/FileLogger: 'main-log' started.
[2025-03-13 12:22:38 +0000] information/ApiListener: 'api' started.
[2025-03-13 12:22:38 +0000] information/ApiListener: Started new listener on '[::]:5665'
[2025-03-13 12:22:38 +0000] information/CheckerComponent: 'checker' started.
[2025-03-13 12:22:38 +0000] information/ConfigItem: Activated all objects.
[2025-03-13 12:22:39 +0000] information/ApiListener: Reconnecting to endpoint 'shirehall' via host 'shirehall' and port '5665'
[2025-03-13 12:22:39 +0000] information/ApiListener: New client connection for identity 'shirehall' to [xxxx:xxxx:xxxx:xxxx::xx]:5665
[2025-03-13 12:22:39 +0000] information/JsonRpcConnection: Requesting new certificate for this Icinga instance from endpoint 'shirehall'.
[2025-03-13 12:22:39 +0000] information/ApiListener: Sending config updates for endpoint 'shirehall' in zone 'shirehall'.
[2025-03-13 12:22:39 +0000] information/ApiListener: Finished sending config file updates for endpoint 'shirehall' in zone 'shirehall'.
[2025-03-13 12:22:39 +0000] information/ApiListener: Syncing runtime objects to endpoint 'shirehall'.
[2025-03-13 12:22:39 +0000] information/ApiListener: Finished reconnecting to endpoint 'shirehall' via host 'shirehall' and port '5665'
[2025-03-13 12:22:39 +0000] information/ApiListener: Finished syncing runtime objects to endpoint 'shirehall'.
[2025-03-13 12:22:39 +0000] information/ApiListener: Finished sending runtime config updates for endpoint 'shirehall' in zone 'shirehall'.
[2025-03-13 12:22:39 +0000] information/ApiListener: Sending replay log for endpoint 'shirehall' in zone 'shirehall'.
[2025-03-13 12:22:39 +0000] information/ApiListener: Replayed 8084 messages.
[2025-03-13 12:22:39 +0000] information/ApiListener: Finished sending replay log for endpoint 'shirehall' in zone 'shirehall'.
[2025-03-13 12:22:39 +0000] information/ApiListener: Finished syncing endpoint 'shirehall' in zone 'shirehall'.
[2025-03-13 12:22:41 +0000] information/ApiListener: Applying config update from endpoint 'shirehall' of zone 'shirehall'.
[2025-03-13 12:22:41 +0000] information/ApiListener: Our production configuration is more recent than the received configuration update. Ignoring configuration file update for path '/var/lib/icinga2/api/zones-stage/global-templates'. Current timestamp '2025-02-17 14:03:51 +0000' (1739801031.214035) >= received timestamp '2025-02-17 14:03:51 +0000' (1739801031.214035).

I did not try '/usr/lib/icinga2/safe-reload' or 'systemctl reload icinga2.service', but both ultimately cause 'SIGHUP' to be delivered, so they will likely work too.

@oldelvet
Contributor Author

I captured another occurrence overnight with 'debuglog' enabled.

There are signs of heavy load and things being delayed at '05:27:30' with checks timing out due to the load.

[2025-03-14 05:27:30 +0000] warning/Process: Terminating process 167042 ('/usr/lib/nagios/plugins/check_users' '-c' '10' '-w' '5') after timeout of 60 seconds

At '05:27:32' 'JsonRpcConnection' reports that no messages have been received and that it is disconnecting the API client.

[2025-03-14 05:27:32 +0000] information/JsonRpcConnection: No messages for identity 'shirehall' have been received in the last 60 seconds.
[2025-03-14 05:27:32 +0000] notice/JsonRpcConnection: Disconnecting API client for identity 'shirehall'

At '05:27:33' the 'ApiListener' reports that it thinks the connection is still present.

[2025-03-14 05:27:33 +0000] notice/ApiListener: Updating object authority for objects at endpoint 'gala'.
[2025-03-14 05:27:33 +0000] debug/ApiListener: Not connecting to Endpoint 'gala' because that's us.
[2025-03-14 05:27:33 +0000] debug/ApiListener: Not connecting to Endpoint 'shirehall' because we're already connected to it.
[2025-03-14 05:27:33 +0000] notice/ApiListener: Current zone master: gala
[2025-03-14 05:27:33 +0000] notice/ApiListener: Connected endpoints: shirehall (1)

Over the next few seconds there are signs that the connection is being shut down. Interleaved with these are some information messages that time has jumped forward (presumably due to the load).

[2025-03-14 05:27:33 +0000] notice/ApiListener: Setting log position for identity 'shirehall': 2025/03/14 01:26:29
[2025-03-14 05:27:34 +0000] notice/ApiListener: Error while setting log position for identity 'shirehall': Error: Cannot send message to already disconnected API client 'shirehall'!
[2025-03-14 05:27:36 +0000] notice/ApiListener: Error while sending message to endpoint 'shirehall': Error: Cannot send message to already disconnected API client 'shirehall'!
[2025-03-14 05:27:36 +0000] notice/ApiListener: Error while sending message to endpoint 'shirehall': Error: Cannot send message to already disconnected API client 'shirehall'!
[2025-03-14 05:27:36 +0000] notice/ApiListener: Error while sending message to endpoint 'shirehall': Error: Cannot send message to already disconnected API client 'shirehall'!
[2025-03-14 05:27:36 +0000] notice/ApiListener: Error while sending message to endpoint 'shirehall': Error: Cannot send message to already disconnected API client 'shirehall'!
[2025-03-14 05:27:37 +0000] information/Application: We jumped forward in time: 28.4746 seconds
[2025-03-14 05:27:39 +0000] notice/ApiListener: Setting log position for identity 'shirehall': 2025/03/14 01:26:29
[2025-03-14 05:27:39 +0000] notice/ApiListener: Error while setting log position for identity 'shirehall': Error: Cannot send message to already disconnected API client 'shirehall'!
[2025-03-14 05:27:39 +0000] information/Application: We jumped forward in time: 173.286 seconds
[2025-03-14 05:27:40 +0000] notice/ApiListener: Error while sending message to endpoint 'shirehall': Error: Cannot send message to already disconnected API client 'shirehall'!
[2025-03-14 05:27:40 +0000] notice/ApiListener: Error while sending message to endpoint 'shirehall': Error: Cannot send message to already disconnected API client 'shirehall'!
[2025-03-14 05:27:40 +0000] notice/ApiListener: Error while sending message to endpoint 'shirehall': Error: Cannot send message to already disconnected API client 'shirehall'!
[2025-03-14 05:27:40 +0000] notice/ApiListener: Error while sending message to endpoint 'shirehall': Error: Cannot send message to already disconnected API client 'shirehall'!
[2025-03-14 05:27:41 +0000] notice/ApiListener: Error while sending message to endpoint 'shirehall': Error: Cannot send message to already disconnected API client 'shirehall'!
[2025-03-14 05:27:41 +0000] notice/ApiListener: Error while sending message to endpoint 'shirehall': Error: Cannot send message to already disconnected API client 'shirehall'!
[2025-03-14 05:27:41 +0000] notice/ApiListener: Error while sending message to endpoint 'shirehall': Error: Cannot send message to already disconnected API client 'shirehall'!
[2025-03-14 05:27:41 +0000] notice/ApiListener: Error while sending message to endpoint 'shirehall': Error: Cannot send message to already disconnected API client 'shirehall'!
[2025-03-14 05:27:42 +0000] debug/JsonRpcConnection: Error while reading JSON-RPC message for identity 'shirehall': Error: Operation canceled
Stacktrace:
0# __cxa_throw in /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2
1# 0x00006285091D2233 in /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2
2# icinga::NetString::ReadStringFromStream(boost::intrusive_ptr<icinga::Shared<icinga::AsioTlsStream> > const&, boost::asio::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::executor> >, long) in /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2
3# icinga::JsonRpcConnection::HandleIncomingMessages(boost::asio::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::executor> >) in /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2
4# 0x0000628509470AFF in /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2
5# 0x00006285094ADAD5 in /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2
6# make_fcontext in /lib/x86_64-linux-gnu/libboost_context.so.1.74.0
[2025-03-14 05:27:42 +0000] warning/ApiListener: Removing API client for endpoint 'shirehall'. 0 API clients left.
[2025-03-14 05:27:42 +0000] debug/EndpointDbObject: update is_connected=0 for endpoint 'shirehall'
[2025-03-14 05:27:42 +0000] warning/JsonRpcConnection: API client disconnected for identity 'shirehall'

At '05:27:43' a reconnection attempt is logged.

[2025-03-14 05:27:43 +0000] debug/ApiListener: Not connecting to Endpoint 'gala' because that's us.
[2025-03-14 05:27:43 +0000] notice/ApiListener: Current zone master: gala
[2025-03-14 05:27:43 +0000] notice/ApiListener: Connected endpoints:
[2025-03-14 05:27:43 +0000] information/ApiListener: Reconnecting to endpoint 'shirehall' via host 'shirehall' and port '5665'

At '05:27:53' the reconnection timer reports that a connection attempt is in progress.

[2025-03-14 05:27:53 +0000] notice/ApiListener: Updating object authority for objects at endpoint 'gala'.
[2025-03-14 05:27:53 +0000] debug/ApiListener: Not connecting to Endpoint 'gala' because that's us.
[2025-03-14 05:27:53 +0000] debug/ApiListener: Not connecting to Endpoint 'shirehall' because we're already trying to connect to it.
[2025-03-14 05:27:53 +0000] notice/ApiListener: Current zone master: gala
[2025-03-14 05:27:53 +0000] notice/ApiListener: Connected endpoints:

The connection attempt times out at '05:27:59'.

[2025-03-14 05:27:59 +0000] critical/ApiListener: Timeout while reconnecting to endpoint 'shirehall' via host 'shirehall' and port '5665', cancelling attempt

At '05:28:03' and every 10 seconds afterwards the reconnection timer reports that a connection attempt is still in progress and declines to start a new attempt.

[2025-03-14 05:28:03 +0000] notice/ApiListener: Updating object authority for objects at endpoint 'gala'.
[2025-03-14 05:28:03 +0000] debug/ApiListener: Not connecting to Endpoint 'gala' because that's us.
[2025-03-14 05:28:03 +0000] debug/ApiListener: Not connecting to Endpoint 'shirehall' because we're already trying to connect to it.
[2025-03-14 05:28:03 +0000] notice/ApiListener: Current zone master: gala
[2025-03-14 05:28:03 +0000] notice/ApiListener: Connected endpoints:

The (TCP?) connection actually connects at '05:28:14' but fails with a 'system_error' as previously described.

[2025-03-14 05:28:14 +0000] information/ApiListener: New client connection for identity 'shirehall' to [xxxx:xxxx:xxxx:xxxx::xx]:5665
[2025-03-14 05:28:14 +0000] information/ApiListener: Debug client connection to [xxxx:xxxx:xxxx:xxxx::xx]:5665 RoleClient
[2025-03-14 05:28:14 +0000] information/ApiListener: Debug client connection to [xxxx:xxxx:xxxx:xxxx::xx]:5665 system error

Based on this, it seems to me that the system error handling code does not sufficiently clean up the active connection attempt, so the reconnection timer code never realises that a new connection attempt is required.
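
For context, my reading of the reconnection timer is that it behaves roughly like the sketch below (names and structure simplified; this is a paraphrase, not an exact excerpt of the source): once 'SetConnecting(true)' has been called for an endpoint, the timer skips it until 'SetConnecting(false)' is called, which is why a failed attempt that never clears the flag is never retried.

for (const Endpoint::Ptr& endpoint : zone->GetEndpoints()) {
	if (endpoint->GetConnecting()) {
		// Matches the "already trying to connect" debug message seen above.
		Log(LogDebug, "ApiListener")
			<< "Not connecting to Endpoint '" << endpoint->GetName()
			<< "' because we're already trying to connect to it.";
		continue;
	}

	AddConnection(endpoint);
}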

The debug log below was filtered to remove extraneous data using the following command.

grep -v notice/Process debug.log|grep -v /Checkable |grep -v /CheckerCompo|grep -v notice/Dependency|grep -v "/ApiListener: Relaying"|grep -v "/ApiListener: Sending" > /tmp/250315_noreconnect_debug.log

250315_noreconnect_debug.log

@yhabteab
Member

[2025-03-13 00:26:31 +0000] information/Application: We jumped forward in time: 74.132 seconds
[2025-03-13 00:39:57 +0000] information/Application: We jumped forward in time: 904.955 seconds
Interleaved in that are some information messages that time has jumped forward (presumably due to the load).

LOL! Wouldn't it make sense to find out why your system is so overloaded? If every system call takes literally minutes to process, don't expect Icinga 2 to behave normally.

[2025-03-14 05:27:59 +0000] critical/ApiListener: Timeout while reconnecting to endpoint 'shirehall' via host 'shirehall' and port '5665', cancelling attempt

At '05:28:03' and every 10 seconds afterwards the reconnection timer reports that a connection attempt is still in progress and declines to start a new attempt.

That's because the previously opened socket is not yet fully closed and may still be waiting for your overloaded system to clean it up properly. Otherwise, if you receive a ... Timeout while reconnecting to ..., cancelling attempt, you should directly get a Cannot connect to host ... critical log message, but I guess it is hanging somewhere in Boost.Asio's lowest_layer().cancel(). I do not think there's anything we can do here!

@oldelvet
Contributor Author

Oh yes. Finding the cause of the load is definitely something I am doing. But I am also doing all I can to work out what is going wrong with icinga2.

@oldelvet
Contributor Author

I now have a theory about why reconnection attempts do not restart. 'ApiListener::AddConnection' has a try/catch block for 'std::exception', but neither the 'Cannot connect to host' log message nor the normal 'Finished reconnecting to endpoint' message is logged. Both of those code paths do 'endpoint->SetConnecting(false)'. I suspect that a catch of 'boost::system::system_error' as thrown by 'NewClientHandlerInternal' would allow a proper cleanup there. I am testing the following patch to see if this theory is correct.

connectexceptions.patch
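
The gist of the patch is roughly the following (a simplified paraphrase of the idea rather than the patch text itself; the surrounding 'AddConnection' structure is abbreviated): make sure that a 'boost::system::system_error' escaping the connect path still clears the connecting flag so the reconnection timer can retry.

try {
	endpoint->SetConnecting(true);
	/* ... connect and hand the socket to NewClientHandler() ... */
	endpoint->SetConnecting(false);
} catch (const boost::system::system_error& ex) {
	// Explicitly catch system_error so the flag is cleared and the failure logged,
	// allowing the reconnect timer to start a fresh attempt.
	endpoint->SetConnecting(false);
	Log(LogCritical, "ApiListener")
		<< "Cannot connect to endpoint '" << endpoint->GetName() << "': " << ex.what();
}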

@yhabteab
Member

But I am also doing all I can to work out what is going wrong with icinga2.

It's stale, and so, it seems, is everything else on your system. If it jumps forward in time by over 15 minutes, and not just once, it makes no sense for me to hunt for bugs elsewhere.

I suspect that a catch of 'boost::system::system_error'

boost::system::system_error is a std::exception, so that shouldn't make any difference. Like I said above, if you don't get the aforementioned critical log messages, then it's hanging somewhere in Boost.Asio while it's closing the socket. If your system is able to run some manual commands during that time, you can simply run e.g. netstat -ant | grep 5665 and you will probably see that the socket is still open.
