Hi,
I tried to upgrade our distributed production environment (5 monitoring hosts monitoring 1500 instances and 50,000+ services) from 2.0.3 to 2.2 and ran into serious trouble. We run our daemons with loglevel INFO and hit the issue described below. As a workaround we tried setting the loglevel to WARNING, but after stopping and starting the services they kept logging at INFO level, so the loglevel setting does not seem to be applied.
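For reference, this is the behaviour I expected from the WARNING setting, shown with plain Python `logging`. It is only a sketch of the expected filtering, not Shinken's own logger code; the `make_logger` helper and the logger name are made up for illustration.

```python
import io
import logging


def make_logger(level_name):
    """Build a standalone logger that writes to an in-memory buffer.

    Hypothetical helper for illustration; not part of Shinken.
    """
    buf = io.StringIO()
    logger = logging.getLogger("loglevel-demo-" + level_name)
    logger.handlers = []          # start clean if the logger already exists
    logger.propagate = False      # keep output out of the root logger
    handler = logging.StreamHandler(buf)
    handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(getattr(logging, level_name))
    return logger, buf


logger, buf = make_logger("WARNING")
logger.info("Processing object config file ...")   # should be suppressed
logger.warning("service has no contact")           # should be emitted
output = buf.getvalue()
print(output, end="")
```

With the level at WARNING only the second message ends up in the output, which is what I expected to see in the daemon logs after the restart.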
The bigger issue is the following. As stated, we have a pretty large config. The arbiter outputs a line at INFO level for each host and service and, if something is off (such as a service having no contact), what seems to be an additional WARNING line. For our config this generates between 75,000 and 100,000 lines of output. That in itself is not so bad, but for some reason these lines are sent to the broker. The broker looks at each line, concludes that it is not a valid line for storing in Mongo, and drops it. This takes forever, and while the broker is busy with it, it does not respond to requests on its livestatus module, so Thruk is unhappy. That is why we tried the loglevel WARNING workaround mentioned earlier, but it didn't work.
Example of entries I got in brokerd.log:
[1424865789] INFO: [broker-ops-shinken01.example.com] [LogStoreMongoDB] This line is invalid: [1424865069] INFO: [Shinken] Processing object config file '/etc/shinken/monitoring_config/manual/hosts/gsp-lt-ndb-app01.example.com/gsp-lt-ndb-app01.example.com-FILESYSTEM__.cfg'
So why is the output from the config test of the arbiter sent to the broker in the first place?
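To get a feel for how many lines are affected, something along these lines could count the rejected entries in brokerd.log. This is a rough sketch: the pattern is inferred from the single example entry above and may not match every variant, and `count_rejected` is a made-up name.

```python
import re
from collections import Counter

# Pattern inferred from the one example entry above (an assumption,
# not taken from the LogStoreMongoDB source).
INVALID_RE = re.compile(
    r"^\[(?P<logged>\d+)\] INFO: \[(?P<broker>[^\]]+)\] "
    r"\[LogStoreMongoDB\] This line is invalid: "
    r"\[(?P<orig_ts>\d+)\] (?P<orig_level>[A-Z]+): (?P<rest>.*)$"
)


def count_rejected(lines):
    """Count rejected lines per original log level (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = INVALID_RE.match(line.strip())
        if match:
            counts[match.group("orig_level")] += 1
    return counts


sample = [
    "[1424865789] INFO: [broker-ops-shinken01.example.com] [LogStoreMongoDB] "
    "This line is invalid: [1424865069] INFO: [Shinken] Processing object "
    "config file '/etc/shinken/monitoring_config/manual/hosts/"
    "gsp-lt-ndb-app01.example.com/gsp-lt-ndb-app01.example.com-FILESYSTEM__.cfg'",
    "[1424865789] ERROR: [broker-ops-shinken01.example.com] "
    "LiveStatusClientError: Could not send response: [Errno 32] Broken pipe",
]

print(count_rejected(sample))
```

To run it against a real log, pass an open file object instead of `sample`; the location of brokerd.log depends on the installation.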
I also got exciting stuff like this, but I suspect that's because the broker is busy plowing through those lines:
[1424865789] ERROR: [broker-ops-shinken01.example.com] LiveStatusClientError: Could not send response: [Errno 32] Broken pipe
I haven't had time to look through the code to find where this is done, but maybe you can pinpoint it much quicker than I can. If you need any info from me to help debug this, please let me know!
Thanks!
Guus