
Adds check to automatically restart rkubelog every 24 hours #38


Open

wants to merge 1 commit into master

Conversation

cfroystad

Since adopting rkubelog, we've quite often experienced that all, or even worse, only some log sources stop being sent to Papertrail.

The only way we've found to work around the problem is to restart the rkubelog pod. This is also the advice given in the readme.

To avoid having to restart the logging solution all the time, and to have peace of mind that we're actually receiving all our logs rather than just some of them, we've implemented an automatic restart of rkubelog every 24 hours.

Here's our code in case it's useful for the project until the underlying issue is fixed. (FYI: I closed the prior PR because of a mistake in the code; I didn't want it to be merged by accident.)
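
To give an idea of the approach, here's a minimal sketch of how an uptime-based restart check could look, assuming rkubelog exposes a small HTTP liveness endpoint that a Kubernetes livenessProbe points at. The port, path, and threshold below are illustrative assumptions, not the actual code in this PR:

```go
// Minimal sketch (not the PR's actual code): expose a liveness endpoint that
// starts failing once the process has been running for 24 hours. Pointing a
// Kubernetes livenessProbe at it makes the kubelet restart the pod once the
// probe has failed enough consecutive times.
package main

import (
	"net/http"
	"time"
)

const maxUptime = 24 * time.Hour

func main() {
	start := time.Now()

	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		if time.Since(start) > maxUptime {
			// After 24h, report unhealthy so the kubelet restarts the container.
			http.Error(w, "max uptime exceeded, requesting restart", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	// Port and path are illustrative only; adjust to match the deployment's probe.
	http.ListenAndServe(":8080", nil)
}
```

With something along these lines, the kubelet performs the actual restart, which is also why the failureThreshold setting mentioned further down matters.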

@jtomaszewski

Out of curiosity: is there any downtime during that restart? Will some logs be dropped while rkubelog is restarting? Do you happen to know?

@cfroystad
Author

Sorry, I haven't checked that. When I scale up and down manually, it's a matter of (milli?)seconds. I haven't checked whether rkubelog does any kind of tracking to know which logs it has already passed along.

However, in our case it was a matter of potentially losing logs for hours if nobody was actively monitoring log ingestion at every hour of the day or night. So this was an acceptable solution while we consider alternative approaches in case SolarWinds/Papertrail cannot provide stable log ingestion.

Note also that the method included here isn't totally foolproof, so do keep an eye on log ingestion anyway - but at least for us it has become much more reliable.

You may also need to adjust the failureThreshold property of the liveness probe in the deployment.
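
For context, and assuming a standard Kubernetes livenessProbe: the kubelet only restarts the container after failureThreshold consecutive probe failures spaced periodSeconds apart, so with, say, periodSeconds: 60 and failureThreshold: 3, the restart would happen roughly three minutes after the probe first starts failing.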

@jtomaszewski

We've just run into the very same problem and will probably use this PR as well.
Something definitely needs to be fixed here.
