[RFC] feat: add a mechanism to watch for problems in the logs #56

Draft
wants to merge 1 commit into
base: main

fcuny-rbx
Collaborator

So far we've focused on reporting problems from external health checks
or from host-level metrics. This change introduces a different source
for problems: logs.

One or more configuration files can be provided to the detector, each
with a source (for now only journald) and a list of rules. A log watcher
is created at startup and analyzes log events as they arrive. When an
event matches one of the rules, the counter for the problem associated
with that rule is incremented.
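To make the flow described above concrete, here is a minimal sketch of a watcher that checks each log message against a list of rules and increments a per-problem counter. It is not the code from this PR: the names `Rule`, `Watcher`, `MustRule`, and `Observe` are hypothetical, the channel stands in for the journald reader, the sample messages are made up, and treating patterns as regular expressions is an assumption.

```go
package main

import (
	"fmt"
	"regexp"
)

// Rule pairs a problem name with a pattern to look for in log messages.
// Compiling the pattern as a regular expression is an assumption.
type Rule struct {
	Name    string
	Pattern *regexp.Regexp
}

// MustRule builds a Rule and panics if the pattern does not compile.
func MustRule(name, pattern string) Rule {
	return Rule{Name: name, Pattern: regexp.MustCompile(pattern)}
}

// Watcher counts how many times each rule has matched.
// It is not safe for concurrent use; a real watcher would need locking
// or a metrics library that handles it.
type Watcher struct {
	rules  []Rule
	counts map[string]int
}

func NewWatcher(rules []Rule) *Watcher {
	return &Watcher{rules: rules, counts: make(map[string]int)}
}

// Observe checks one message against every rule and increments the
// counter of each rule that matches.
func (w *Watcher) Observe(msg string) {
	for _, r := range w.rules {
		if r.Pattern.MatchString(msg) {
			w.counts[r.Name]++
		}
	}
}

// Run consumes messages until the channel is closed.
// The channel is a stand-in for the journald source.
func (w *Watcher) Run(events <-chan string) {
	for msg := range events {
		w.Observe(msg)
	}
}

// Count returns the number of matches recorded for a rule name.
func (w *Watcher) Count(name string) int { return w.counts[name] }

func main() {
	w := NewWatcher([]Rule{
		MustRule("nomad_restart", "Started nomad server"),
		MustRule("docker_restart", "Starting Docker Application Container Engine"),
	})

	// Made-up sample messages for illustration only.
	events := make(chan string, 3)
	events <- "Starting Docker Application Container Engine..."
	events <- "Started nomad server."
	events <- "Reached target Multi-User System."
	close(events)

	w.Run(events)
	fmt.Println(w.Count("docker_restart"), w.Count("nomad_restart")) // prints "1 1"
}
```

Keeping matching (`Observe`) separate from the event source (`Run`) means the rules can be exercised without journald, which may help with the testing and macOS development concerns.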

fcuny-rbx commented Jul 19, 2022

With the following configuration files:

vagrant@vagrant:~/go/src/github.com/Roblox/nomad-node-problem-detector$ sudo cat /var/lib/nnpd/*json
        {
                "type": "log",
                "source": "journald",
                "syslog_identifier": "systemd",
                "rules": [
                        {
                                "name": "nomad_restart",
                                "pattern": "Started nomad server"
                        },
                        {
                                "name": "docker_restart",
                                "pattern": "Starting Docker Application Container Engine"
                        },
                        {
                                "name": "auth",
                                "pattern": "pam_unix"
                        }
                ]
        }
        {
                "type": "log",
                "source": "journald",
                "syslog_identifier": "sudo",
                "rules": [
                        {
                                "name": "pam",
                                "pattern": "pam"
                        }
                ]
        }
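As a rough illustration of how configuration like the above could be loaded, the sketch below decodes it into Go structs with `encoding/json`. The field names come from the JSON shown; the type and function names (`LogConfig`, `RuleConfig`, `ParseConfigs`, `ParseConfigString`) are hypothetical and may not match the PR. Because `cat` prints the files back to back, the decoder loops until the input is exhausted.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// RuleConfig mirrors one entry of the "rules" array.
type RuleConfig struct {
	Name    string `json:"name"`
	Pattern string `json:"pattern"`
}

// LogConfig mirrors one configuration object.
type LogConfig struct {
	Type             string       `json:"type"`
	Source           string       `json:"source"`
	SyslogIdentifier string       `json:"syslog_identifier"`
	Rules            []RuleConfig `json:"rules"`
}

// ParseConfigs decodes one or more JSON objects from r and rejects
// any source other than journald, the only one mentioned so far.
func ParseConfigs(r io.Reader) ([]LogConfig, error) {
	dec := json.NewDecoder(r)
	var out []LogConfig
	for {
		var c LogConfig
		err := dec.Decode(&c)
		if err == io.EOF {
			return out, nil // no more objects in the stream
		}
		if err != nil {
			return nil, err
		}
		if c.Source != "journald" {
			return nil, fmt.Errorf("unsupported source %q", c.Source)
		}
		out = append(out, c)
	}
}

// ParseConfigString is a convenience wrapper for in-memory input.
func ParseConfigString(s string) ([]LogConfig, error) {
	return ParseConfigs(strings.NewReader(s))
}

func main() {
	// Two objects back to back, as `cat *json` would print them.
	input := `{"type":"log","source":"journald","syslog_identifier":"sudo","rules":[{"name":"pam","pattern":"pam"}]}
{"type":"log","source":"journald","syslog_identifier":"systemd","rules":[{"name":"nomad_restart","pattern":"Started nomad server"}]}`

	cfgs, err := ParseConfigString(input)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, c := range cfgs {
		fmt.Printf("%s: %d rule(s)\n", c.SyslogIdentifier, len(c.Rules))
	}
}
```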

And restarting both nomad and docker:

$ sudo systemctl restart docker
$ sudo systemctl restart nomad

we get the following metrics:

vagrant@vagrant:~/go/src/github.com/Roblox/nomad-node-problem-detector$ curl -s localhost:8083/v1/metrics/|grep log
# HELP npd_detector_log_problem_count Number of time a specific log problem was reported
# TYPE npd_detector_log_problem_count counter
npd_detector_log_problem_count{check="docker_restart"} 1
npd_detector_log_problem_count{check="nomad_restart"} 1
npd_detector_log_problem_count{check="pam"} 4

We can add more complex checks to report on other issues that show up in the journal (kernel hangs, hardware failures, etc.).

A few things worth mentioning:

  • this requires libsystemd-dev in order to work (there might be other ways to consume from journald; if that dependency is a non-starter, I'll look into them)
  • because it relies on journald, developing on a Mac is a bit more painful
  • there are no tests yet; I'll work on them if we agree on the approach
  • I'm happy to break this into smaller chunks if we want to introduce the change in small parts
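On the libsystemd-dev point, one possible alternative (an assumption on my part, not something this PR implements) is to read the output of `journalctl --follow --output=json` instead of linking against libsystemd. The sketch below parses that line-delimited JSON using the standard journal fields `MESSAGE` and `SYSLOG_IDENTIFIER`. The names `Entry`, `streamEntries`, `parseLines`, and `followJournal` are hypothetical, and the sample line in `main` is made up.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"os/exec"
	"strings"
)

// Entry holds the two journal fields this sketch cares about.
type Entry struct {
	Message          string `json:"MESSAGE"`
	SyslogIdentifier string `json:"SYSLOG_IDENTIFIER"`
}

// streamEntries reads one JSON object per line and passes each decoded
// entry to handle. Lines that do not decode into Entry are skipped.
func streamEntries(r io.Reader, handle func(Entry)) error {
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		var e Entry
		if err := json.Unmarshal(sc.Bytes(), &e); err != nil {
			continue
		}
		handle(e)
	}
	return sc.Err()
}

// parseLines is a helper for trying the parser on in-memory input.
func parseLines(s string) []Entry {
	var out []Entry
	_ = streamEntries(strings.NewReader(s), func(e Entry) { out = append(out, e) })
	return out
}

// followJournal starts journalctl for one syslog identifier and streams
// its entries to handle. It blocks until journalctl exits.
// It is not called from main, since it needs a host running systemd.
func followJournal(identifier string, handle func(Entry)) error {
	cmd := exec.Command("journalctl", "--follow", "--output=json", "--identifier="+identifier)
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		return err
	}
	if err := cmd.Start(); err != nil {
		return err
	}
	if err := streamEntries(stdout, handle); err != nil {
		_ = cmd.Process.Kill()
		_ = cmd.Wait()
		return err
	}
	return cmd.Wait()
}

func main() {
	// Made-up sample line in the shape journalctl emits with --output=json.
	sample := `{"MESSAGE":"Started nomad server","SYSLOG_IDENTIFIER":"systemd"}` + "\n"
	for _, e := range parseLines(sample) {
		fmt.Printf("%s: %s\n", e.SyslogIdentifier, e.Message)
	}
}
```

The trade-off is a dependency on the `journalctl` binary and a child process per watcher instead of a build-time dependency on the library. Because the parser only needs an `io.Reader`, it can also be fed recorded output on a machine without journald.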

@github-actions

CLA Signature Action: All authors have signed the CLA. You may need to manually re-run the blocking PR check if it doesn't pass in a few minutes.
