transparency in node health/status information and state changes #112

@FliesLikeABrick

While looking into how node liveness is determined (the API's alive_ipv4 and alive_ipv6 fields), I found myself wanting to estimate the age of a system's health data, and to understand whether the current API response reflects the system's actual health or whether a change is likely pending within the next 24 hours (the next ring-admin run). One or more of the following would be helpful:

  • On a system, store more than just the latest status.json. This would let any user on the node tell whether something on the system, such as IPv4 or IPv6 connectivity, is intermittently unhealthy. ring-health runs every 60 minutes from cron, but only the latest output is stored, at /var/www/ring/status.json. Keeping more history on the node would make it possible to investigate what has changed since the last report to the API.
  • In the API, perhaps add a method and route to access the data from the health table. Relatedly, can someone provide the schema for the health table, or add it to the SCHEMA in ring-admin?
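
To make the first item concrete, here is a rough sketch of how snapshots could be kept on the node. It is not existing ring-health code: the function name archive_status, the history directory, and the retention count are all hypothetical, and it assumes only that the latest output is a single JSON file such as /var/www/ring/status.json.

```python
import shutil
import time
from pathlib import Path


def archive_status(status_path, history_dir, keep=48, now=None):
    """Copy the current status file into history_dir and prune old snapshots.

    Hypothetical helper, not part of ring-health. Returns the snapshot path.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")

    status_path = Path(status_path)
    history_dir = Path(history_dir)
    history_dir.mkdir(parents=True, exist_ok=True)

    # Name each snapshot by UTC timestamp so lexical order equals age order.
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime(now))
    snapshot = history_dir / f"status-{stamp}.json"
    shutil.copy2(status_path, snapshot)

    # Keep only the newest `keep` snapshots; delete everything older.
    snapshots = sorted(history_dir.glob("status-*.json"))
    for old in snapshots[:-keep]:
        old.unlink()
    return snapshot
```

Called once after each hourly ring-health run, keep=48 would retain roughly two days of history, which is enough to compare against what the API last saw. The history location would need to be agreed on; it is left as a parameter here on purpose.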

With a bit of support, I can begin work on a PR for one or both of the items listed above, which could then enable research into some other contributions.

Other questions:

  • The logic in ring-admin that updates alive_v4 and alive_v6 in the machines table is contained in ansible_process(). What cron job or other trigger results in ansible_process() being called? The closest cron job I see is for purge machines; however, that appears to be a cleanup rather than a call to ansible_process().
  • ring-admin will skip marking nodes as dead_v4/dead_v6 if more than 10 are detected down in a single run. Is this reported anywhere visible, or does it go to /dev/null? The concern is that if more than 10 nodes legitimately fail in a single run (however often ansible_process() is run), ring-admin will never catch up on subsequent state changes unless enough machines recover.
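
For the second question, this is a minimal sketch of the kind of visibility I have in mind. It is not ring-admin's actual code: apply_liveness_changes, the mark_dead callback, and the logger name are hypothetical, and the only detail taken from ring-admin is the "more than 10 down in one run" threshold described above.

```python
import logging

log = logging.getLogger("ring-admin-sketch")  # hypothetical logger name

MAX_DOWN_PER_RUN = 10  # threshold as described in this issue


def apply_liveness_changes(newly_down, mark_dead, max_down=MAX_DOWN_PER_RUN):
    """Mark nodes dead unless too many went down in the same run.

    Returns True if the changes were applied, False if the run was skipped.
    """
    if len(newly_down) > max_down:
        # Make the skip visible instead of silently dropping the update.
        log.warning(
            "skipping liveness update: %d nodes down (limit %d): %s",
            len(newly_down),
            max_down,
            ", ".join(sorted(newly_down)),
        )
        return False

    for node in newly_down:
        mark_dead(node)
    return True
```

Returning a boolean and logging the affected node names would let an operator see that a run was skipped and decide whether the failures are real, rather than waiting for enough machines to recover.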
