Try to avoid isolated node split-brain #39

@kaiyou

There are two kinds of split-brain:

  • a fairly rare case of half-half split-brain, where half the nodes are completely disconnected from the other half for long enough that a split-brain occurs; this might sometimes happen if entire LANs are connected using wesher;
  • a much more common case of isolated-node split-brain, where a single node loses its connection to the rest of the cluster and acts as an isolated node from then on.

There could be a general fix for both of these, which involves keeping track of some super-nodes (maybe all known nodes?) and regularly trying to join these nodes to the memberlist with some kind of backoff mechanism, and maybe forgetting them after some (fairly long) time.

This would probably require some complex code that should not run inside the main loop, to avoid deadlocking, and quite frankly, it sounds scary to me. I would love to get into it later, but I am not familiar enough with the wesher code for now.

However, the second, more common case has a quick and (not so) dirty fix. If the memberlist becomes empty, it is usually safe to assume we are facing a split-brain, and more generally we know for sure we are at a dead end (until some node leaves or joins, that is). So I think it would be safe to simply fatal-exit. It is then the service manager's responsibility to handle restarting, if the admin requires it.
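A minimal sketch of that check follows. It is not the patch mentioned in this issue: `memberCounter`, `isolationGuard` and `fixedCount` are hypothetical names. It assumes the member count includes the local node, as memberlist's `NumMembers` does, so "empty" is taken to mean a count of 1 or less. It also assumes a node that never had peers (for example the first node of a new cluster) should not exit.

```go
package main

import "fmt"

// memberCounter is a hypothetical interface with the same shape as
// memberlist's NumMembers() method.
type memberCounter interface {
	NumMembers() int
}

// isolationGuard remembers whether this node has ever had peers, so that a
// freshly bootstrapped single-node cluster is not treated as a split-brain.
type isolationGuard struct {
	hadPeers bool
}

// check reports true when the node had peers earlier and now has none.
// Assumption: the count includes the local node, so 1 means "alone".
func (g *isolationGuard) check(c memberCounter) bool {
	if c.NumMembers() > 1 {
		g.hadPeers = true
		return false
	}
	return g.hadPeers
}

// fixedCount is a stub member list that reports a constant size.
type fixedCount int

func (f fixedCount) NumMembers() int { return int(f) }

func main() {
	g := &isolationGuard{}
	// Simulated sequence of member counts observed over time.
	for _, n := range []fixedCount{1, 3, 2, 1} {
		if g.check(n) {
			// A real daemon would fatal-exit here (for example with
			// log.Fatal) and leave the restart to the service manager.
			fmt.Println("member list became empty: would exit with a non-zero status")
			return
		}
		fmt.Println("members:", n.NumMembers())
	}
}
```

Exiting with a non-zero status is what lets a service manager's restart policy take over; the demo only prints so that it can be run safely.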

I have a working patch for this and have successfully tested it using systemd unit files. I still need to isolate the changes and submit a PR.

Labels: enhancement
