Description
There are two kinds of split-brain:
- a fairly rare half-and-half split-brain, where half the nodes are completely disconnected from the other half for long enough that a split-brain occurs; this might sometimes happen if entire LANs are connected using wesher;
- a much more common isolated-node split-brain, where a single node loses its connection to the rest of the cluster and acts as an isolated node from then on.
A general fix for both would involve keeping track of some super-nodes (maybe all known nodes?), regularly trying to join them back into the memberlist with some kind of backoff mechanism, and maybe forgetting them after some (fairly long) time.
This would probably require some complex code that should not run inside the main loop, to avoid deadlocking, and quite frankly, it sounds scary to me. I would love to get into it later, but I am not familiar enough with the wesher code for now.
However, the second, more common case has a quick and (not so) dirty fix. If the memberlist becomes empty, it is usually safe to assume we are facing a split-brain, and more generally we know for sure that we are at a dead end (until some node leaves or joins, that is). So I think it would be safe to simply fatal-exit. It is then the service manager's responsibility to handle restarting, if the admin requires it.
I have a working patch for this and have tested it successfully using systemd unit files. I still need to isolate the changes and open a PR.
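For illustration only (this is not the patch mentioned above), the empty-memberlist check could look roughly like the sketch below. The names `checkIsolation` and `exitIfIsolated` are hypothetical. The `hadPeers` flag is an assumption of this sketch: it keeps a freshly started node that has never had any peers from exiting right away.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// errIsolated signals that every peer has disappeared from the member list.
var errIsolated = errors.New("member list is empty: assuming isolated-node split-brain")

// checkIsolation returns errIsolated when a node that previously had peers
// now sees none. hadPeers is an assumption of this sketch, so that a node
// that never had peers is not treated as isolated.
func checkIsolation(hadPeers bool, peerCount int) error {
	if hadPeers && peerCount == 0 {
		return errIsolated
	}
	return nil
}

// exitIfIsolated is a hypothetical hook to call after each membership change.
// It exits with a non-zero status so a service manager can restart the process.
func exitIfIsolated(hadPeers bool, peerCount int) {
	if err := checkIsolation(hadPeers, peerCount); err != nil {
		fmt.Fprintln(os.Stderr, "fatal:", err)
		os.Exit(1)
	}
}
```

Under systemd, a unit with `Restart=on-failure` would then bring the process back up after the non-zero exit. Whether to restart at all stays the admin's choice.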