Overall theory of a heartbeat:
- Each node should have its own heartbeat.
- Failure detection works at the level of individual language entities (class instantiations, ports, etc.).
- Each node decides for itself whether a remote node has failed.
  - Failure detection is done locally on each node.
  - Nodes don't tell other nodes that a certain remote node has failed.
- If there is a failure (permFail), the language-level entities adapt, wait, restart, etc. (this is programmed on another level).
- A single slow node should not slow down the failure detection of other nodes.
- Failure detection should be as fast as possible.
- The failure detector should indicate a failure for a node only after the delay for that node has been increased a certain number of times.
- The heartbeat is adaptive:
  - A forever loop sends messages.
  - If no messages arrive within the delay, increase the delay.
  - If messages do arrive, decrease the delay.
  - Increase or decrease the delay so that it is just high enough for communication to succeed.
  - The delay is not a timeout, yet it should be as small as possible.
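To make the adaptive part concrete, here is a minimal Python sketch of the per-node delay bookkeeping described above. The class name `AdaptiveDelay`, the doubling/halving step, the floor, and the threshold of 5 are all assumptions chosen for illustration; the list above only says the delay goes up on silence, down on traffic, and that a fixed number of increases signals failure. Resetting the counter when messages arrive is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveDelay:
    """Per-node heartbeat delay: grows on silence, shrinks on traffic."""
    delay: float = 3.0       # seconds; initial value (assumed)
    min_delay: float = 0.1   # lower bound so the delay never reaches zero (assumed)
    factor: float = 2.0      # multiplicative step (assumed)
    max_increases: int = 5   # the "certain number of times" (assumed value)
    increases: int = 0       # consecutive increases so far

    def on_silence(self) -> bool:
        """No messages arrived within the delay.

        Returns True once the delay has been increased max_increases
        times in a row, i.e. the node should be reported as failed.
        """
        self.delay *= self.factor
        self.increases += 1
        return self.increases >= self.max_increases

    def on_messages(self) -> None:
        """Messages arrived: shrink the delay and reset the counter (assumed)."""
        self.delay = max(self.min_delay, self.delay / self.factor)
        self.increases = 0


d = AdaptiveDelay()
failed = [d.on_silence() for _ in range(5)]  # five silent rounds in a row
```

Keeping one such object per remote node is what stops a single slow node from affecting the detection speed of the others.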
Some points I want to clarify:
Two or more approaches are available. Note that each approach uses asynchronous I/O (e.g. ZeroMQ or nanomsg).
- single phase protocol approach
    for each remote node set its delay to say 3 secs to allow for initial TCP connections
    queue a heartbeat message to each remote node
    send queue
    spawn an Oz thread for each node and loop infinitely
        sleep for the time delay associated with remote node
        poll for heartbeats from that node
        if poll is empty
            increase node delay
            if node delay increased X times
                flag remote node with permFail state
            else
                flag remote node with tempFail state
            queue heartbeat message to remote node
        else
            flag remote node alive state
            decrease node delay (factor in how many heartbeats are there)
            queue actor/language_entity messages to remote node
            queue heartbeat message to remote node
        send queue
        poll for actor messages and process if any
        loop
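The body of the single phase loop can be sketched in Python as one function call per iteration (everything after the sleep). This is an illustration under assumptions, not the intended Oz implementation: `RemoteNode`, `single_phase_step`, `MAX_INCREASES`, the doubling step, and the `1 + heartbeats` shrink factor are hypothetical, and the transport is replaced by a plain list acting as the send queue.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    ALIVE = "alive"
    TEMP_FAIL = "tempFail"
    PERM_FAIL = "permFail"


@dataclass
class RemoteNode:
    name: str
    delay: float = 3.0      # initial delay allowing for TCP connection setup
    increases: int = 0      # consecutive delay increases (assumed semantics)
    state: State = State.ALIVE
    outbox: list = field(default_factory=list)  # stands in for the send queue


MAX_INCREASES = 3  # the "X" in the pseudocode; value assumed
MIN_DELAY = 0.1    # assumed floor for the delay


def single_phase_step(node, heartbeats, actor_messages):
    """One loop iteration, given how many heartbeats the poll returned."""
    if heartbeats == 0:
        node.delay *= 2
        node.increases += 1
        if node.increases >= MAX_INCREASES:
            node.state = State.PERM_FAIL
        else:
            node.state = State.TEMP_FAIL
    else:
        node.state = State.ALIVE
        node.increases = 0
        # Shrink faster when several heartbeats piled up during the sleep.
        node.delay = max(MIN_DELAY, node.delay / (1 + heartbeats))
        node.outbox.extend(actor_messages)
    node.outbox.append("heartbeat")  # queued in both branches
    return node.state


node = RemoteNode("b")
silent = [single_phase_step(node, 0, []) for _ in range(3)]
recovered = single_phase_step(node, 2, ["msg"])
```

In a real loop each node would run this in its own thread, sleeping for `node.delay` before every call and flushing `outbox` over the socket afterwards.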
- two phase protocol approach
    for each remote node set its delay to say 3 secs to allow for initial TCP connections
    queue a heartbeat message to each remote node
    send queue
    spawn an Oz thread for each node and loop infinitely
        sleep for the time delay associated with remote node
        poll for heartbeat responses from that node
        if poll is empty
            increase node delay
            if node delay increased X times
                flag remote node with permFail state
            else
                flag remote node with tempFail state
            queue heartbeat message to remote node
        else
            flag remote node alive state
            decrease node delay
            queue actor messages to remote node
            queue heartbeat message to remote node
        send queue
        poll for actor messages and process if any
        loop
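The difference from the single phase version is that the local node waits for a response to its own heartbeat rather than for any heartbeat the remote node happens to send. A hedged Python sketch of that request/response matching follows. The `("ping", seq)` / `("pong", seq)` message shape, the sequence numbers, and the names `PingTracker` and `answer` are assumptions; the pseudocode above does not define a wire format.

```python
from dataclasses import dataclass, field


@dataclass
class PingTracker:
    """Local side: remembers which heartbeats are still unanswered."""
    next_seq: int = 0
    pending: dict = field(default_factory=dict)  # seq -> time the ping was sent

    def make_ping(self, now):
        """Create a heartbeat request and record when it was sent."""
        seq = self.next_seq
        self.next_seq += 1
        self.pending[seq] = now
        return ("ping", seq)

    def on_pong(self, seq, now):
        """Return the round-trip time, or None for an unknown or duplicate reply."""
        sent = self.pending.pop(seq, None)
        return None if sent is None else now - sent


def answer(message):
    """Remote side: echo a ping back as a pong, ignore everything else."""
    kind, seq = message
    return ("pong", seq) if kind == "ping" else None


tracker = PingTracker()
ping = tracker.make_ping(now=10.0)
pong = answer(ping)
rtt = tracker.on_pong(pong[1], now=10.25)
```

A measured round-trip time like `rtt` could feed the delay adjustment, at the cost of one full round trip before the local node learns anything.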
I believe version two (the two phase protocol) will be slower because it has to wait for a full round trip, and heartbeats could also be lost on the wire. Version one (the single phase protocol) operates on the data at hand, so it should be faster both at detecting failure and at sending messages.
Please check the logic. Am I missing anything?