Oz Heartbeating #4

@sjmackenzie

Overall theory of a heartbeat:

  • Each node should have its own heartbeat.
  • Failure detection works at the level of single language entities (class instantiations, ports, etc.)
  • Each node determines if a remote node fails.
  • Failure detection is done locally on each node
  • Nodes don't tell other nodes that a certain remote node has failed
  • If there is a failure (permFail), the language-level entities adapt, wait, restart, etc. (this is programmed at another level)
  • A single slow node should not slow down the failure detection of other nodes
  • Failure detection should be as fast as possible
  • The failure detector should indicate a failure for a node only after that node's delay has been increased a certain number of times
  • The heartbeat is adaptive
  • The heartbeat is a forever loop sending messages
  • If no messages arrive within the delay, increase the delay
  • If there are messages, decrease the delay
  • Increase or decrease the delay so that it is just high enough for communication to succeed
  • The delay is not a timeout, yet it should be as small as possible
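To make the adaptive part concrete, here is a minimal sketch of the per-node delay bookkeeping described in the list above. It is only an illustration: the class name `NodeDelay`, the multiplicative step `factor`, the lower bound `min_delay` and the threshold of 5 increases are all assumptions of mine, not something the design has settled on. It also lets a node return from permFail to alive when heartbeats show up again, which may or may not be what is wanted.

```python
from dataclasses import dataclass

ALIVE, TEMP_FAIL, PERM_FAIL = "alive", "tempFail", "permFail"


@dataclass
class NodeDelay:
    """Adaptive delay and failure state for one remote node (hypothetical helper)."""
    delay: float = 3.0       # initial delay in seconds, allows for TCP connection setup
    min_delay: float = 0.05  # assumed lower bound so the delay never reaches zero
    max_increases: int = 5   # the "certain number of times"; the value is an assumption
    factor: float = 2.0      # assumed multiplicative step for increase and decrease
    increases: int = 0       # consecutive increases since the last heartbeat
    state: str = ALIVE

    def on_poll(self, heartbeats: int) -> str:
        """Update delay and state after one poll that found `heartbeats` messages."""
        if heartbeats == 0:
            # Nothing arrived within the delay: back off and count the miss.
            self.delay *= self.factor
            self.increases += 1
            if self.increases >= self.max_increases:
                self.state = PERM_FAIL
            else:
                self.state = TEMP_FAIL
        else:
            # Several queued heartbeats mean the delay was too long, so shrink harder.
            self.delay = max(self.min_delay, self.delay / self.factor ** heartbeats)
            self.increases = 0
            self.state = ALIVE
        return self.state
```

Because each remote node gets its own `NodeDelay`, a slow node only stretches its own delay and does not hold back detection for the others.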

Some points I want to clarify:

At least two approaches are available. Note that each approach uses asynchronous I/O (e.g. ZeroMQ or nanomsg).

  1. Single-phase protocol approach
for each remote node, set its delay to, say, 3 secs to allow for initial TCP connections
queue a heartbeat message to each remote node
send_queue
     spawn an oz thread for each node and loop infinitely
          sleep for the time delay associated with remote node
          poll for heartbeats from that node
          if poll is empty
               increase node delay
               if node delay increased X times
                    flag remote node with permFail state
               else
                    flag remote node with tempFail state
               queue heartbeat message to remote node
          else
               flag remote node alive state
               decrease node delay (factor in how many heartbeats are there)
               queue actor/language_entity messages to remote node
               queue heartbeat message to remote node
          send_queue
          poll for actor messages and process if any
          loop
  2. Two-phase protocol approach
for each remote node, set its delay to, say, 3 secs to allow for initial TCP connections
queue a heartbeat message to each remote node
send_queue
     spawn an oz thread for each node and loop infinitely
          sleep for the time delay associated with remote node
          poll for heartbeat responses from that node
          if poll is empty
               increase node delay
               if node delay increased X times
                    flag remote node with permFail state
               else
                    flag remote node with tempFail state
               queue heartbeat message to remote node
          else
               flag remote node alive state
               decrease node delay
               queue actor messages to remote node
               queue heartbeat message to remote node
          send_queue
          poll for actor messages and process if any
          loop

I believe version two will be slower, as it has to wait for a round trip, and heartbeats could also be lost on the wire. Version one operates on the data at hand, so it should be faster at detecting failure and sending messages.

Please check the logic. Am I missing anything?
