Gossip Protocol

Gossip Protocol

Some of these text may be interwined with specifities of the Gossip Protocol within Akka Cluster .

Context: TODO

At a regular interval, each member sends their view of the cluster state to a random node, including:

  • The status of each member
  • If each member has seen this version of the cluster state.

Eventually consistent as after some time (aka convergence), all nodes will share the same state. Each node decides if it has reached convergence.

After it, each node will have the same ordered set of nodes:

  • Node 1
  • Node 2
  • Node 3

Then the first eligible node, will become the leader and will perform leader’s duties such as notify member state changes

Note: that each leader may change after each round if the cluster membership has changed.

Member Lifecycle - Happy Path #

Joining -> Up -> Leaving -> Exiting -> Removed (tombstoned)

In this case:

  • Once all nodes know that all nodes know that a new node is “Joining”, it marks it as “Up”.
  • A leaving node sets itself as Leaving, then once all nodes know it, the leader marks it as Exiting.
  • Once all leaders know that a node is Exiting, the leader sets the node as Removed.

Member Lifecycle - Unhappy Path #

Joining -> Up -> Leaving -> Exiting -> Removed (tombstoned)
  |        |        |          |         /|\
  |--disconnected--------------|          |
 \|/                                      |
Unreachable (F) ----> Down (F) ------------

When a new member is disconnected, it is marked as Unreachable until it recovers. If it is permanent, it is flagged as Down and it moves to state Removed and can never come back.

Each node is monitored for failures by at most 5 nodes using Heartbeat . Once a member is deemed unreachable, then than information will be gossiped to the cluster. The member will remain like that until all nodes flag the node as Reachabe again.

Reasons:

  • Crashes
  • Heavy load
  • Network Partition or failure

Impact of Unreachable Nodes #

Convergence will not be possible therefore no leaders will be elected therefore new nodes cannot fully join the cluster. This will happen until the node is marked as Down (and then Removed).

In this moment, potential new members will have a state WeaklyUpMember, which can transition to Up once convergence is complete. Note that, this member can be used by applications but should not if consistency is important (Cluster Shards or Cluster Singletons).

TODO #

  • Check if Heartbeat and Gossip Protocol are always interwined.