Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

The "init" command is supposed to move the cluster from the "zombie" state into the "active" state. It is supposed to have the following characteristics (note that the "init" command has not been specified at the moment of writing and is out of scope of this document, so all statements are approximate and can change in the future):

...

This document uses a notation of "initialized" and "empty" nodes. An initialized node is a node that has received the "init" message , an empty node is a node that has entered the topology (i.e. has passed the first validation step), but has not yet configured its Meta Storage componentsometime in its lifetime and therefore possesses the cluster tag and the meta storage version. An empty node has never received the "init" command and does not possess the aforementioned properties.

Meta Storage version

Meta Storage version is a totally ordered property that should be used to compute the most "recent" state of the Meta Storage configuration. A possible implementation can be a monotonically increasing counter, which is increased every time the Meta Storage configuration (e.g. addresses of nodes that host the Meta Storage Raft group) is updated.

Cluster tag

A cluster tag is a string that uniquely identifies a cluster (e.g. a UUID). It is generated once per cluster and is distributed across the nodes during the "init" phase. The purpose of a cluster tag is to understand whether a joining node used to be a member of another cluster, in which case its Meta Storage version is not comparable and the joining node should be rejected.

...

  1. An empty node enters a cluster where all nodes are empty.
  2. An empty node enters a cluster where all nodes are initialized.
  3. An initialized node enters a cluster where all nodes are empty.
  4. An initialized node with a Meta Storage of version X enters a cluster where all nodes have the Meta Storage of version Y.

First validation step

As described. This step will be common for all scenarios regarding the state of the Meta Storage and looks like the following:

Common logic

A joining node tries to enter the topology. It is possible to piggyback on the transport of the membership protocol in order to exchange validation messages before allowing to send membership messages (similar to the handshake protocol). During this step it sends some information (cluster tag, Meta Storage version, node version) to a random node and gets validated (more details below). After this step is complete, the joining node becomes visible through the Topology Service, therefore establishing an invariant that visible topology will always consist of nodes that have passed the first validation step. Possible issues: there can be a race condition when multiple conflicting nodes join at the same time, in which case only the first node to join will be valid. This can be considered expected behavior, because such situations can only occur during the initial set up of a cluster, which is a manual process and requires manual intervention anyway.

Empty node joins a cluster

If an empty node tries to join a cluster the The following process is proposed as the join protocol.:

  1. It connects to a random node, sends the available local validation information and enters the topology, if it gets accepted.
  2. The following scenarios can then happen:
    1. The random node is initialized. The joining node should then retrieve the information, that was broadcasted by the "init" command, become initialized and finish the join process.
    2. The random node is empty because
    If
    1. the cluster has not yet been initialized
    (i.e. it has not received the "init" command), the node persists in the zombie
    1. . In this case the node finishes the join process and remains in the "zombie" state until the "init" command arrives
    (starting the process approximately described in <link>). After the command commenced, the nodes gain access to the Meta Storage and should be validated against it. Possible issues: it is unclear, what node should perform the validation step, maybe it should be the "init" command coordinator.
  3. If the cluster has been initialized, the node sends a request to a random node to obtain the information that had been propagated by the "init" command and to get validated against the Meta Storage.
  4. After the node passed both validation steps, it should be added to a list of valid nodes in the Meta Storage (the choice of the Meta Storage is dictated by the requirement that this list should be consistent on every node in the cluster).

...

    1. .
    2. The random node is empty because it hasn't finished the join process itself. In this case the random node should send a corresponding message, and the joining node should choose another random node and repeat the process.

Initialized node joins a cluster

If an initialized node tries to join a cluster the following process is proposed:

  1. It connects to a random node and sends the available local validation information (including the cluster tag and the Meta Storage version).
  2. The following scenarios can then happen:
    1. The random node is initialized and the cluster tags do not match. The joining node must be rejected.
    2. The random node is initialized, the cluster tags match, local Meta Storage version is "smaller" than the remote. The node joins the topology and updates its Meta Storage configuration, thus ending the joining process.
    3. The random node is initialized, the cluster tags match, local Meta Storage version is "larger" than the remote. ???
    4. The random node is empty because the cluster has not yet been initialized. ???
    5. The random node is empty because it hasn't finished the join process itself. In this case the random node should send a corresponding message, and the joining node should choose another random node and repeat the process.

Changes in API (WIP)

NetworkTopologyService

Current TopologyService  will be renamed to NetworkTopologyService . It is proposed to extend this service to add validation handlers that will validate the joining nodes on the network level.

...