Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  1. Cluster setup - the process of assembling a predefined set of nodes into a cluster and preparing it for the next lifecycle steps;
  2. Cluster initialization - according to the Node Lifecycle description, a cluster can exist in a "zombie" state, until an "init" command is sent by the system administratorafter setting up the cluster, it should be transferred to a state when it is ready to serve user requests;
  3. Node validation - in order to maintain the cluster configuration in a consistent state, the joining nodes should be compatible with all of the existing cluster nodes. Compatibility is ensured by validating nodes' properties, which may include some local properties (Ignite product version, Cluster Tag), as well as distributed properties from the Meta Storage. NB: it is currently unclear what properties should be retrieved from the Meta Storage, however, it was decided to keep this concept for possible future features implementation.

Terminology

Meta Storage

Meta Storage is a subset of cluster nodes hosting a Raft group responsible for storing  a master copy of cluster metadata.

...

Idle and running cluster

An idle cluster is a cluster that has been assembled for the first time and has never received the init command, therefore the Meta Storage of this cluster does not exist. Acluster can be considered running if it has obtained the information about the Meta Storage and CMG location.

Empty and initialized node

...

Baseline topology consists of nodes that have passed the validation step and are therefore able to participate in cluster-wide activities. Baseline topology is maintained by the Cluster Management Group.

Implementation details

Join Coordinator election

NOTE: this section is not described as thorough as it needs to be in order to save some time and finalize the election protocol after a discussion.

Before the nodes can start joining a cluster, a node should be elected as the Join Coordinator. For the sake of simplicity, the following algorithm is proposed, which can later be replaced with something more sophisticated:

  1. Given a list of initial cluster members, choose the "smallest" address (for example, using an alphanumeric order), which will implicitly be considered the Join Coordinator. This requires all nodes to have the IP Finder configuration (used to obtain the initial cluster member list) to be identical on all initial cluster members.
  2. If the "smallest" address is unavailable, all other nodes should fail to start after a timeout and should be manually restarted.

DISCUSSION NEEDED: What to do when constructing a cluster from some amount of stopped nodes with different Meta Storage configuration? Should it be overridden by the "init" command?

DISCUSSION NEEDED:  What if we are restarting a cluster and also introducing a new node? What if it is elected as the coordinator?

DISCUSSION NEEDED:  Can we actually use Raft and use its leader as the Join Coordinator? Can we use Raft's log to store the Meta Storage Topology Version? This way, it will be possible

TODO: describe coordinator re-election.

Protocol description

Initial cluster setup

This section describes the process of assembling a new Ignite cluster from a set of empty nodes.

  1. Initial set of nodes is configuredstarted, including providing the following properties:
    1. A subset of nodes (minimum 1, more can be specified to increase startup reliability) in the initial cluster setup (, provided by the IP Finder).
  2. A Join Coordinator is elected (see Join Coordinator election);
  3. The Join Coordinator generates a Cluster Tag (if it hasn't already been generated);
  4. All other nodes connect to the Coordinator and provide the following information:
    1. Ignite product version;
    2. Cluster Tag, if any (if a node has obtained it at any time during its life);
    3. Meta Storage Topology version, if any (if a node has obtained it at any time during its life).
  5. All of the aforementioned parameters get compared with the information stored on the Coordinator, and if all of the parameters are the same, the joining node is allowed into the cluster. Otherwise, the joining node is rejected.
  6. Join Coordinator adds the new node to the list of validated nodes. This can be implemented using different approaches, for example it is possible to piggyback on the current Cluster Membership implementation (ScaleCube) and to modify its transport protocol so that only valid nodes will be visible to others (similar to the implementation of the Handshake Protocol). Another way is to explicitly store and maintain a list of valid nodes in the Meta Storage.
  7. If the joining node is allowed to enter the topology, it receives the following parameters from the Coordinator:
    1. Cluster Tag.

...

    1. an IP Finder. Concrete IP Finder implementations are used to obtain the seed members from different environments and are specified either via the configuration or the CLI.
  1. The nodes assemble into a physical topology using a network discovery protocol (e.g. SWIM) and the provided seed members.
  2. An init command is sent by a user to a single node in the cluster, providing the following information:
    1. Addresses of the nodes that will host the Meta Storage;
    2. Addresses of the nodes that will comprise the Cluster Management Group. It is possible for both of these address sets to be the same.
  3. The node, that has received the command, propagates it to all members of the physical topology that were specified in the init command. These members should start the corresponding Raft groups and, after the group leaders are elected, the initial node should return a response to the user. In case of errors, Raft groups should be removed and an error response will be returned to the user. If no response has been received, the user should retry sending the command with the same parameters to a different node.  
  4. As soon as the CMG leader is elected, the leader initializes the CMG state by applying a Raft command (let's call it ClusterInitCommand), which includes:
    1. A generated Cluster Tag;
    2. Ignite product version.
  5. After the command has been applied, the leader sends a message to all nodes in the physical topology, containing the location of the CMG nodes.
  6. Upon receiving the message, each node sends a join request to the CMG leader, which consists of:
    1. Protocol version;
    2. Ignite product version;
  7. Information from the join requests gets validated on the leader and, if the properties are equal to the CMG state, a successful response is sent, containing:
    1. Addresses of the Meta Storage nodes;
    2. Cluster Tag.
      If the properties do not match, an error response is sent and the joining node is rejected.
  8. If a node is validated successfully, the CMG leader issues a Raft command (e.g. AddNodeCommand), which adds the node to the baseline topology.

Cluster initialization

This section describes the next step of the Ignite cluster lifecycle: moving the nodes from the "zombie" state into the "active" state.

...

Adding a new node to a running cluster

This section describes a scenario when a new node wants to join an initialized cluster. Depending on the node configuration, state of the node or the cluster there exist multiple 4 possible scenarios:

Empty node joins

...

an idle cluster

This scenario is equivalent to the initial cluster setup.

Initialized node joins an idle cluster

Such nodes should never be able to join the cluster, because it will fail the cluster tag validation step.

Empty node joins a running cluster

Initialized node joins a running cluster

Implementation details

TODO

If an empty node tries to join a cluster the following process is proposed:

  1. The new node connects to a random node. If this node is not the Join Coordinator, the random node redirects the new node to it.
  2. The new node provides its Ignite product version and is validated on the Coordinator (see Initial cluster setup).
  3. If the validation is successful, the new node is added to the list of validated nodes and receives the list of Meta Storage hosts as the response.

In case of errors or a timeout, the whole process should be repeated by the joining node.

Initialized node joins a cluster

If an initialized node tries to join a cluster the following process is proposed:

  1. The new node connects to a random node. If this node is not the Join Coordinator, the random node redirects the new node to it.
  2. The joining node connect to the Coordinator and provide the following information:
    1. Ignite product version;
    2. Cluster Tag;
    3. Meta Storage Topology Version.
  3. If the Cluster Tag and the Ignite product version do not match with the ones stored on the Coordinator, the node is rejected. If they match, the Meta Storage Topology Version is compared. If the node's Meta Storage Topology Version is smaller or equal to the Version on the Coordinator, it is allowed to enter the topology and the updates its Meta Storage Configuration.

DISCUSSION NEEDED: What to do if the joining node's Meta Storage Topology Version is greater than the Version stored on the Coordinator?

Changes in API

TODO

Risks and Assumptions

...