Motivation
Ignite clusters are required to have a dynamically changing topology due to the nature of distributed systems - nodes can fail and need to be restarted at any time, so new nodes can be introduced or removed from the cluster. Users expect that topology changes do not break the consistency of the cluster and it remains operational.
Description
Problem statement
This document describes the process of creating a cluster of Ignite nodes and adding new nodes to it. It describes the following concepts:
- Cluster setup - the process of assembling a predefined set of nodes into a cluster and preparing it for the next lifecycle steps;
- Cluster initialization - according to the Node Lifecycle description, a cluster can exist in a "zombie" state, until an "init" command is sent by the system administrator;
- Node validation - in order to maintain the cluster configuration in a consistent state, the joining nodes should be compatible with all of the existing cluster nodes. Compatibility is ensured by validating nodes' properties, which may include some local properties (Ignite product version, Cluster Tag), as well as distributed properties from the Meta Storage. NB: it is currently unclear what properties should be retrieved from the Meta Storage, however, it was decided to keep this concept for possible future features implementation.
Terminology
Meta Storage is a subset of cluster nodes hosting a Raft group responsible for storing a master copy of cluster metadata.
Cluster Management Group
Cluster Management Group or CMG is a subset of cluster nodes hosting a Raft group. CMG leader is responsible for orchestrating the node join process.
The init command moves a cluster from the idle state into the running state.
Initialized and empty nodes
An initialized node is a node that and therefore possesses the Cluster Tag and the location of the Meta Storage . and therefore does not possess the aforementioned properties.
Idle and running cluster
An idle cluster is a cluster that has been assembled for the first time and has never received the init command, therefore the Meta Storage of this cluster does not exist.
Empty and initialized node
Every Ignite node starts in the empty state. After joining the cluster and passing the validation step, a node obtains the location of the Meta Storage and transitions to the initialized state.
Cluster Tag
A cluster tag is a string that uniquely identifies a cluster. It is generated once per cluster and is distributed across the nodes during the node join process. The purpose of a cluster tag is to understand whether a joining node used to be a member of another cluster, in which case it should be rejected during
A cluster tag consists of two parts:
- . Its purpose is to make the debugging and error reporting easier.
- Its purpose is to ensure that Cluster Tags are different between different clusters.
Physical topology
Physical topology consists of nodes that can communicate with each other on the network level. Nodes enter the topology though a network discovery mechanism, currently provided by the SWIM protocol. However, such nodes may not yet have passed the validation step, so not all nodes from the physical topology can participate in cluster-wide activities. In order to do that, a node must enter the baseline topology.
Baseline topology
Baseline topology consists of nodes that have passed the validation step and are therefore able to participate in cluster-wide activities. Baseline topology is maintained by the Cluster Management Group.
Implementation details
Join Coordinator election
NOTE: this section is not described as thorough as it needs to be in order to save some time and finalize the election protocol after a discussion.
Before the nodes can start joining a cluster, a node should be elected as the Join Coordinator. For the sake of simplicity, the following algorithm is proposed, which can later be replaced with something more sophisticated:
- Given a list of initial cluster members, choose the "smallest" address (for example, using an alphanumeric order), which will implicitly be considered the Join Coordinator. This requires all nodes to have the IP Finder configuration (used to obtain the initial cluster member list) to be identical on all initial cluster members.
- If the "smallest" address is unavailable, all other nodes should fail to start after a timeout and should be manually restarted.
DISCUSSION NEEDED:
DISCUSSION NEEDED:
DISCUSSION NEEDED: This way, it will be possible
TODO: describe coordinator re-election.
Initial cluster setup
This section describes the process of assembling a new Ignite cluster from a set of nodes.
- Initial set of nodes is configured, including the following properties:
- (provided by the IP Finder).
- A Join Coordinator is elected (see Join Coordinator election);
- The Join Coordinator generates a Cluster Tag (if it hasn't already been generated);
- All other nodes connect to the Coordinator and provide the following information:
- All of the aforementioned parameters get compared with the information stored on the Coordinator, and if all of the parameters are the same, the joining node is allowed into the cluster. Otherwise, the joining node is rejected.
- Join Coordinator adds the new node to the list of s. This can be implemented using different approaches, for example it is possible to piggyback on the current Cluster Membership implementation (ScaleCube) and to modify its transport protocol so that only valid nodes will be visible to others (similar to the implementation of the Handshake Protocol). Another way is to explicitly store and maintain a list of valid nodes in the Meta Storage.
- If the joining node is allowed to enter the topology, it receives the following parameters from the Coordinator:
- Cluster Tag.
DISCUSSION NEEDED: What to do if the Coordinator dies during any step of the setup process.
Cluster initialization
This section describes the next step of the Ignite cluster lifecycle: moving the nodes from the "zombie" state into the "active" state.
- After the cluster has been established, it remains in the "zombie" state, until the "init" command arrives;
- "Init" command is sent by the administrator either directly to the Join Coordinator, or to any other node, in which case it should be redirected to the Join Coordinator;
- The "init" command should specify the following information:
- List of addresses of the nodes that should host the Meta Storage Raft group (a.k.a. Meta Storage Configuration).
- The Join Coordinator generates the initial Meta Storage Topology Version property (if it hasn't already been generated);
- The Join Coordinator atomically broadcasts the Meta Storage Configuration to all valid nodes in the topology. If this step is successful, then Meta Storage is considered to be initialized and available;
- The Join Coordinator persists the following information into the Meta Storage (therefore propagating it to all nodes):
- Meta Storage Topology Version.
DISCUSSION NEEDED: What to do if the Coordinator dies during any step of the initialization process.
Adding a new node to a running cluster
This section describes a scenario when a new node wants to join an initialized cluster. Depending on the node configuration, there exist multiple possible scenarios:
Empty node joins a cluster
If an empty node tries to join a cluster the following process is proposed:
- The new node connects to a random node. If this node is not the Join Coordinator, the random node redirects the new node to it.
- The new node provides its Ignite product version and is validated on the Coordinator (see Initial cluster setup).
- If the validation is successful, the new node is added to the list of validated nodes and receives the list of Meta Storage hosts as the response.
In case of errors or a timeout, the whole process should be repeated by the joining node.
Initialized node joins a cluster
If an initialized node tries to join a cluster the following process is proposed:
- The new node connects to a random node. If this node is not the Join Coordinator, the random node redirects the new node to it.
- The joining node connect to the Coordinator and provide the following information:
- Ignite product version;
- Cluster Tag;
- Meta Storage Topology Version.
- If the Cluster Tag and the Ignite product version do not match with the ones stored on the Coordinator, the node is rejected. If they match, the Meta Storage Topology Version is compared. If the node's Meta Storage Topology Version is smaller or equal to the Version on the Coordinator, it is allowed to enter the topology and the updates its Meta Storage Configuration.
DISCUSSION NEEDED: What to do if the joining node's Meta Storage Topology Version is greater than the Version stored on the Coordinator?
Changes in API
TODO
Risks and Assumptions
- "Init" command is not fully specified and can influence the design.
- Proposed implementation does not discuss message encryption and security credentials.
Discussion Links
TODO
Reference Links
- IEP-73: Node startup
- https://github.com/apache/ignite-3/blob/main/modules/runner/README.md
- IEP-67: Networking module
- IEP-61: Common Replication Infrastructure
Tickets
Unable to render Jira issues macro, execution error.