...
ID: IEP-77
Author: Aleksandr Polovtsev
Sponsor: Alexey Scherbakov
Created:
Status:
When building a cluster of Ignite nodes, users need to be able to place restrictions on the member nodes, based on cluster invariants, in order to avoid breaking the consistency of the cluster. Such restrictions may include having the same product version across the cluster and having consistent table and memory configurations. Ignite clusters are required to have a dynamically changing topology due to the nature of distributed systems: nodes can fail and need to be restarted at any time, so new nodes can be introduced into the cluster or removed from it. Users expect that topology changes do not break the consistency of the cluster and that it remains operational.
This document describes the process of a new node joining a cluster, which includes a validation phase, where a set of rules is applied to determine whether the incoming node is able to enter the current topology. Validation rules may include node-local information (e.g. the product version and the cluster tag) as well as cluster-wide information (discussion needed: are there any properties that need to be retrieved from the Meta Storage?), which means that the validation component may require access to the Meta Storage (it is assumed that the Meta Storage contains the consistent cluster-wide information, unless some other mechanism is proposed). The problem is that, according to the Node Lifecycle description, a cluster can exist in a "zombie" state, during which the Meta Storage is unavailable. This means that the validation process can be split into 2 steps:
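Based on the description above, the split can be sketched as follows. This is an illustrative interpretation, not the IEP's API: all class and method names here are assumptions, and the concrete rules for each step are placeholders (product version and cluster tag for the node-local step, Meta Storage Topology version for the cluster-wide step).

```java
import java.util.Optional;

/** Illustrative sketch of the two-step validation (names are hypothetical). */
class JoinValidator {
    /** Step 1: node-local checks that require no Meta Storage access, so they can run in the "zombie" state. */
    static Optional<String> validateLocal(String joiningVersion, String localVersion,
                                          String joiningClusterTag, String localClusterTag) {
        if (!joiningVersion.equals(localVersion)) {
            return Optional.of("Product version mismatch: " + joiningVersion);
        }
        // An empty node has no cluster tag yet, so a null tag passes this check.
        if (joiningClusterTag != null && !joiningClusterTag.equals(localClusterTag)) {
            return Optional.of("Node belonged to a different cluster: " + joiningClusterTag);
        }
        return Optional.empty(); // passed the node-local step
    }

    /** Step 2: cluster-wide checks, possible only after the Meta Storage becomes available. */
    static Optional<String> validateClusterWide(long joiningMsTopologyVersion, long currentMsTopologyVersion) {
        if (joiningMsTopologyVersion > currentMsTopologyVersion) {
            return Optional.of("Joining node has a newer Meta Storage Topology version");
        }
        return Optional.empty();
    }
}
```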
Apart from the 2-step validation, there are also the following questions that need to be addressed:
The "init" command is supposed to move the cluster from the "zombie" state into the "active" state. It is supposed to have the following characteristics (note that the "init" command has not been specified at the moment of writing and is out of scope of this document, so all statements are approximate and can change in the future):
This document uses the notions of "initialized" and "empty" nodes. An initialized node is a node that has received the "init" message at some point in its lifetime and therefore possesses the cluster tag and the Meta Storage Topology version. An empty node is a node that has never received the "init" command and does not possess these properties.
The Meta Storage Topology version is a property that should be used to determine the most "recent" state of a given Meta Storage configuration. At the moment of writing, a Meta Storage configuration consists of a list of cluster node names that host the Meta Storage Raft group. A possible implementation is a monotonically increasing counter, which is incremented each time this list is updated.
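The counter idea can be sketched as an immutable configuration object whose version grows with every update. The class and method names are illustrative, not part of the IEP:

```java
import java.util.List;

/** Illustrative sketch: a Meta Storage configuration versioned by a monotonically increasing counter. */
final class MetaStorageConfiguration {
    private final List<String> nodeNames; // names of the nodes hosting the Meta Storage Raft group
    private final long topologyVersion;   // incremented on every change of the node list

    MetaStorageConfiguration(List<String> nodeNames, long topologyVersion) {
        this.nodeNames = List.copyOf(nodeNames);
        this.topologyVersion = topologyVersion;
    }

    /** Returns a new configuration with the updated node list and an incremented version. */
    MetaStorageConfiguration update(List<String> newNodeNames) {
        return new MetaStorageConfiguration(newNodeNames, topologyVersion + 1);
    }

    long topologyVersion() {
        return topologyVersion;
    }
}
```

Comparing the counters of two configurations then tells which one is more recent, provided both belong to the same cluster.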
The node join process is proposed to be made centralized: a single node is granted the role of the Join Coordinator and is responsible for the following:
...
cluster of Ignite nodes and adding new nodes to it. It describes the following concepts:
Meta Storage is a subset of cluster nodes hosting a Raft group responsible for storing a master copy of cluster metadata.
Cluster Management Group or CMG is a subset of cluster nodes hosting a Raft group. CMG leader is responsible for orchestrating the node join process.
The init command is issued by a user with a CLI tool and moves a cluster from the idle state into the running state.
An idle cluster is a cluster that has been assembled for the first time and has never received the init command; therefore, the Meta Storage of this cluster does not exist. A cluster can be considered running if it has obtained the information about the Meta Storage and CMG location.
Every Ignite node starts in the empty state. After joining the cluster and passing the validation step, a node obtains the location of the Meta Storage and transitions to the initialized state.
...
A cluster tag is a string that uniquely identifies a cluster. It is generated once per cluster and is distributed across the nodes during the node join process. The purpose of a cluster tag is to determine whether a joining node used to be a member of another cluster, in which case its Meta Storage Topology version is not comparable and the joining node should be rejected. Together with the Meta Storage Topology version, it creates a partial ordering that allows comparing different configuration versions.
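The partial ordering can be illustrated with the following sketch: two states are comparable only when their cluster tags match; otherwise the comparison is undefined and the join must be rejected. The class and field names are hypothetical:

```java
import java.util.Optional;

/** Illustrative sketch of the (cluster tag, Meta Storage Topology version) partial order. */
final class ClusterState {
    final String clusterTag;
    final long msTopologyVersion;

    ClusterState(String clusterTag, long msTopologyVersion) {
        this.clusterTag = clusterTag;
        this.msTopologyVersion = msTopologyVersion;
    }

    /**
     * Compares two states by their Meta Storage Topology versions. Returns an empty
     * Optional when the states are incomparable, i.e. when the tags differ because
     * the nodes belong (or belonged) to different clusters.
     */
    Optional<Integer> compare(ClusterState other) {
        if (!clusterTag.equals(other.clusterTag)) {
            return Optional.empty(); // different clusters: versions are not comparable
        }
        return Optional.of(Long.compare(msTopologyVersion, other.msTopologyVersion));
    }
}
```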
A cluster tag consists of two parts:
Before the nodes can start joining a cluster, a node should be elected as the Join Coordinator. For the sake of simplicity, the following algorithm can be proposed, which can later be replaced with something more sophisticated:
discussion needed: What to do when constructing a cluster from some amount of stopped nodes with different Meta Storage configuration? Should it be overridden by the "init" command?
discussion needed: What if we are restarting a cluster and also introducing a new node? What if it is elected as the coordinator?
discussion needed: Can we actually use Raft and use its leader as the Join Coordinator? Can we use Raft's log to store the Meta Storage Topology Version? This way, it will be possible
TODO: describe coordinator re-election.
discussion needed: What to do if the Coordinator dies during any step of the setup process.
Physical topology consists of nodes that can communicate with each other on the network level. Nodes enter the topology through a network discovery mechanism, currently provided by the SWIM protocol. However, such nodes may not yet have passed the validation step, so not all nodes from the physical topology can participate in cluster-wide activities. In order to participate, a node must enter the logical topology.
Logical topology consists of nodes that have passed the validation step and are therefore able to participate in cluster-wide activities. Logical topology is maintained by the Cluster Management Group.
This section describes the process of assembling a new Ignite cluster from a set of empty nodes.
ClusterInitCommand
), which includes:
AddNodeCommand
), which adds the node to the logical topology.

An example of this flow can be found in the diagram below. Some initialization steps are omitted for Node C, as they are identical to the corresponding steps on Node B.
discussion needed: What to do if the Coordinator dies during any step of the initialization process.
This section describes a scenario when a new node wants to join an initialized cluster. Depending on the configuration and the state of the node or the cluster, there are 4 possible scenarios:
...
If an empty node tries to join a cluster, the following process is proposed:
...
This scenario is equivalent to the initial cluster setup.
Such nodes should never be able to join the cluster, because they will fail the cluster tag validation step.
...
discussion needed: What to do if the joining node's Meta Storage Topology Version is greater than the Version stored on the Coordinator.
The current TopologyService will be renamed to NetworkTopologyService. It is proposed to extend this service to add validation handlers that will validate the joining nodes on the network level.
/**
* Class for working with the cluster topology on the network level.
*/
public interface NetworkTopologyService {
/**
* This topology member.
*/
ClusterNode localMember();
/**
* All topology members.
*/
Collection<ClusterNode> allMembers();
/**
* Handlers for topology events (join, leave).
*/
void addEventHandler(TopologyEventHandler handler);
/**
* Returns a member by a network address
*/
@Nullable ClusterNode getByAddress(NetworkAddress addr);
/**
* Handlers for validating a joining node.
*/
void addValidationHandler(TopologyValidationHandler handler);
}
The new service will have the same API, but will work on top of the Meta Storage, and will provide methods to work with the list of validated nodes. In addition to that, it will perform the validation of incoming nodes against the Meta Storage, based on the registered validation handlers.
/**
* Class for working with the cluster topology on the Meta Storage level. Only fully validated nodes are allowed to be present in such topology.
*/
public interface TopologyService {
/**
* This topology member.
*/
ClusterNode localMember();
/**
* All topology members.
*/
Collection<ClusterNode> allMembers();
/**
* Handlers for topology events (join, leave).
*/
void addEventHandler(TopologyEventHandler handler);
/**
* Returns a member by a network address
*/
@Nullable ClusterNode getByAddress(NetworkAddress addr);
/**
* Handlers for validating a joining node.
*/
void addValidationHandler(TopologyValidationHandler handler);
}
TopologyService will depend on the MessagingService (to respond and listen to validation requests) and on the MetaStorageManager (for interacting with the Meta Storage).
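The shape of TopologyValidationHandler is not specified in this document, so the following sketch only illustrates how registered handlers might be applied to a joining node; the Handler signature, the ValidationPipeline class, and the tag-checking rule are all assumptions:

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch: applying registered validation handlers to a joining node. */
class ValidationPipeline {
    /** Hypothetical handler shape: returns null on success, or a rejection reason. */
    interface Handler {
        String validate(String joiningClusterTag);
    }

    private final List<Handler> handlers = new ArrayList<>();

    /** Mirrors the addValidationHandler method from the proposed interface above. */
    void addValidationHandler(Handler handler) {
        handlers.add(handler);
    }

    /** Runs all handlers in registration order; the first non-null result rejects the join. */
    String validateJoin(String joiningClusterTag) {
        for (Handler handler : handlers) {
            String error = handler.validate(joiningClusterTag);
            if (error != null) {
                return error;
            }
        }
        return null; // all handlers passed, the node may enter the logical topology
    }
}
```

A cluster tag check, for example, could be registered as one such handler, rejecting nodes whose tag does not match the local one.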
The following changes are proposed to the node start scenario with regard to the changes to the join protocol:
Each blue rectangle represents a start of a component, changes are marked in red and notable action points are marked in green.
According to the diagram, the following changes are proposed:
The RESTManager component is started earlier.
A CMGManager component, responsible for managing CMG interactions, is introduced.
A nodeRecoveryFinished action item is introduced. It is a step within a component's start process that denotes that a given node has finished its recovery and is ready to be included in the logical topology.

RestManager should be started earlier, since it is required to register REST message handlers early enough to handle the init command.
CMGManager is responsible for interacting with the CMG and should perform the following actions:
TODO
https://lists.apache.org/thread/4lor2vxkg6x94thprvcr0h19rkm6j1gt
Jira