
IDIEP-77

Author: Aleksandr Polovtsev
Sponsor: Alexey Scherbakov
Created:
Status: COMPLETED


Table of Contents

Motivation

When building a cluster of Ignite nodes, users need to be able to establish restrictions on the member nodes based on cluster invariants in order to avoid breaking the consistency of the cluster. Such restrictions may include having the same product version across the cluster and having consistent table and memory configurations. In addition, Ignite clusters are required to have a dynamically changing topology due to the nature of distributed systems: nodes can fail and need to be restarted at any time, so new nodes can be introduced into or removed from the cluster. Users expect that topology changes do not break the consistency of the cluster and that it remains operational.

Description

Problem statement

This document describes the process of a new node joining a cluster, which includes a validation phase, where a set of rules is applied to determine whether the incoming node is able to enter the current topology. Validation rules may include node-local information (e.g. the product version and the cluster tag) as well as cluster-wide information (discussion needed: are there any properties that need to be retrieved from the Meta Storage?), which means that the validation component may require access to the Meta Storage (it is assumed that the Meta Storage contains the consistent cluster-wide information, unless some other mechanism is proposed). The problem is that, according to the Node Lifecycle description, a cluster can exist in a "zombie" state, during which the Meta Storage is unavailable. This means that the validation process can be split into 2 steps:

  1. "Pre-init" validation: a joining node tries to enter the topology on the network level and gets validated against its local properties.
  2. "Post-init" validation: the cluster has received the "init" command, which activates the Meta Storage, and the joining node can be validated against the cluster-wide properties.

Apart from the 2-step validation, the following questions also need to be addressed:

  1. Where will the whole process happen: on the joining node itself or on an arbitrary remote node?
  2. How to deal with different configurations of the Meta Storage: the "most recent" configuration should be consistently delivered to all nodes in a cluster.

Terminology

Init command

The "init" command is supposed to move the cluster from the "zombie" state into the "active" state. It is supposed to have the following characteristics (note that the "init" command has not been specified at the moment of writing and is out of scope of this document, so all statements are approximate and can change in the future):

  1. It should deliver the following information: addresses of the nodes that host the Meta Storage Raft group, the Meta Storage Topology version and a cluster tag (described below); a rough model of this payload is sketched after this list.
  2. It should deliver this information atomically, i.e. either all nodes enter the "active" state or none.
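
For illustration only, the information carried by the "init" command could be modeled roughly as follows; the class and field names are assumptions, since the actual command format has not been specified yet.

Code Block
languagejava
import java.util.List;

/**
 * Hypothetical payload of the "init" command; all names here are illustrative,
 * since the actual command format has not been specified yet.
 */
public record InitMessage(
        /* Addresses of the nodes hosting the Meta Storage Raft group. */
        List<String> metaStorageNodeAddresses,
        /* Meta Storage Topology version at the moment the command is issued. */
        long metaStorageTopologyVersion,
        /* Cluster tag (see "Cluster Tag" below), represented as a plain string here. */
        String clusterTag) {
}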

Initialized and empty nodes

This document uses the notions of "initialized" and "empty" nodes. An initialized node is a node that has received the "init" message at some point in its lifetime and therefore possesses the cluster tag and the Meta Storage Topology version. An empty node is a node that has never received the "init" command and does not possess the aforementioned properties.

Meta Storage Topology version

Meta Storage Topology version is a property that should be used to compute the most "recent" state of a given Meta Storage configuration. At the moment of writing, Meta Storage configuration consists of a list of cluster node names that host the Meta Storage Raft group. A possible implementation can be a monotonically increasing counter, which is increased each time this list is updated.
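
A minimal sketch of such a counter, assuming a hypothetical configuration holder class:

Code Block
languagejava
import java.util.List;

/**
 * Hypothetical holder of the Meta Storage configuration: a list of node names
 * plus a monotonically increasing Meta Storage Topology version, incremented
 * on every update of the list.
 */
public class MetaStorageConfiguration {
    private List<String> metaStorageNodeNames;

    private long topologyVersion;

    public MetaStorageConfiguration(List<String> metaStorageNodeNames) {
        this.metaStorageNodeNames = List.copyOf(metaStorageNodeNames);
        this.topologyVersion = 1;
    }

    /** Replaces the node list and bumps the Meta Storage Topology version. */
    public synchronized void update(List<String> newNodeNames) {
        metaStorageNodeNames = List.copyOf(newNodeNames);
        topologyVersion++;
    }

    public synchronized long topologyVersion() {
        return topologyVersion;
    }

    public synchronized List<String> metaStorageNodeNames() {
        return metaStorageNodeNames;
    }
}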

Join Coordinator

The node join process is proposed to be made centralized: a single node is granted the role of the Join Coordinator and is responsible for the following:

...

This document describes the process of assembling a cluster of Ignite nodes and adding new nodes to it. It describes the following concepts:

  1. Cluster setup - the process of assembling a predefined set of nodes into a cluster and preparing it for the next lifecycle steps;
  2. Cluster initialization - after setting up the cluster, it should be transferred to a state when it is ready to serve user requests;
  3. Node validation - in order to maintain the cluster configuration in a consistent state, the joining nodes should be compatible with all of the existing cluster nodes.

Terminology

Meta Storage

Meta Storage is a subset of cluster nodes hosting a Raft group responsible for storing a master copy of cluster metadata.

Cluster Management Group

Cluster Management Group or CMG is a subset of cluster nodes hosting a Raft group. CMG leader is responsible for orchestrating the node join process.

Init command

The init command is issued by a user with a CLI tool and moves a cluster from the idle state into the running state.

Idle and running cluster

An idle cluster is a cluster that has been assembled for the first time and has never received the init command; therefore, the Meta Storage of this cluster does not exist. A cluster can be considered running if it has obtained the information about the Meta Storage and CMG location.

Empty and initialized node

Every Ignite node starts in the empty state. After joining the cluster and passing the validation step, a node obtains the location of the Meta Storage and transitions to the initialized state.

Cluster Tag

A cluster tag is a string that uniquely identifies a cluster. It is generated once per cluster and is distributed across the nodes during the node join process. The purpose of a cluster tag is to determine whether a joining node used to be a member of another cluster, in which case its Meta Storage Topology version is not comparable and the joining node should be rejected. Together with the Meta Storage Topology version, it creates a partial ordering that allows comparing different configuration versions.

A cluster tag consists of two parts (see the sketch after this list):

  1. Human-readable part (Cluster Name): a string property that is set by the system administrator. Its purpose is to make the debugging and error reporting easier.
  2. Unique part (Cluster ID): a generated unique string (e.g. a UUID). Its purpose is to ensure that cluster tags are different between different clusters.
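
For illustration, a cluster tag with these two parts could be represented roughly like this (the class and the factory method are assumptions):

Code Block
languagejava
import java.util.UUID;

/**
 * Hypothetical representation of a Cluster Tag: a human-readable cluster name
 * plus a generated unique identifier.
 */
public record ClusterTag(String clusterName, UUID clusterId) {
    /** Creates a new tag for a cluster with the given administrator-provided name. */
    public static ClusterTag create(String clusterName) {
        return new ClusterTag(clusterName, UUID.randomUUID());
    }
}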

Implementation details

Join Coordinator election

Before the nodes can start joining a cluster, a node should be elected as the Join Coordinator. For the sake of simplicity, the following algorithm can be proposed, which can later be replaced with something more sophisticated:

  1. Given a list of initial cluster members, choose the "smallest" address (for example, using an alphanumeric order, as sketched after this list); the corresponding node will implicitly be considered the Join Coordinator. This requires the IP Finder configuration (used to obtain the initial cluster member list) to be identical on all initial cluster members.
  2. If the "smallest" address is unavailable, all other nodes should fail to start after a timeout and should be manually restarted again.
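
A sketch of step 1, assuming node addresses are represented as plain strings and compared using their natural (alphanumeric) ordering:

Code Block
languagejava
import java.util.Collection;
import java.util.Comparator;

/** Illustrative selection of the Join Coordinator: the node with the "smallest" address wins. */
public final class JoinCoordinatorElection {
    private JoinCoordinatorElection() {
    }

    /** Returns the address of the implicit Join Coordinator for the given initial member list. */
    public static String coordinatorAddress(Collection<String> initialMemberAddresses) {
        return initialMemberAddresses.stream()
                .min(Comparator.naturalOrder())
                .orElseThrow(() -> new IllegalArgumentException("Initial member list must not be empty"));
    }
}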

discussion needed: What to do when constructing a cluster from some amount of stopped nodes with different Meta Storage configuration? Should it be overridden by the "init" command?

discussion needed:  What if we are restarting a cluster and also introducing a new node? What if it is elected as the coordinator?

discussion needed:  Can we actually use Raft and use its leader as the Join Coordinator? Can we use Raft's log to store the Meta Storage Topology Version? This way, it will be possible

TODO: describe coordinator re-election.

Initial cluster setup

  1. Initial set of nodes is configured, including the following properties:
    1. List of all nodes in the initial cluster setup (provided by the IP Finder).
  2. A Join Coordinator is elected (see "Join Coordinator election");
  3. Join Coordinator generates a Cluster Tag, if it doesn't have it already in its Vault (e.g. an existing cluster is being restarted);
  4. All other nodes connect to the Coordinator and provide the following information:
    1. Ignite product version;
    2. Cluster Tag, if any (if a node has obtained it at any time during its life);
    3. Meta Storage Topology version (if a node has obtained it at any time during its life).
  5. All of the aforementioned parameters get compared with the information stored on the Coordinator, and if all of the parameters are the same, the joining node is allowed into the cluster; otherwise, the joining node is rejected (a sketch of this check is shown after this list).
  6. Join Coordinator adds the new node to the list of validated nodes.
  7. If the joining node is allowed to enter the topology, it receives the following parameters from the Coordinator:
    1. Cluster Tag;
    2. Meta Storage Topology version (if any, see "Cluster initialization").
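
Step 5 could be expressed roughly as follows; the JoinInfo holder and the validator class are illustrative assumptions, not an actual API:

Code Block
languagejava
/** Hypothetical holder of the join parameters listed in step 4; the cluster tag and the version may be absent. */
record JoinInfo(String productVersion, String clusterTag, Long metaStorageTopologyVersion) {
}

/** Illustrative Coordinator-side check for step 5: all parameters must match the Coordinator's own values. */
class JoinValidator {
    private final JoinInfo coordinatorInfo;

    JoinValidator(JoinInfo coordinatorInfo) {
        this.coordinatorInfo = coordinatorInfo;
    }

    boolean isAllowedToJoin(JoinInfo joiningNodeInfo) {
        // Record equality compares all three parameters, treating absent (null) values as equal.
        return coordinatorInfo.equals(joiningNodeInfo);
    }
}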

discussion needed: What to do if the Coordinator dies during any step of the setup process.

Cluster initialization

  1. After the cluster has been established, it remains in the "zombie" state, until the "init" command arrives;
  2. "Init" command is sent by the administrator either directly to the Join Coordinator, or to any other node, in which case it should be redirected to the Join Coordinator;
  3. The "init" command should specify the following information:
    1. Human-readable part of the Cluster Tag;
    2. List of addresses of the nodes that should host the Meta Storage Raft group (a.k.a. Meta Storage Configuration).
  4. The Join Coordinator completes the creation of the Cluster Tag by generating the unique part and generates the initial Meta Storage Topology Version property;
  5. The Join Coordinator atomically broadcasts the Meta Storage Configuration to all valid nodes in the topology. If this step is successful, then Meta Storage is considered to be initialized and available;
  6. The Join Coordinator persists the following information into the Meta Storage (therefore propagating it to all nodes):
    1. Cluster Tag;
    2. List of addresses of all nodes that have passed the initial validation;
    3. Meta Storage Configuration Version.

Physical topology

Physical topology consists of nodes that can communicate with each other on the network level. Nodes enter the topology through a network discovery mechanism, currently provided by the SWIM protocol. However, such nodes may not yet have passed the validation step, so not all nodes from the physical topology can participate in cluster-wide activities. In order to do that, a node must enter the logical topology.

Logical topology

Logical topology consists of nodes that have passed the validation step and are therefore able to participate in cluster-wide activities. Logical topology is maintained by the Cluster Management Group.

Protocol description

Initial cluster setup

This section describes the process of assembling a new Ignite cluster from a set of empty nodes.

  1. Initial set of nodes is started, providing the following properties:
    1. A subset of nodes (minimum 1, more can be specified to increase startup reliability) in the initial cluster setup, provided by an IP Finder. Concrete IP Finder implementations can be used to obtain the seed members, depending on the environment the cluster is running in (e.g. conventional networks or Kubernetes), and are specified either via the configuration or the CLI.
  2. The nodes assemble into a physical topology using a network discovery protocol (e.g. SWIM), bootstrapped with the provided seed members.
  3. An init command is sent by a user to a single node in the cluster, providing the following information:
    1. Consistent IDs (names) of the nodes that will host the Meta Storage;
    2. Consistent IDs (names) of the nodes that will comprise the Cluster Management Group. It is possible for both of these address sets to be the same.
  4. The node that has received the command propagates it to all members of the physical topology that were specified in the init command. These members should start the corresponding Raft groups and, after the group leaders are elected, the initial node should return a response to the user. In case of errors, the Raft groups should be removed and an error response should be returned to the user. If no response has been received, the user should retry sending the command with the same parameters to a different node.
  5. As soon as the CMG leader is elected, the leader initializes the CMG state by applying a Raft command (e.g. ClusterInitCommand), which includes:
    1. A generated Cluster Tag;
    2. Ignite product version.
  6. After the command has been applied, the leader sends a message to all nodes in the physical topology, containing the location of the CMG nodes. At this point the cluster can be considered as running.
  7. Upon receiving the message, each node sends a join request to the CMG leader, which consists of:
    1. Protocol version (an integer which is increased every time the join procedure is changed, needed to support backwards compatibility);
    2. Ignite product version;
  8. Information from the join requests gets validated on the leader and, if the properties are equal to the CMG state (in the case of the protocol version, a different comparison algorithm might be used), a successful response is sent (see the message sketch after this list), containing:
    1. Consistent IDs (names) of the Meta Storage nodes;
    2. Cluster Tag.
      If the properties do not match, an error response is sent and the joining node is rejected.
  9. If the joining node has passed the validation and received the validation response, it starts some local recovery procedures (if necessary) and sends a response to the CMG leader, indicating that it is ready to be added to the logical topology.
  10. The CMG leader issues a Raft command (e.g. AddNodeCommand), which adds the node to the logical topology.
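
For illustration, the join request and the successful validation response exchanged in steps 7-8 might be modeled as follows; all message, field and class names are assumptions:

Code Block
languagejava
import java.util.List;

/** Hypothetical join request sent by a joining node to the CMG leader (step 7). */
record JoinRequest(int protocolVersion, String productVersion) {
}

/** Hypothetical successful validation response sent by the CMG leader (step 8). */
record JoinResponse(List<String> metaStorageNodeNames, String clusterTag) {
}

/** Illustrative leader-side check against the CMG state (step 8). */
class JoinRequestValidator {
    private final int supportedProtocolVersion;

    private final String clusterProductVersion;

    JoinRequestValidator(int supportedProtocolVersion, String clusterProductVersion) {
        this.supportedProtocolVersion = supportedProtocolVersion;
        this.clusterProductVersion = clusterProductVersion;
    }

    boolean isValid(JoinRequest request) {
        // A different comparison algorithm might be used for the protocol version (see step 8).
        return request.protocolVersion() == supportedProtocolVersion
                && request.productVersion().equals(clusterProductVersion);
    }
}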

An example of this flow can be found on the diagram below. Some initialization steps are omitted for Node C as they are identical to the corresponding steps on Node B.

(Diagram: initial cluster setup flow.)

discussion needed: What to do if the Coordinator dies during any step of the initialization process.

Adding a new node to a running cluster

This section describes a scenario where a new node wants to join an initialized cluster. Depending on the node configuration and the state of the node or the cluster, there are 4 possible scenarios:

Empty node joins an idle cluster

If an empty node tries to join a cluster the following process is proposed:

  1. The new node connects to a random node. If this node is not the Join Coordinator, the random node redirects the new node to it.
  2. The new node provides its Ignite product version and is validated on the Coordinator (see "Initial cluster setup").
  3. If the validation is successful, the new node is added to the list of validated nodes and receives the Meta Storage Configuration as the response.

...

This scenario is equivalent to the initial cluster setup.

Initialized node joins an idle cluster

Such a node should never be able to join the cluster, because it will fail the cluster tag validation step.

Empty node joins a running cluster

  1. The new node enters the physical topology;
  2. CMG leader discovers the new node and sends a message to it, containing the location of the CMG nodes;
  3. Upon receiving the message, the joining node should execute the validation procedure and be added to the logical topology, as described by steps 7-9 of the Initial cluster setup section.

Initialized node joins a running cluster

  1. The new node connects to a random node. If this node is not the Join Coordinator, the random node redirects the new node to it.
  2. The joining node connects to the Coordinator and provides the following information:
    1. Ignite product version;
    2. Cluster Tag;
    3. Meta Storage Topology Version.
  3. If the Cluster Tag and the Ignite product version do not match the ones stored on the Coordinator, the node is rejected. If they match, the Meta Storage Topology Version is compared: if the node's Meta Storage Topology Version is smaller than or equal to the Version on the Coordinator, it is allowed to enter the topology and updates its Meta Storage Configuration (a sketch of this check is shown after this list).
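
A sketch of the check performed in step 3; the class and method names are illustrative:

Code Block
languagejava
import java.util.Objects;

/** Illustrative Coordinator-side check for an initialized node joining a running cluster. */
class InitializedNodeValidator {
    private final String clusterTag;

    private final String productVersion;

    private final long metaStorageTopologyVersion;

    InitializedNodeValidator(String clusterTag, String productVersion, long metaStorageTopologyVersion) {
        this.clusterTag = clusterTag;
        this.productVersion = productVersion;
        this.metaStorageTopologyVersion = metaStorageTopologyVersion;
    }

    boolean isAllowedToJoin(String nodeClusterTag, String nodeProductVersion, long nodeTopologyVersion) {
        // The Cluster Tag and the product version must match exactly.
        if (!Objects.equals(clusterTag, nodeClusterTag) || !Objects.equals(productVersion, nodeProductVersion)) {
            return false;
        }

        // The node's Meta Storage Topology Version must not be newer than the Coordinator's;
        // the "greater than" case is an open question (see below).
        return nodeTopologyVersion <= metaStorageTopologyVersion;
    }
}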

discussion needed: What to do if the joining node's Meta Storage Topology Version is greater than the Version stored on the Coordinator.

Changes in API (WIP)

NetworkTopologyService

The current TopologyService will be renamed to NetworkTopologyService. It is proposed to extend this service with validation handlers that will validate the joining nodes on the network level (a hypothetical usage sketch is shown after the interface below).

Code Block
languagejava
/**
 * Class for working with the cluster topology on the network level.
 */
public interface NetworkTopologyService {
    /**
     * This topology member.
     */
    ClusterNode localMember();

    /**
     * All topology members.
     */
    Collection<ClusterNode> allMembers();

    /**
     * Handlers for topology events (join, leave).
     */
    void addEventHandler(TopologyEventHandler handler);

    /**
     * Returns a member by a network address
     */
    @Nullable ClusterNode getByAddress(NetworkAddress addr);

    /**
     * Handlers for validating a joining node.
     */
    void addValidationHandler(TopologyValidationHandler handler);
}
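
A hypothetical usage sketch follows. The shape of TopologyValidationHandler is not defined in this document, so a single-method interface returning a boolean verdict is assumed here, and the allow-list check (as well as the ClusterNode.name() accessor) is purely illustrative.

Code Block
languagejava
import java.util.Set;

/** Assumed shape of the handler: a single node-local check returning a verdict. The actual interface is to be defined. */
@FunctionalInterface
interface TopologyValidationHandler {
    boolean validate(ClusterNode joiningNode);
}

/** Illustrative registration of a node-local validation rule: only nodes from an allow-list may join. */
class ValidationHandlerExample {
    static void registerAllowListHandler(NetworkTopologyService topologyService, Set<String> allowedNodeNames) {
        topologyService.addValidationHandler(joiningNode -> allowedNodeNames.contains(joiningNode.name()));
    }
}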

TopologyService

The new service will have the same API, but will work on top of the Meta Storage, and will provide methods to work with the list of validated nodes. In addition to that, it will perform the validation of incoming nodes against the Meta Storage, based on the registered validation handlers.

Code Block
languagejava
/**
 * Class for working with the cluster topology on the Meta Storage level. Only fully validated nodes are allowed to be present in such topology.
 */
public interface TopologyService {
    /**
     * This topology member.
     */
    ClusterNode localMember();

    /**
     * All topology members.
     */
    Collection<ClusterNode> allMembers();

    /**
     * Handlers for topology events (join, leave).
     */
    void addEventHandler(TopologyEventHandler handler);

    /**
     * Returns a member by a network address
     */
    @Nullable ClusterNode getByAddress(NetworkAddress addr);

    /**
     * Handlers for validating a joining node.
     */
    void addValidationHandler(TopologyValidationHandler handler);
}

TopologyService will depend on the MessagingService  (to respond and listen to validation requests) and on the MetaStorageManager (for interacting with the Meta Storage).

With the CMG-based protocol, an initialized node joins a running cluster as follows:

  1. The new node starts the CMG Raft server or client, depending on the existing local node configuration;
  2. The new node enters the physical topology;
  3. If the new node is elected CMG leader, it should use the local CMG state to validate itself, otherwise see point 4.
  4. CMG leader discovers the new node and sends a message to it, containing the location of the CMG nodes;
  5. Upon receiving the message, the joining node should execute the validation procedure and be added to the logical topology, as described by steps 7-9 of the Initial cluster setup section.

Implementation details

Node start flow

The following changes are proposed to the node start scenario with regard to the changes in the join protocol:

(Diagram: proposed node start flow.)

Each blue rectangle represents the start of a component; changes are marked in red and notable action points are marked in green.

According to the diagram, the following changes are proposed:

  1. The RESTManager component is started earlier.
  2. A CMGManager component, responsible for managing CMG interactions, is introduced.
  3. A nodeRecoveryFinished action item is introduced. It is a step within the components' start process that denotes that a given node has finished its recovery and is ready to be included in the logical topology.

RestManager changes

RestManager should be started earlier, since it is required to register REST message handlers early to handle the init command.

(Diagram: RestManager start order changes.)

CMGManager

CMGManager is responsible for interacting with the CMG and should perform the following actions (a skeleton is sketched after this list):

  1. Launch the local CMG Raft server if a local CMG configuration already exists;
  2. Register a REST message handler for the init command;
  3. Register a message handler that will be listening for messages from the CMG leader and sending a join request.
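
A skeleton of the component reflecting these three responsibilities might look roughly as follows; all member names are illustrative and the wiring to the REST and messaging components is omitted.

Code Block
languagejava
/**
 * Illustrative skeleton of the CMGManager component; method names are assumptions
 * and the wiring to the REST and messaging components is omitted.
 */
public class CMGManager {
    /** Starts the local CMG Raft server if a local CMG configuration already exists. */
    public void startLocalCmgRaftServerIfConfigured() {
        // 1. Read the locally persisted CMG configuration, if any.
        // 2. If present, start the CMG Raft server with that configuration.
    }

    /** Registers a REST handler that accepts the init command issued by the CLI tool. */
    public void registerInitCommandHandler() {
        // Propagate the received init command to the nodes listed in it (see "Initial cluster setup").
    }

    /** Registers a message handler that listens for the CMG leader and sends a join request. */
    public void registerCmgLeaderListener() {
        // Upon receiving the CMG location, send a join request and wait for the validation response.
    }
}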

Changes in API

TODO

Risks and Assumptions

  1. "Init" command is not fully specified and can influence the design.
  2. The proposed implementation does not discuss message encryption and security credentials;
  3. The protocol for nodes leaving the topology is out of scope of this document;
  4. The protocol for migrating the CMG and the Meta Storage to different nodes is out of scope of this document;
  5. The two-layered topology view may be confusing to use.

Discussion Links

https://lists.apache.org/thread/4lor2vxkg6x94thprvcr0h19rkm6j1gt

Reference Links

  1. IEP-73: Node startup
  2. https://github.com/apache/ignite-3/blob/main/modules/runner/README.md
  3. IEP-67: Networking module
  4. IEP-61: Common Replication Infrastructure

Tickets

IGNITE-15114 (https://issues.apache.org/jira/browse/IGNITE-15114)