ID	IEP-77
Author	Aleksandr Polovtsev
Sponsor
Created	03 Aug 2021
Status	DRAFT

Motivation

When building a cluster of Ignite nodes, users need to be able to establish some restrictions on the member nodes based on cluster invariants in order to avoid breaking the consistency of the cluster. Such restrictions may include having the same product version across the cluster and having consistent table and memory configurations.

Description

Problem statement

This document describes the process of a new node joining a cluster. The process includes a validation phase, in which a set of rules is applied to determine whether the incoming node is able to enter the current topology. Validation rules may include node-local information (e.g. product version and the cluster tag) as well as cluster-wide information (discussion needed: are there any properties that need to be retrieved from the Meta Storage?), which means that the validation component may require access to the Meta Storage (it is assumed that Meta Storage contains the consistent cluster-wide information, unless some other mechanism is proposed). The problem is that, according to the Node Lifecycle description, a cluster can exist in a "zombie" state, during which the Meta Storage is unavailable. This means that the validation process can be split into 2 steps:

"Pre-init" validation: a joining node tries to enter the topology on the network level and gets validated against its local properties.
"Post-init" validation: the cluster has received the "init" command, which activates the Meta Storage, and the joining node can be validated against the cluster-wide properties.

Apart from the 2-step validation, there are also the following questions that need to be addressed:

Where will the whole process happen: on the joining node itself or on an arbitrary remote node?
How to deal with different configurations of the Meta Storage: the "most recent" configuration should be consistently delivered to all nodes in a cluster.

Terminology

Init command

The "init" command is supposed to move the cluster from the "zombie" state into the "active" state. It is supposed to have the following characteristics (note that the "init" command has not been specified at the moment of writing and is out of scope of this document, so all statements are approximate and can change in the future):

It should deliver the following information: addresses of the nodes that host the Meta Storage Raft group, Meta Storage Topology version and a cluster tag (described below).
It should deliver this information atomically, i.e. either all nodes enter the "active" state or none.

Initialized and empty nodes

This document uses the notions of "initialized" and "empty" nodes. An initialized node is a node that has received the "init" command at some point in its lifetime and therefore possesses the cluster tag and the Meta Storage Topology version. An empty node is a node that has never received the "init" command and does not possess the aforementioned properties.

Meta Storage Topology version

Meta Storage Topology version is a property that should be used to determine the most "recent" state of a given Meta Storage configuration. At the moment of writing, Meta Storage configuration consists of a list of cluster node names that host the Meta Storage Raft group. A possible implementation can be a monotonically increasing counter, which is incremented each time this list is updated.
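The counter-based approach could look roughly like the sketch below. The class name MetaStorageConfiguration, its methods and the choice to start counting at 1 are assumptions made for illustration only and are not part of the proposal.

```java
import java.util.List;

/** Hypothetical immutable snapshot of the Meta Storage configuration (illustration only). */
final class MetaStorageConfiguration {
    /** Monotonically increasing version; a larger value means a more recent configuration. */
    final long topologyVersion;

    /** Names of the nodes that host the Meta Storage Raft group. */
    final List<String> nodeNames;

    private MetaStorageConfiguration(long topologyVersion, List<String> nodeNames) {
        this.topologyVersion = topologyVersion;
        this.nodeNames = List.copyOf(nodeNames);
    }

    /** Creates the first configuration; starting at 1 is an arbitrary choice. */
    static MetaStorageConfiguration initial(List<String> nodeNames) {
        return new MetaStorageConfiguration(1L, nodeNames);
    }

    /** Returns a new snapshot with the updated node list and the version increased by one. */
    MetaStorageConfiguration withNodes(List<String> newNodeNames) {
        return new MetaStorageConfiguration(topologyVersion + 1, newNodeNames);
    }

    /** Compares two configurations of the same cluster by their versions. */
    boolean isMoreRecentThan(MetaStorageConfiguration other) {
        return topologyVersion > other.topologyVersion;
    }
}
```

Making the snapshot immutable means a version can never be observed together with a node list it does not belong to.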

Join Coordinator

The node join process is proposed to be made centralized: a single node is granted the role of the Join Coordinator and is responsible for the following:

Every new joining node gets redirected to the Coordinator to get validated and to obtain the Meta Storage configuration.
The "init" command can be sent to the Coordinator, which then atomically broadcasts it.

Cluster Tag

A cluster tag is a string that uniquely identifies a cluster. It is generated once per cluster and is distributed across the nodes during the "init" phase. The purpose of a cluster tag is to understand whether a joining node used to be a member of another cluster, in which case its Meta Storage Topology version is not comparable and the joining node should be rejected. Together with the Meta Storage Topology version, it creates a partial ordering that makes it possible to compare different configuration versions.

A cluster tag should consist of two parts:

Human-readable part: a string property that is set by the system administrator. Its purpose is to make the debugging and error reporting easier.
Unique part: a generated unique string (e.g. a UUID). Its purpose is to ensure that cluster tags are different between different clusters.
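The partial ordering mentioned above can be illustrated with a minimal sketch. The classes ClusterTag and ConfigurationVersion, as well as the compareWith method, are hypothetical names introduced here; the sketch also assumes that two tags are equal only if both parts match.

```java
import java.util.Objects;
import java.util.OptionalInt;
import java.util.UUID;

/** Hypothetical two-part cluster tag (illustration only). */
final class ClusterTag {
    final String name;   // human-readable part, set by the administrator
    final UUID uniqueId; // generated unique part

    ClusterTag(String name, UUID uniqueId) {
        this.name = Objects.requireNonNull(name);
        this.uniqueId = Objects.requireNonNull(uniqueId);
    }

    /** Generates a tag with a random unique part. */
    static ClusterTag generate(String name) {
        return new ClusterTag(name, UUID.randomUUID());
    }

    @Override public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ClusterTag)) return false;
        ClusterTag other = (ClusterTag) o;
        return name.equals(other.name) && uniqueId.equals(other.uniqueId);
    }

    @Override public int hashCode() {
        return Objects.hash(name, uniqueId);
    }
}

/** Hypothetical pair of a cluster tag and a Meta Storage Topology version. */
final class ConfigurationVersion {
    final ClusterTag tag;
    final long topologyVersion;

    ConfigurationVersion(ClusterTag tag, long topologyVersion) {
        this.tag = Objects.requireNonNull(tag);
        this.topologyVersion = topologyVersion;
    }

    /**
     * Partial order: returns an empty result if the tags differ (the versions are
     * not comparable), otherwise a negative, zero or positive number.
     */
    OptionalInt compareWith(ConfigurationVersion other) {
        if (!tag.equals(other.tag)) {
            return OptionalInt.empty();
        }
        return OptionalInt.of(Long.compare(topologyVersion, other.topologyVersion));
    }
}
```

An empty result corresponds to the case where the joining node belonged to another cluster and should be rejected.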

Implementation details

Join Coordinator election

Before the nodes can start joining a cluster, a node should be elected as the Join Coordinator. For the sake of simplicity, the following algorithm can be proposed, which can later be replaced with something more sophisticated:

Given a list of initial cluster members, choose the "smallest" address (for example, using an alphanumeric order), which will implicitly be considered the Join Coordinator. This requires the IP Finder configuration (used to obtain the initial cluster member list) to be identical on all initial cluster members.
If the "smallest" address is unavailable, all other nodes should fail to start after a timeout and should be restarted manually.
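A minimal sketch of the "smallest address" rule is shown below. It assumes that addresses are represented as plain "host:port" strings and compared lexicographically; the class and method names are hypothetical.

```java
import java.util.Collection;
import java.util.Comparator;
import java.util.Optional;

/** Hypothetical helper that picks the Join Coordinator (illustration only). */
final class JoinCoordinatorElection {
    private JoinCoordinatorElection() {
    }

    /**
     * Returns the lexicographically smallest address, or an empty result
     * if the member list is empty.
     */
    static Optional<String> electCoordinator(Collection<String> initialMembers) {
        return initialMembers.stream().min(Comparator.naturalOrder());
    }
}
```

Note that a lexicographic order places "10.0.0.10:3344" before "10.0.0.2:3344". This does not affect correctness, because any total order works as long as every node applies the same order to the same member list.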

discussion needed: What to do when constructing a cluster from some amount of stopped nodes with different Meta Storage configuration? Should it be overridden by the "init" command?

discussion needed: What if we are restarting a cluster and also introducing a new node? What if it is elected as the coordinator?

TODO: describe coordinator re-election.

Initial cluster setup

Initial set of nodes is configured, including the following properties:
1. List of all nodes in the initial cluster setup (provided by the IP Finder).
A Join Coordinator is elected (see "Join Coordinator election");
Join Coordinator generates a Cluster Tag, unless it already has one in its Vault (e.g. when an existing cluster is being restarted);
All other nodes connect to the Coordinator and provide the following information:
1. Ignite product version;
2. Cluster Tag, if any (if a node has obtained it at any time during its life);
3. Meta Storage Topology version (if a node has obtained it at any time during its life).
All of the aforementioned parameters get compared with the information stored on the Coordinator, and if all of the parameters are the same, the joining node is allowed into the cluster. Otherwise, the joining node is rejected.
Join Coordinator adds the new node to the list of validated nodes.
If the joining node is allowed to enter the topology, it receives the following parameters from the Coordinator:
1. Cluster Tag;
2. Meta Storage Topology version (if any, see "Cluster initialization").
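The validation step performed by the Coordinator could be sketched as follows. All names (JoinRequest, JoinCoordinatorSketch, tryAdmit) are hypothetical. The sketch also assumes that a property which the joining node has never obtained (represented as null) is skipped during comparison; the text above does not specify this case.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical payload sent by a joining node; null means "never obtained". */
final class JoinRequest {
    final String nodeName;
    final String productVersion;
    final String clusterTag;
    final Long metaStorageTopologyVersion;

    JoinRequest(String nodeName, String productVersion, String clusterTag, Long metaStorageTopologyVersion) {
        this.nodeName = nodeName;
        this.productVersion = productVersion;
        this.clusterTag = clusterTag;
        this.metaStorageTopologyVersion = metaStorageTopologyVersion;
    }
}

/** Hypothetical Coordinator-side validation (illustration only). */
final class JoinCoordinatorSketch {
    private final String productVersion;
    private final String clusterTag;
    private final Long metaStorageTopologyVersion; // null before the "init" command
    private final Set<String> validatedNodes = new LinkedHashSet<>();

    JoinCoordinatorSketch(String productVersion, String clusterTag, Long metaStorageTopologyVersion) {
        this.productVersion = productVersion;
        this.clusterTag = clusterTag;
        this.metaStorageTopologyVersion = metaStorageTopologyVersion;
    }

    /** Compares the request with the Coordinator's own data and records accepted nodes. */
    boolean tryAdmit(JoinRequest req) {
        boolean sameProduct = productVersion.equals(req.productVersion);
        // Assumption: properties the joining node does not have are not compared.
        boolean tagOk = req.clusterTag == null || req.clusterTag.equals(clusterTag);
        boolean versionOk = req.metaStorageTopologyVersion == null
                || req.metaStorageTopologyVersion.equals(metaStorageTopologyVersion);

        boolean accepted = sameProduct && tagOk && versionOk;
        if (accepted) {
            validatedNodes.add(req.nodeName);
        }
        return accepted;
    }

    /** Returns a read-only copy of the list of validated nodes. */
    Set<String> validatedNodes() {
        return Set.copyOf(validatedNodes);
    }
}
```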

discussion needed: What to do if the Coordinator dies during any step of the setup process?

Cluster initialization

After the cluster has been established, it remains in the "zombie" state, until the "init" command arrives.
"Init" command is sent by the administrator either directly to the Join Coordinator, or to any other node, in which case it should be redirected to the Join Coordinator.
The "init" command should specify the following information:
1. Human-readable part of the Cluster Tag;
2. List of addresses of the nodes that should host the Meta Storage Raft group (a.k.a. Meta Storage Configuration).
The Join Coordinator completes the creation of the Cluster Tag by generating the unique part and generates the initial Meta Storage Configuration Version property.
The Join Coordinator atomically broadcasts the Meta Storage Configuration to all valid nodes in the topology. If this step is successful, then Meta Storage is considered to be initialized and available.
The Join Coordinator persists the following information into the Meta Storage (therefore propagating it to all nodes):
1. Cluster Tag;
2. List of addresses of all nodes that have passed the initial validation;
3. Meta Storage Configuration Version.
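As an illustration of the data involved, the sketch below shows how the Coordinator might turn an "init" command into the state that is later persisted. The class names, the ":" separator inside the tag and the initial version value of 1 are assumptions. The atomic broadcast and the write to the Meta Storage are deliberately left out.

```java
import java.util.List;
import java.util.UUID;

/** Hypothetical representation of the "init" command (illustration only). */
final class InitCommand {
    final String tagName;                // human-readable part of the Cluster Tag
    final List<String> metaStorageNodes; // addresses of the Meta Storage Raft group

    InitCommand(String tagName, List<String> metaStorageNodes) {
        this.tagName = tagName;
        this.metaStorageNodes = List.copyOf(metaStorageNodes);
    }
}

/** Hypothetical state that the Coordinator persists after a successful "init". */
final class InitializedClusterState {
    final String clusterTag;
    final List<String> metaStorageNodes;
    final List<String> validatedNodes;
    final long configurationVersion;

    InitializedClusterState(String clusterTag, List<String> metaStorageNodes,
            List<String> validatedNodes, long configurationVersion) {
        this.clusterTag = clusterTag;
        this.metaStorageNodes = List.copyOf(metaStorageNodes);
        this.validatedNodes = List.copyOf(validatedNodes);
        this.configurationVersion = configurationVersion;
    }
}

final class InitCommandHandler {
    private InitCommandHandler() {
    }

    /** Completes the Cluster Tag and assigns the initial configuration version. */
    static InitializedClusterState handle(InitCommand cmd, List<String> validatedNodes) {
        if (cmd.metaStorageNodes.isEmpty()) {
            throw new IllegalArgumentException("At least one Meta Storage node is required");
        }
        // Assumption: the two tag parts are joined with ':'; the initial version is 1.
        String clusterTag = cmd.tagName + ":" + UUID.randomUUID();
        return new InitializedClusterState(clusterTag, cmd.metaStorageNodes, validatedNodes, 1L);
    }
}
```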

discussion needed: What to do if the Coordinator dies during any step of the initialization process?

New node join

This section describes a scenario when a new node wants to join an initialized cluster. Depending on the node configuration, there exist multiple possible scenarios:

Empty node joins a cluster

If an empty node tries to join a cluster, the following process is proposed:

It connects to a random node, sends the available local validation information and enters the topology if it gets accepted.
The following scenarios can then happen:
1. The random node is initialized. The joining node should then retrieve the information that was broadcast by the "init" command, become initialized and finish the join process.
2. The random node is empty because the cluster has not yet been initialized. In this case the node finishes the join process and remains in the "zombie" state until the "init" command arrives.
3. The random node is empty because it hasn't finished the join process itself. In this case the random node should send a corresponding message, and the joining node should choose another random node and repeat the process.
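The three scenarios above map to a simple decision function, sketched here. The enum constants and the class name EmptyNodeJoin are hypothetical and only mirror the wording of the list.

```java
/** Hypothetical decision logic for an empty joining node (illustration only). */
final class EmptyNodeJoin {
    /** State of the randomly chosen node, as seen by the joining node. */
    enum PeerState { INITIALIZED, EMPTY_CLUSTER_NOT_INITIALIZED, EMPTY_JOIN_IN_PROGRESS }

    /** What the joining node does next. */
    enum Action { FETCH_INIT_DATA_AND_FINISH, FINISH_AND_WAIT_FOR_INIT, RETRY_WITH_ANOTHER_NODE }

    private EmptyNodeJoin() {
    }

    static Action nextAction(PeerState peer) {
        switch (peer) {
            case INITIALIZED:
                // Scenario 1: retrieve the data broadcast by "init" and become initialized.
                return Action.FETCH_INIT_DATA_AND_FINISH;
            case EMPTY_CLUSTER_NOT_INITIALIZED:
                // Scenario 2: stay in the "zombie" state until "init" arrives.
                return Action.FINISH_AND_WAIT_FOR_INIT;
            case EMPTY_JOIN_IN_PROGRESS:
                // Scenario 3: the contacted node is still joining itself.
                return Action.RETRY_WITH_ANOTHER_NODE;
            default:
                throw new IllegalArgumentException("Unknown peer state: " + peer);
        }
    }
}
```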

Initialized node joins a cluster

If an initialized node tries to join a cluster, the following process is proposed:

It connects to a random node and sends the available local validation information (including the cluster tag and the Meta Storage version).
The following scenarios can then happen:
1. The random node is initialized and the cluster tags do not match. The joining node must be rejected.
2. The random node is initialized, the cluster tags match, local Meta Storage version is "smaller" than the remote. The node joins the topology and updates its Meta Storage configuration, thus ending the joining process.
3. The random node is initialized, the cluster tags match, local Meta Storage version is "larger" than the remote. The joining node should initiate a process similar to sending the "init" command. Discussion needed: what to do if another "init" command is running in parallel?
4. The random node is empty because the cluster has not yet been initialized. The joining node should initiate a process similar to sending the "init" command. Discussion needed: what to do if another "init" command is running in parallel?
5. The random node is empty because it hasn't finished the join process itself. In this case the random node should send a corresponding message, and the joining node should choose another random node and repeat the process.
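The five scenarios above can be condensed into one decision function, sketched below with hypothetical names. The list does not say what happens when the tags match and both versions are equal; the sketch treats this case as a plain join, which is an assumption.

```java
/** Hypothetical decision logic for an initialized joining node (illustration only). */
final class InitializedNodeJoin {
    /** State of the randomly chosen node, as seen by the joining node. */
    enum PeerState { INITIALIZED, EMPTY_CLUSTER_NOT_INITIALIZED, EMPTY_JOIN_IN_PROGRESS }

    /** What happens to the joining node next. */
    enum Action {
        REJECT,
        JOIN_AND_UPDATE_LOCAL_CONFIG,
        JOIN_WITHOUT_CHANGES, // assumption: equal versions, not covered by the text
        PROPAGATE_LOCAL_CONFIG, // "a process similar to sending the init command"
        RETRY_WITH_ANOTHER_NODE
    }

    private InitializedNodeJoin() {
    }

    /** The remote tag and version are ignored when the contacted node is empty. */
    static Action nextAction(PeerState peer, String localTag, long localVersion,
            String remoteTag, long remoteVersion) {
        switch (peer) {
            case EMPTY_JOIN_IN_PROGRESS:
                return Action.RETRY_WITH_ANOTHER_NODE;      // scenario 5
            case EMPTY_CLUSTER_NOT_INITIALIZED:
                return Action.PROPAGATE_LOCAL_CONFIG;       // scenario 4
            case INITIALIZED:
                if (!localTag.equals(remoteTag)) {
                    return Action.REJECT;                   // scenario 1
                }
                if (localVersion < remoteVersion) {
                    return Action.JOIN_AND_UPDATE_LOCAL_CONFIG; // scenario 2
                }
                if (localVersion > remoteVersion) {
                    return Action.PROPAGATE_LOCAL_CONFIG;   // scenario 3
                }
                return Action.JOIN_WITHOUT_CHANGES;
            default:
                throw new IllegalArgumentException("Unknown peer state: " + peer);
        }
    }
}
```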

Changes in API (WIP)

NetworkTopologyService

Current TopologyService will be renamed to NetworkTopologyService. It is proposed to extend this service to add validation handlers that will validate the joining nodes on the network level.

/**
 * Class for working with the cluster topology on the network level.
 */
public interface NetworkTopologyService {
    /**
     * This topology member.
     */
    ClusterNode localMember();

    /**
     * All topology members.
     */
    Collection<ClusterNode> allMembers();

    /**
     * Handlers for topology events (join, leave).
     */
    void addEventHandler(TopologyEventHandler handler);

    /**
     * Returns the member with the given network address, or {@code null} if there is no such member.
     */
    @Nullable ClusterNode getByAddress(NetworkAddress addr);

    /**
     * Handlers for validating a joining node.
     */
    void addValidationHandler(TopologyValidationHandler handler);
}
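The TopologyValidationHandler type is not defined in this document. Purely as an illustration of how registered handlers might be applied, the sketch below uses a hypothetical handler shape that receives the attributes of the joining node as a string map. The names ValidationHandlerSketch and ValidationHandlerRegistry, as well as the attribute map itself, are assumptions and do not describe the actual API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical handler shape; the proposal does not define this interface. */
@FunctionalInterface
interface ValidationHandlerSketch {
    /** Returns true if a node with the given attributes may join the topology. */
    boolean isValid(Map<String, String> joiningNodeAttributes);
}

/** Hypothetical registry that applies all handlers to a joining node. */
final class ValidationHandlerRegistry {
    private final List<ValidationHandlerSketch> handlers = new ArrayList<>();

    void addValidationHandler(ValidationHandlerSketch handler) {
        handlers.add(handler);
    }

    /** A node is accepted only if every registered handler accepts it. */
    boolean validate(Map<String, String> joiningNodeAttributes) {
        for (ValidationHandlerSketch handler : handlers) {
            if (!handler.isValid(joiningNodeAttributes)) {
                return false;
            }
        }
        return true;
    }
}
```

With this shape, a product version check could be registered as a lambda, for example attrs -> "3.0.0".equals(attrs.get("productVersion")), where the attribute key is again hypothetical.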

TopologyService

The new service will have the same API, but will work on top of the Meta Storage, and will provide methods to work with the list of validated nodes. In addition to that, it will perform the validation of incoming nodes against the Meta Storage, based on the registered validation handlers.

/**
 * Class for working with the cluster topology on the Meta Storage level. Only fully validated nodes are allowed to be present in such topology.
 */
public interface TopologyService {
    /**
     * This topology member.
     */
    ClusterNode localMember();

    /**
     * All topology members.
     */
    Collection<ClusterNode> allMembers();

    /**
     * Handlers for topology events (join, leave).
     */
    void addEventHandler(TopologyEventHandler handler);

    /**
     * Returns the member with the given network address, or {@code null} if there is no such member.
     */
    @Nullable ClusterNode getByAddress(NetworkAddress addr);

    /**
     * Handlers for validating a joining node.
     */
    void addValidationHandler(TopologyValidationHandler handler);
}

TopologyService will depend on the MessagingService (to respond and listen to validation requests) and on the MetaStorageManager (for interacting with the Meta Storage).

Risks and Assumptions

"Init" command is not fully specified and can influence the design.
Proposed implementation does not discuss message encryption and security credentials.
Two-layered topology view may be confusing to use.
