...

When building a cluster of Ignite nodes, users need to be able to establish some restrictions on the member nodes based on cluster invariants in order to avoid breaking the consistency of the cluster. Such restrictions may include: having the same product version across the cluster, having consistent table and memory configurations, enforcing a particular cluster state.

Description

Problem statement

This document describes the process of a new node joining a cluster, which includes a validation phase, where a set of rules is applied to determine whether the incoming node may enter the current topology. These validation rules may include node-local information (e.g. the node's product version and the cluster tag) as well as cluster-wide information (e.g. the data encryption algorithm; discussion needed: are there any properties that need to be retrieved from the Meta Storage?), which means that the validation component may require access to the Meta Storage (it is assumed that the Meta Storage contains the consistent cluster-wide information, unless some other mechanism is proposed). The problem is that, according to the Node Lifecycle description, a cluster can exist in a "zombie" state, during which the Meta Storage is unavailable. This means that the validation process can be split into 2 steps:

...

  1. Where the whole process will happen: on the joining node itself or on an arbitrary remote node.
  2. How to mitigate possible security issues if a not yet fully validated node gets access to the Meta Storage.
  3. How to deal with different configurations of the Meta Storage: the "most recent" configuration should be consistently delivered to all nodes in a cluster.

...

  1. It should deliver the following information: addresses of the nodes that host the Meta Storage Raft group, Meta Storage Topology version and a cluster tag (described below).
  2. It should deliver this information atomically, i.e. either all nodes enter the "active" state or none do. As a possible solution, it can be implemented similarly to a two-phase commit: first, a "prepare" message is broadcast to all nodes in the current topology. The initiator node remembers the topology members at the start of the prepare phase and restarts the operation (or sends additional messages) until the topology is stable. After the prepare phase is finished, the commit message is broadcast. Until the commit phase finishes, no new nodes are allowed to enter the cluster.
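The retry-until-stable two-phase broadcast described above can be sketched as follows. This is an illustrative model only: `send` and `current_topology` are hypothetical callbacks, not actual Ignite networking APIs.

```python
def atomic_broadcast(send, current_topology, payload, max_attempts=10):
    """Deliver `payload` to every node atomically, two-phase-commit style.

    `send(node, phase, payload)` returns True on acknowledgement;
    `current_topology()` returns the nodes currently in the topology.
    """
    for _ in range(max_attempts):
        members = frozenset(current_topology())
        # Phase 1: prepare. Every member must acknowledge.
        if not all(send(node, "prepare", payload) for node in members):
            continue  # a member failed to acknowledge; retry with a fresh snapshot
        # The topology must not have changed during the prepare phase.
        if frozenset(current_topology()) != members:
            continue  # restart the operation, as required above
        # Phase 2: commit. New nodes are barred from joining until this completes.
        for node in members:
            send(node, "commit", payload)
        return True
    return False
```

In this sketch, an unstable topology simply restarts the operation; the alternative mentioned above (sending additional prepare messages only to newly appeared members) would avoid re-contacting nodes that already acknowledged.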

Initialized and empty nodes

This document uses a notion of "initialized" and "empty" nodes. An initialized node is a node that has received the "init" message at some point in its lifetime and therefore possesses the cluster tag and the Meta Storage Topology version. An empty node is a node that has never received the "init" command and does not possess the aforementioned properties.

Meta Storage Topology version

Meta Storage Topology version is a property that should be used to compute the most "recent" state of a given Meta Storage configuration. At the moment of writing, Meta Storage configuration consists of a list of cluster node names that host the Meta Storage Raft group. A possible implementation can be a monotonically increasing counter, which is incremented each time this list is updated.
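A minimal sketch of this counter, assuming it is bumped only on an effective change to the hosting-node list (class and field names are illustrative, not the actual implementation):

```python
class MetaStorageConfig:
    """Hypothetical holder of the Meta Storage configuration and its version."""

    def __init__(self, hosting_nodes):
        self.hosting_nodes = list(hosting_nodes)
        self.topology_version = 1  # initial version, assigned on "init"

    def update_hosting_nodes(self, new_nodes):
        new_nodes = list(new_nodes)
        if new_nodes != self.hosting_nodes:
            self.hosting_nodes = new_nodes
            self.topology_version += 1  # monotonically increasing counter
        return self.topology_version
```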

Join Coordinator

The node join process is proposed to be made centralized: a single node is granted the role of the Join Coordinator and is responsible for the following:

  1. Every new joining node gets redirected to the Coordinator to get validated and to obtain the Meta Storage configuration.

...

  2. The "init" command can be sent to the Coordinator to then be atomically broadcast.

Cluster Tag

...

A cluster tag is a string that uniquely identifies a cluster (e.g. a UUID). It is generated once per cluster and is distributed across the nodes during the "init" phase. The purpose of a cluster tag is to determine whether a joining node used to be a member of another cluster, in which case its Meta Storage Topology version is not comparable and the joining node should be rejected. Together with the Meta Storage Topology version, it creates a partial ordering that makes it possible to compare different configuration versions.
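The partial ordering can be sketched as a comparison function that only yields a result within the same cluster tag; anything else is incomparable and leads to rejection (names are illustrative):

```python
def compare_configs(tag_a, ver_a, tag_b, ver_b):
    """Compare two (cluster tag, topology version) pairs.

    Returns -1, 0 or 1 when the versions are comparable, or None when the
    tags differ, i.e. the configurations come from different clusters.
    """
    if tag_a != tag_b:
        return None  # incomparable: the node belonged to another cluster
    return (ver_a > ver_b) - (ver_a < ver_b)
```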

Validation approaches

Local validation

The local validation approach requires the joining node to retrieve some information from a random node or the Meta Storage and decide whether to join the cluster based on that information.

This approach has the following pros and cons:

  1. When used in a rolling upgrade scenario, it might be easier to maintain backward compatibility: a newer joining node already knows about the requirements of the older nodes in the cluster.
  2. Possible security issues: if a node is able to allow itself to join, it might be easier to compromise the cluster.

Remote validation

The remote validation approach requires the joining node to send some information about itself to a remote node, which decides whether or not to allow the new node to join.

This approach has the following pros and cons:

  1. This approach is used in Ignite 2, which may be more familiar to users and developers.
  2. It may be more secure, since a node can't join without notifying at least one valid node.
  3. Harder to support backward compatibility.

Discussion needed: At the time of writing this document, it is assumed that the validation protocol is going to be remote.

Implementation details

From a bird's eye view, the whole join protocol looks like the following:

  1. A joining node contacts a random node in a cluster and performs the "pre-init" validation by sending some locally available information.
  2. If the first step is successful, the node gets validated against the distributed properties (depending on the cluster state).
  3. If the second step is successful, the node obtains necessary information to set up the Meta Storage and other components.

Depending on the state of the cluster and the joining node, there can be multiple scenarios that influence the behavior of the protocol. They are described below.  

Initial common logic

A cluster tag should consist of two parts:

  1. Human-readable part: a string property that is set by the system administrator. Its purpose is to make the debugging and error reporting easier.
  2. Unique part: a generated unique string (e.g. a UUID). Its purpose is to ensure that cluster tags are different between different clusters.
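One possible shape of such a two-part tag, sketched below; the class and field names are assumptions for illustration, not the actual Ignite types:

```python
import uuid
from dataclasses import dataclass

@dataclass(frozen=True)
class ClusterTag:
    human_readable: str  # set by the administrator; eases debugging and error reporting
    unique: str          # generated once; distinguishes clusters with the same name

    @staticmethod
    def create(human_readable):
        # The unique part is generated exactly once, during cluster initialization.
        return ClusterTag(human_readable, str(uuid.uuid4()))
```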

Implementation details

Join Coordinator election

Before the nodes can start joining a cluster, a node should be elected as the Join Coordinator. For the sake of simplicity, the following algorithm can be proposed, which can later be replaced with something more sophisticated:

  1. Given a list of initial cluster members, choose the "smallest" address (for example, using an alphanumeric order); that node is implicitly considered the Join Coordinator. This requires the IP Finder configuration (used to obtain the initial cluster member list) to be identical on all initial cluster members.
  2. If the "smallest" address is unavailable, all other nodes should fail to start after a timeout and must be restarted manually.
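The proposed election rule reduces to a deterministic minimum over the member list; every node must evaluate it over the same list for the result to agree:

```python
def elect_coordinator(initial_members):
    """Pick the Join Coordinator deterministically: the lexicographically
    smallest address in the initial cluster member list."""
    if not initial_members:
        raise ValueError("initial cluster member list must not be empty")
    return min(initial_members)  # alphanumeric (lexicographic) order
```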

discussion needed: What to do when constructing a cluster from some amount of stopped nodes with different Meta Storage configuration? Should it be overridden by the "init" command?

discussion needed: What if we are restarting a cluster and also introducing a new node? What if that node is elected as the coordinator?

TODO: describe coordinator re-election.

Initial cluster setup

  1. Initial set of nodes is configured, including the following properties:
    1. List of all nodes in the initial cluster setup (provided by the IP Finder).
  2. A Join Coordinator is elected (see "Join Coordinator election");
  3. Join Coordinator generates a Cluster Tag, if it doesn't have it already in its Vault (e.g. an existing cluster is being restarted);
  4. All other nodes connect to the Coordinator and provide the following information:
    1. Ignite product version;
    2. Cluster Tag, if any (if a node has obtained it at any time during its life);
    3. Meta Storage Topology version (if a node has obtained it at any time during its life).
  5. All of the aforementioned parameters are compared with the information stored on the Coordinator; if all of the parameters match, the joining node is allowed into the cluster. Otherwise, the joining node is rejected.
  6. Join Coordinator adds the new node to the list of validated nodes.
  7. If the joining node is allowed to enter the topology, it receives the following parameters from the Coordinator:
    1. Cluster Tag;
    2. Meta Storage Topology version (if any, see "Cluster initialization").
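The comparison in step 5 can be sketched as follows. This is an assumption-laden model: the dict-based parameter exchange and the treatment of absent (`None`) tag/version on empty nodes as "nothing to compare" are illustrative choices, not the specified protocol.

```python
def validate_join(coordinator, joining):
    """Both arguments are dicts with keys: 'version', 'cluster_tag', 'ms_version'.

    'cluster_tag' / 'ms_version' may be None on the joining side (an empty
    node that never received "init"); such absent values are not compared.
    """
    if joining["version"] != coordinator["version"]:
        return False  # product versions must match across the cluster
    for key in ("cluster_tag", "ms_version"):
        if joining[key] is not None and joining[key] != coordinator[key]:
            return False  # the node carries state from a different cluster
    return True
```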

discussion needed: What to do if the Coordinator dies during any step of the setup process.

Cluster initialization

  1. After the cluster has been established, it remains in the "zombie" state, until the "init" command arrives.
  2. The "init" command is sent by the administrator either directly to the Join Coordinator or to any other node, in which case it is redirected to the Join Coordinator.
  3. The "init" command should specify the following information:
    1. Human-readable part of the Cluster Tag;
    2. List of addresses of the nodes that should host the Meta Storage Raft group (a.k.a. Meta Storage Configuration).
  4. The Join Coordinator completes the creation of the Cluster Tag by generating the unique part and generates the initial Meta Storage Configuration Version property.
  5. The Join Coordinator atomically broadcasts the Meta Storage Configuration to all valid nodes in the topology. If this step is successful, then Meta Storage is considered to be initialized and available.
  6. The Join Coordinator persists the following information into the Meta Storage (therefore propagating it to all nodes):
    1. Cluster Tag;
    2. List of addresses of all nodes that have passed the initial validation;
    3. Meta Storage Configuration Version.
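Steps 4-6 above can be sketched as a single handler on the Coordinator. The helper callbacks `atomic_broadcast` and `meta_storage_put`, as well as the key names, are hypothetical placeholders for whatever the real components provide:

```python
import uuid

def handle_init(human_readable_tag, ms_node_addresses, validated_nodes,
                atomic_broadcast, meta_storage_put):
    # Step 4: complete the Cluster Tag and create the initial version property.
    cluster_tag = (human_readable_tag, str(uuid.uuid4()))
    ms_version = 1
    config = {"tag": cluster_tag,
              "ms_nodes": list(ms_node_addresses),
              "ms_version": ms_version}
    # Step 5: atomically deliver the configuration to all valid nodes.
    if not atomic_broadcast(config):
        raise RuntimeError("failed to atomically deliver the Meta Storage configuration")
    # Step 6: Meta Storage is now available; persist the cluster-wide properties.
    meta_storage_put("cluster.tag", cluster_tag)
    meta_storage_put("cluster.validatedNodes", list(validated_nodes))
    meta_storage_put("metastorage.configurationVersion", ms_version)
    return config
```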

discussion needed: What to do if the Coordinator dies during any step of the initialization process.

New node join

This section describes a scenario where a new node wants to join an initialized cluster. Depending on the node configuration, there exist multiple possible scenarios.

Regardless of the node state, the initial procedure goes as follows: the joining node tries to enter the topology. It is possible to piggyback on the transport of the membership protocol in order to exchange validation messages before allowing membership messages to be sent (similar to a handshake protocol). During this step the joining node sends some information (cluster tag, Meta Storage version, node version) to a random node and gets validated (more details below). After this step is complete, the joining node becomes visible through the Topology Service, thus establishing the invariant that the visible topology always consists of nodes that have passed the first validation step.

Possible issues: there can be a race condition when multiple conflicting nodes join at the same time, in which case only the first node to join will be valid. This can be considered expected behavior, because such situations can only occur during the initial setup of a cluster, which is a manual process and requires manual intervention anyway.

Empty node joins a cluster

If an empty node tries to join a cluster the following process is proposed:

  1. It connects to a random node, sends the available local validation information and enters the topology, if it gets accepted.
  2. The following scenarios can then happen:
    1. The random node is initialized. The joining node should then retrieve the information that was broadcast by the "init" command, become initialized and finish the join process.
    2. The random node is empty because the cluster has not yet been initialized. In this case the node finishes the join process and remains in the "zombie" state until the "init" command arrives.
    3. The random node is empty because it hasn't finished the join process itself. In this case the random node should send a corresponding message, and the joining node should choose another random node and repeat the process.
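The three scenarios above amount to a retry loop on the joining side, sketched below; the state names and the `handle_join` response shape are illustrative, not the real protocol messages:

```python
def join_as_empty_node(pick_random_node, local_info):
    """Keep contacting random nodes until one answers definitively.

    `pick_random_node()` returns a peer with a `handle_join(info)` method
    returning one of: ("initialized", init_data), ("zombie", None), or
    ("joining", None) -- the last meaning the peer is still joining itself.
    """
    while True:
        peer = pick_random_node()
        state, init_data = peer.handle_join(local_info)
        if state == "initialized":
            return init_data   # scenario 1: adopt the data broadcast by "init"
        if state == "zombie":
            return None        # scenario 2: stay "zombie" until "init" arrives
        # scenario 3: the peer has not finished joining; try another random node
```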

Initialized node joins a cluster

If an initialized node tries to join a cluster the following process is proposed:

...

  1. IEP-73: Node startup
  2. https://github.com/apache/ignite-3/blob/main/modules/runner/README.md
  3. IEP-67: Networking module
  4. IEP-61: Common Replication Infrastructure

Tickets

ASF JIRA: IGNITE-15114