Data consistency (wip)

The document describes principles used for achieving consistency between data copies.

Cluster topology

Cluster topology is a ordered set of client and server nodes at a certain point in time. Each stable topology is assigned a topology version number - monotonically increasing pair of counters (majorVer, minorVer). Each node maintains a list of nodes in the topology and it is the same on all nodes for given version. It is important to note that each node receives information about the topology change eventulally, that is, independently of the other node at the different time.

Affinity function

A set of all possible cache keys is divided between partitions. A partition is an indivisible data set in a grid, a subset of a total set of keys. The key uniquely determines the partition in which it will be stored.

Every change in topology causes new topology version. For each new version of the topology it is necessary to calculate how the data is distributed among the nodes in topology. Those nodes that are responsible for storing data are called affinity nodes. Each node owns a subset of partitions. If a new affinity node has joined then it is necessary to rebalance part of the data to the new node.

The distribution (sharding) of data on the nodes of the grid is defined by affinity function.

First, affinity function uniquely defines a partition for key: P(key) = partition

This is a static property and cannot change in time.

Second, affinity function defines which partitions belongs to the node: F(topology, part) = list(nodes)

At different points in time a partition can belong to different nodes. This is a dynamic property of a partition.

The number of nodes in this list is equal to the number of copies of the partition. Recall that one of the fundamental properties of a grid is its high availability. With the loss of a certain number of nodes, complete data loss does not occur and the load is not interrupted. The need to store multiple copies is due to the implementation of this property.

The first node in the list is the so-called primary node, has a special coordinating role for all operations on the partition. The remaining are backup nodes. In the event of a loss of a primary node, one of the backup nodes becomes the new primary.

Todo PIC for some AF calculated state.

An important property of affinity function is the uniform distribution of partitions over nodes.

The default implementation uses the rendezvous hashing algorithm https://en.wikipedia.org/wiki/Rendezvous_hashing

Due to implementation of these properties linear scalability is achieved when adding new nodes to the grid.

Re-balance

Re-balance - a procedure for exchanging data between nodes to ensure uniform distribution of data (redistribution of data between nodes) [1] and to ensure data consistency [2].

[1] Partition state based re-balance
In the first case, the reason for the rebalance is a change in the list of nodes - owners of the partition. If a new node has joined the topology then part of the partitions will be reassigned to it due to the affinity function property of the uniform distribution of partitions among the nodes. If the node is left then the re-balance will be triggered to ensure the availability of the required number of copies of the partition in the grid.

[2] Counter based re-balance
In the second case, which is relevant only for the enabled persistence mode, re-balance is used to restore the consistency of the data if the node(s) was absent for some time in the topology under the load.
When the node re-joins the topology, the partition update counter will be analyzed to determine how much the partition has “lagged” from other copies.
Depending on how much the partition is behind, the second type of rebalance can use the write-ahead log as a data source. In this case, the re-balance can be performed without a complete reloading of the partition and is called “historical”.
Another scenario for the occurrence of [2] is the activation of the grid, which was turned off accidentally (under load)
Re-balance is the exchange of supply and demand messages between nodes. A message demand sends a node on which the partition is not consistent. Supply messages contain the next update batch and are sent in response to demand messages.

Page tree

Data consistency (wip)