
This document describes the principles used to achieve consistency between data copies.

Cluster topology

Cluster topology is an ordered set of client and server nodes at a certain point in time. Each stable topology is assigned a topology version number - a monotonically increasing pair of counters (majorVer, minorVer). Each node maintains the list of nodes in the topology, and it is the same on all nodes for a given version. It is important to note that each node receives information about a topology change eventually, that is, independently of the other nodes and possibly at a different time.
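
Such a version pair can be modeled as a small comparable value object. The sketch below is purely illustrative; the class and field names are assumptions, not the actual Ignite implementation:

// Illustrative model of a topology version: a pair of counters ordered
// first by majorVer (node join/leave) and then by minorVer (e.g. a cache start).
public final class TopologyVersion implements Comparable<TopologyVersion> {
    private final long majorVer;
    private final int minorVer;

    public TopologyVersion(long majorVer, int minorVer) {
        this.majorVer = majorVer;
        this.minorVer = minorVer;
    }

    @Override public int compareTo(TopologyVersion other) {
        int cmp = Long.compare(majorVer, other.majorVer);
        return cmp != 0 ? cmp : Integer.compare(minorVer, other.minorVer);
    }

    @Override public String toString() {
        return "[" + majorVer + "," + minorVer + "]";
    }
}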

Affinity function

The set of all possible cache keys is divided into partitions. A partition is an indivisible data set in the grid, a subset of the total set of keys. The key uniquely determines the partition in which it will be stored.

Every change in topology produces a new topology version. For each new version it is necessary to calculate how the data is distributed among the nodes in the topology. The nodes responsible for storing data are called affinity nodes. Each node owns a subset of partitions. If a new affinity node has joined, part of the data must be rebalanced to the new node.

The distribution (sharding) of data across the nodes of the grid is defined by the affinity function.

First, the affinity function uniquely defines a partition for a key: P(key) = partition

This is a static property and does not change over time.

Second, the affinity function defines the nodes to which a partition belongs: F(topology, part) = list(nodes)

At different points in time a partition can belong to different nodes; this is a dynamic property of a partition.
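
These two mappings can be summarized as a small interface. This is an illustrative sketch, not the actual Ignite AffinityFunction API; it reuses the TopologyVersion sketch above and leaves the node type as a parameter:

import java.util.List;

// Simplified view of the affinity function: a static key -> partition mapping
// and a per-topology-version partition -> nodes assignment.
public interface Affinity<N> {
    /** P(key) = partition: a static property, independent of the topology. */
    int partition(Object key);

    /** F(topology, part) = list(nodes): the first node is the primary, the rest are backups. */
    List<N> nodes(int part, TopologyVersion topVer);
}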

The number of nodes in this list is equal to the number of copies of the partition. Recall that one of the fundamental properties of the grid is high availability: with the loss of a certain number of nodes, no complete data loss occurs and the load is not interrupted. Multiple copies are stored precisely to provide this property.

The first node in the list is the so-called primary node; it has a special coordinating role for all operations on the partition. The remaining nodes are backup nodes. In the event of a loss of the primary node, one of the backup nodes becomes the new primary.

Todo PIC for some AF calculated state.

An important property of the affinity function is the uniform distribution of partitions over nodes.

The default implementation uses the rendezvous hashing algorithm https://en.wikipedia.org/wiki/Rendezvous_hashing
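
The idea of rendezvous (highest random weight) hashing can be shown with a short sketch: for a given partition every node gets a pseudo-random weight derived from the (node, partition) pair, and the nodes with the highest weights become the owners. The names and the hash mixing below are assumptions for illustration, not the production implementation:

import java.util.*;

// Illustrative highest-random-weight (rendezvous) assignment of a partition to nodes.
final class RendezvousSketch {
    /** Picks 'copies' owners for a partition; the first one acts as the primary. */
    static List<String> owners(int part, Collection<String> nodeIds, int copies) {
        List<String> sorted = new ArrayList<>(nodeIds);
        // A node's weight for this partition is a hash of the (nodeId, partition) pair.
        sorted.sort(Comparator.comparingLong((String id) -> mix(Objects.hash(id, part))).reversed());
        return sorted.subList(0, Math.min(copies, sorted.size()));
    }

    /** Cheap 64-bit mixer to spread the combined hash (illustrative). */
    private static long mix(long h) {
        h ^= h >>> 33;
        h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33;
        return h;
    }
}

With this scheme a node joining or leaving only changes the ownership of those partitions for which that node is (or was) among the highest-weighted candidates, so the amount of data moved during re-balance stays close to the minimum.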

Due to these properties, linear scalability is achieved when new nodes are added to the grid.

Re-balance

Re-balance is a procedure for exchanging data between nodes to ensure the uniform distribution of data (redistribution of data between nodes) [1] and to ensure data consistency [2].

[1] Partition state based re-balance
In the first case, the reason for the re-balance is a change in the list of nodes that own a partition. If a new node has joined the topology, then part of the partitions will be reassigned to it due to the affinity function's property of uniform distribution of partitions among the nodes. If a node has left, then a re-balance is triggered to restore the required number of copies of each partition in the grid.

[2] Counter based re-balance
In the second case, which is relevant only when persistence is enabled, re-balance is used to restore data consistency if a node was absent from the topology for some time while the cluster was under load.
When the node re-joins the topology, the partition update counter is analyzed to determine how far the partition has "lagged" behind the other copies.
Depending on how far the partition lags behind, the second type of re-balance can use the write-ahead log as a data source. In this case the re-balance can be performed without a complete reloading of the partition and is called "historical".
Another scenario for [2] is the activation of the grid, where some nodes may have stale data.
Re-balance is implemented as an exchange of demand and supply messages between nodes. A demand message is sent from the node where the partition has stale data to a node with up-to-date data. Supply messages are sent in response to the demand message.
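
The shape of these messages can be sketched roughly as follows; the class and field names are assumptions for illustration, not the actual Ignite message classes:

import java.util.List;
import java.util.Map;

// Illustrative shape of the re-balance protocol messages (not the real Ignite classes).

// Sent by the node whose copy is stale: which partitions it needs and from which
// update counter it can continue (a non-zero counter allows historical re-balance).
class DemandMessage {
    final Map<Integer, Long> partCounters; // partition -> lwm of the stale copy
    DemandMessage(Map<Integer, Long> partCounters) { this.partCounters = partCounters; }
}

// Sent in response by a node owning up-to-date data: a batch of entries
// (or WAL records in the historical case) for the requested partitions.
class SupplyMessage {
    final Map<Integer, List<byte[]>> entriesByPart; // partition -> serialized rows
    final boolean last;                             // true for the final batch
    SupplyMessage(Map<Integer, List<byte[]>> entriesByPart, boolean last) {
        this.entriesByPart = entriesByPart;
        this.last = last;
    }
}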

Update counter structure

Update counter = [lwm, out-of-order updates tracking set, hwm]

Consider an example of counter status:

lwm = 5, ooo = (6,1) (8,2), hwm = 11

lwm - low water mark - the counter of the last applied sequential update; all updates from 1 to 5 are present in this partition.
ooo - the set of updates applied out of order, stored as (start, delta) ranges; in the example updates 7, 9 and 10 are applied, while updates 6, 8 and 11 are missing.
hwm = 11 - high water mark - the number of updates reserved on the partition, some of which are still "in flight".
highestApplied = 10 - a derived property: the applied update with the maximum sequence number.

Hwm is incremented only on the primary node, which ensures the monotonicity of the counters and the invariant lwm <= hwm.

Because transactions updating the same partition can be reordered, updates may be applied in an arbitrary order, which is why gaps are possible in the sequence of updates. To track this, a special structure was added to the counter that records updates applied out of order.
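
A simplified model of such a counter is sketched below. It only illustrates the lwm / out-of-order ranges / hwm bookkeeping described above and is not the actual Ignite implementation:

import java.util.TreeMap;

// Illustrative per-partition update counter.
// lwm - all updates with counter <= lwm have been applied.
// ooo - ranges applied out of order, keyed by range start (exclusive): start -> delta.
// hwm - number of updates reserved on the primary; some may still be "in flight".
class PartitionUpdateCounter {
    private long lwm;
    private long hwm;
    private final TreeMap<Long, Long> ooo = new TreeMap<>();

    /** Called on the primary node to reserve 'delta' counter values; returns the range start. */
    synchronized long reserve(long delta) {
        long start = hwm;
        hwm += delta;
        return start;
    }

    /** Called when the updates (start, start + delta] have been applied locally. */
    synchronized void update(long start, long delta) {
        if (start > lwm)
            ooo.put(start, delta);                 // applied out of order, a gap remains below
        else
            lwm = Math.max(lwm, start + delta);

        // Fold out-of-order ranges that are now reachable from lwm without gaps.
        for (Long s = ooo.floorKey(lwm); s != null; s = ooo.floorKey(lwm))
            lwm = Math.max(lwm, s + ooo.remove(s));
    }

    synchronized long lwm() { return lwm; }
    synchronized long hwm() { return hwm; }
}

Replaying the example above: after updates 1..5 the counter has lwm = 5; applying the ranges (6,1) and (8,2) out of order records the gaps without moving lwm; as soon as update 6 is applied, lwm jumps to 7.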

Update counters are used to detect the state when a backup lags behind the primary. A lag can occur, for example, if a node left the topology under load and missed updates to one of the partitions it owns. If the number of applied updates is the same for each copy of a partition, then the copies are identical.
If the counter of one of the copies lags by k, it means that k missed updates must be applied to synchronize the copies. This is only relevant for the persistent mode.
For in-memory caches, a lagging partition is always restored from scratch.
Thus, it is extremely important to track update counters correctly, since they determine whether a re-balance [2] will be started to restore the integrity of partitions.
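
The resulting decision can be summarized with a small sketch; the method and parameter names (in particular oldestWalCounter) are assumptions, and the real check also depends on WAL history availability and configuration:

// Illustrative decision of how a rejoining backup synchronizes a partition copy.
final class RebalanceDecision {
    enum Kind { NONE, HISTORICAL, FULL }

    /**
     * localLwm         - update counter of the local (possibly stale) copy.
     * primaryLwm       - update counter of the up-to-date copy on the primary.
     * oldestWalCounter - the earliest counter still available in the supplier's WAL history.
     */
    static Kind decide(long localLwm, long primaryLwm, long oldestWalCounter) {
        if (localLwm == primaryLwm)
            return Kind.NONE;          // the copies are identical, nothing to do
        if (localLwm >= oldestWalCounter)
            return Kind.HISTORICAL;    // replay only the missed updates from the WAL
        return Kind.FULL;              // history is not available, reload the whole partition
    }
}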


