This document describes the principles used to achieve consistency between data copies.

First, let's give several definitions before moving on to more complex topics.

Cluster topology

Cluster topology is an ordered set of client and server nodes at a certain point in time. Each stable topology is assigned a topology version number: a monotonically increasing pair of counters (majorVer, minorVer). Each node maintains a list of nodes in the topology, and this list is the same on all nodes for a given version. It is important to note that each node receives information about a topology change eventually, that is, independently of the other nodes and possibly at a different time.
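
To make the ordering concrete, a topology version can be modeled as a comparable pair of counters. The sketch below is simplified and only illustrates the ordering property; it is not the actual Ignite implementation (internally this role is played by AffinityTopologyVersion):

public class TopologyVersion implements Comparable<TopologyVersion> {
    /** Incremented every time a node joins or leaves the cluster. */
    private final long majorVer;

    /** Incremented on minor changes within the same major version. */
    private final int minorVer;

    public TopologyVersion(long majorVer, int minorVer) {
        this.majorVer = majorVer;
        this.minorVer = minorVer;
    }

    /** Versions are ordered by majorVer first, then by minorVer. */
    @Override public int compareTo(TopologyVersion other) {
        int cmp = Long.compare(majorVer, other.majorVer);

        return cmp != 0 ? cmp : Integer.compare(minorVer, other.minorVer);
    }
}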

...

Let's look at a possible partition distribution between nodes for a topology with 3 server nodes s1, s2, s3 and 6 partitions 0, 1, 2, 3, 4, 5.

The affinity function F could produce an assignment something like this (in an ideal world):

F(0) = s1, s2, s3

...
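
For illustration, a naive affinity function could place each partition and its backups on consecutive nodes of the ordered server list. The sketch below is purely hypothetical (Ignite's default rendezvous affinity is more sophisticated), but it produces an "ideal" uniform assignment like the one above:

import java.util.ArrayList;
import java.util.List;

public class NaiveAffinity {
    /** Assigns each partition to `copies` consecutive nodes, starting at p % nodes.size().
     *  The first node in each list is the primary, the rest are backups. Illustrative only. */
    public static List<List<String>> assign(List<String> nodes, int partitions, int copies) {
        List<List<String>> assignment = new ArrayList<>();

        for (int p = 0; p < partitions; p++) {
            List<String> owners = new ArrayList<>();

            for (int i = 0; i < copies; i++)
                owners.add(nodes.get((p + i) % nodes.size()));

            assignment.add(owners);
        }

        return assignment;
    }
}

// assign(List.of("s1", "s2", "s3"), 6, 3).get(0) returns [s1, s2, s3], matching F(0) above.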

Rebalancing is a procedure for exchanging data between nodes to ensure a uniform distribution of data according to the affinity assignment [1] and to ensure data consistency if copies diverge [2].

[1] Affinity-based rebalancing
In the first case, the reason for the rebalancing is a change in the list of nodes that own the partition. If a new node has joined the topology, then part of the partitions will be reassigned to it due to the affinity function's property of uniform distribution of partitions among the nodes. If a node has left the topology, then rebalancing will be triggered to ensure the availability of the required number of copies of each partition in the grid.

[2] Update counter based rebalancing
In the second case, which is relevant only for the persistence mode, rebalancing is used to restore the consistency of the data if a node was absent from the topology for some time while the cluster was under load.
When the node re-joins the topology, it will have stale data, and the partition update counters will be analyzed to determine how far its partitions have “lagged” behind the other copies.
Depending on how far the partitions are lagging behind, the second type of rebalancing can use the write-ahead log as a source of update history. In this case, the rebalancing can be performed without completely reloading the partition and is called “historical”.
Another scenario for the occurrence of [2] is the activation of the grid, where some nodes may have stale data.
Rebalancing is implemented as an exchange of supply and demand messages between nodes. A demand message is sent from the node where a partition has stale data to a node with up-to-date data. A supply message is sent in response to the demand message; after processing a supply message, the next demand message is sent, and so on, until there is nothing more to rebalance.
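
The following sketch shows the demand/supply loop from the point of view of the demander node. All types and methods here (Node, DemandMessage, SupplyMessage, send, applyBatch) are hypothetical placeholders used only to illustrate the message flow; they are not the actual Ignite preloader API:

import java.util.List;

abstract class RebalanceSketch {
    /** Placeholder for a cluster node. */
    interface Node {}

    /** Demand message: "send me updates for partition partId starting from counter fromCounter". */
    record DemandMessage(int partId, long fromCounter) {}

    /** Supply message: a batch of entries plus a flag indicating whether this is the last batch. */
    interface SupplyMessage {
        List<Object> entries();
        boolean last();
        long nextCounter();
    }

    /** Sends a demand message and waits for the matching supply message (placeholder). */
    abstract SupplyMessage send(Node supplier, DemandMessage demand);

    /** Applies a batch of entries to the local partition copy (placeholder). */
    abstract void applyBatch(List<Object> entries);

    /** Demander-side loop: exchange demand/supply messages until nothing is left to rebalance. */
    void rebalancePartition(int partId, long fromCounter, Node supplier) {
        DemandMessage demand = new DemandMessage(partId, fromCounter);

        while (true) {
            SupplyMessage supply = send(supplier, demand);

            applyBatch(supply.entries());

            if (supply.last())
                break;

            demand = new DemandMessage(partId, supply.nextCounter());
        }
    }
}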

Update counter structure and flow

Each update to a partition receives a sequential number for the purpose of ensuring data consistency between copies. A special structure called the partition update counter is used to assign and increment update counters.

...

/** Queue of applied out of order counter updates. */
private SortedSet<Range> oomQueue = new TreeSet<>();

The low watermark, or update counter, is used to track sequential updates. It is only incremented when the corresponding update is recorded to durable storage and no missed updates with a lesser counter value exist.

...

The high watermark, or reservation counter, is incremented for each pending update, which is not even guaranteed to succeed. The reservation counter is assigned during the tx prepare phase.

The out-of-order update queue is used, literally, for recording updates applied out of order. This is possible due to the fact that updates with a higher reservation counter could be applied before updates with a lesser reservation counter, causing gaps in the update sequence.

There is an important invariant which can be derived from the statements above: HWM >= LWM.
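
A simplified sketch of a partition update counter is shown below. It only illustrates how LWM, HWM and the out-of-order queue interact during reserve and update; the names and method signatures are illustrative and do not match the real Ignite implementation:

import java.util.SortedSet;
import java.util.TreeSet;

class PartitionUpdateCounterSketch {
    /** Half-open range [start, start + delta) of counter values applied out of order. */
    private static class Range implements Comparable<Range> {
        final long start;
        final long delta;

        Range(long start, long delta) {
            this.start = start;
            this.delta = delta;
        }

        @Override public int compareTo(Range o) {
            return Long.compare(start, o.start);
        }
    }

    /** Low watermark: all updates up to lwm are applied with no gaps. */
    private long lwm;

    /** High watermark: reservation counter, moved forward on prepare. */
    private long hwm;

    /** Queue of applied out of order counter updates. */
    private final SortedSet<Range> oomQueue = new TreeSet<>();

    /** Reserves delta counter values during the prepare phase and returns the start of the reserved range. */
    public synchronized long reserve(long delta) {
        long start = hwm;
        hwm += delta;
        return start;
    }

    /** Applies a previously reserved range [start, start + delta) on commit. */
    public synchronized void update(long start, long delta) {
        if (start > lwm) {
            // A gap exists below this range: remember it and apply later.
            oomQueue.add(new Range(start, delta));
            return;
        }

        lwm = Math.max(lwm, start + delta);

        // Advance LWM through any queued ranges that became contiguous.
        while (!oomQueue.isEmpty() && oomQueue.first().start <= lwm) {
            Range r = oomQueue.first();
            oomQueue.remove(r);
            lwm = Math.max(lwm, r.start + r.delta);
        }
    }
}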

Update counter flow

The flow is closely bound to the transaction protocol. Counters are modified during the stages of the 2PC protocol used by Ignite to ensure distributed consistency.

In short, the 2PC protocol consists of lock, prepare and commit phases.

To explain how this is applied to achieve consistency, let us consider an example.

Suppose we have 3 server nodes and 1 client node in the topology, and the topology version is N.

Two threads on a single client concurrently start two transactions updating keys belonging to the same partition p.

Tx1 enlists and updates keys k1, k2, k3; Tx2 enlists and updates keys k4, k5:

F(k1) = F(k2) = F(k3) = F(k4) = F(k5) = p.

Ignite client = startClient();

IgniteCache<Integer, Integer> cache = client.cache(name);

...

Thread 1:

try(Transaction tx1 = client.transactions().txStart()) {
      cache.putAll(Map.of(k1, v1, k2, v2, k3, v3));
      tx1.commit();
}

Thread 2:

try(Transaction tx2 = client.transactions().txStart()) {
      cache.putAll(Map.of(k4, v4, k5, v5));

      tx2.commit();

}

2PC

Transactions use a two-phase commit protocol to consistently update all participating nodes.

The first step (near map) on the near (originating) node is to calculate the map of this transaction: for each involved partition p, the set of primary nodes is determined using the affinity function F(p).
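
As an illustration, building the near map amounts to grouping the enlisted keys by the primary node of their partition on the current topology version. The sketch below uses hypothetical Node and Affinity placeholders and is not the actual Ignite code:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class NearMapSketch {
    /** Placeholder for a cluster node. */
    interface Node {}

    /** Placeholder for the affinity function for the current topology version. */
    interface Affinity {
        int partition(Object key);   // Maps a key to its partition.
        Node primary(int part);      // Returns the first owner in F(part), i.e. the primary node.
    }

    /** Groups the transaction's keys by the primary node of each key's partition. */
    public static Map<Node, Set<Integer>> nearMap(Iterable<Integer> keys, Affinity affinity) {
        Map<Node, Set<Integer>> map = new HashMap<>();

        for (Integer key : keys) {
            int p = affinity.partition(key);    // p = F(key): the partition of the key.
            Node primary = affinity.primary(p); // The primary node for partition p.

            map.computeIfAbsent(primary, n -> new HashSet<>()).add(key);
        }

        // One prepare request will be sent to each primary node in the resulting map.
        return map;
    }
}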

...