...

Thanks to these properties, linear scalability is achieved when new nodes are added to the grid.

Rebalancing

Rebalancing is a procedure for exchanging data between nodes to ensure uniform distribution of data (redistribution of data between nodes) [1] and to ensure data consistency [2].

[1] Partition state based rebalancing
In the first case, the reason for rebalancing is a change in the list of nodes that own the partition. If a new node has joined the topology, part of the partitions will be reassigned to it, because the affinity function distributes partitions uniformly among the nodes. If a node has left the topology, rebalancing is triggered to ensure the availability of the required number of copies of each partition in the grid.
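To make the affinity mapping concrete, the sketch below resolves the partition a key belongs to and its current owner through the public Affinity API (the helper method and cache name parameter are illustrative assumptions, not part of the rebalancing code). After a topology change the same call may return a different owner, which is exactly what triggers this type of rebalancing.

import org.apache.ignite.Ignite;
import org.apache.ignite.cache.affinity.Affinity;
import org.apache.ignite.cluster.ClusterNode;

/** Sketch: resolve the current owner of the partition a key belongs to. */
static ClusterNode primaryOwner(Ignite ignite, String cacheName, Object key) {
    Affinity<Object> aff = ignite.affinity(cacheName);

    int part = aff.partition(key); // partition assigned to the key by the affinity function

    // The partition-to-node mapping changes when nodes join or leave the topology.
    return aff.mapPartitionToNode(part);
}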

[2] Counter based rebalancing.
The second case is relevant only when persistence is enabled: rebalancing is used to restore data consistency if a node was absent from the topology for some time while the cluster was under load.
When the node re-joins the topology, its partition update counters are analyzed to determine how far each partition has “lagged” behind the other copies.
Depending on how far the partition is lagging behind, the second type of rebalancing can use the write-ahead log as a data source. In this case, rebalancing can be performed without completely reloading the partition and is called “historical” rebalancing.
Another scenario for [2] is the activation of the grid, where some nodes may have stale data.
Rebalancing is an exchange of supply and demand messages between nodes. A demand message is sent from the node where a partition has stale data to a node with up-to-date data. A supply message is sent in response to the demand message; after processing a supply message, the next demand message is sent, and so on, until there is nothing more to supply.
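The message flow can be sketched roughly as follows. This is a simplified illustration only; the types and methods are hypothetical and do not correspond to Ignite's internal demander/supplier classes.

// Hypothetical sketch of the demand/supply loop on the demander side.
void rebalancePartition(int partId, SupplierNode supplier) {
    // Ask for data starting from our last applied update counter; for historical
    // rebalancing the supplier can serve the delta from its write-ahead log.
    DemandMessage demand = new DemandMessage(partId, lastAppliedCounter(partId));

    while (true) {
        SupplyMessage supply = supplier.request(demand); // batch of entries

        applyBatch(partId, supply.entries());            // apply to the local partition

        if (supply.last())                               // nothing more to supply
            return;

        demand = demand.next();                          // request the next batch
    }
}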

Update counter structure

Each update to a partition receives a sequential number to ensure data consistency between copies. A special structure, the partition update counter = [lwm, out-of-order updates tracking set, hwm], is used to assign and increment update counters.


The partition update counter has the following structure:

/** Low watermark. */
private final AtomicLong cntr = new AtomicLong();

/** High watermark. */
protected final AtomicLong reserveCntr = new AtomicLong();

/** Queue of applied out of order counter updates. */
private TreeSet<Item> queue = new TreeSet<>();

Low watermark, or update counter, is used to track sequential updates. It is incremented only when the corresponding update has been recorded to durable storage and no missed updates with a lesser counter value exist.

For example, LWM value=5 means that all updates with assigned counters 1, 2, 3, 4, 5 have been applied to the WAL.

High watermark, or reservation counter, is incremented for each pending update, which is not even guaranteed to succeed. The reservation counter is assigned during the tx prepare phase.

Given all of the above, there is an important invariant: HWM >= LWM.

The out-of-order update queue is used, literally, for recording updates applied out of order. This is possible because updates with a higher reservation counter can be applied before updates with a lesser reservation counter, causing gaps in the update sequence.
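A minimal model of this structure is sketched below. It is an illustration under assumptions, not Ignite's actual PartitionUpdateCounter: the class name is made up, and a TreeMap of ranges stands in for the out-of-order tracking set. reserve() moves the HWM during prepare; update() applies a range and advances the LWM only while there are no gaps.

import java.util.Map;
import java.util.TreeMap;

/** Simplified model of a partition update counter (illustration only). */
class SimplePartitionUpdateCounter {
    private long lwm;                                               // low watermark: applied without gaps
    private long hwm;                                               // high watermark: reserved
    private final TreeMap<Long, Long> outOfOrder = new TreeMap<>(); // start -> delta of ranges applied out of order

    /** Reserves 'delta' counters during tx prepare; returns the start of the reserved range. */
    synchronized long reserve(long delta) {
        long start = hwm;
        hwm += delta;
        return start;
    }

    /** Marks the range (start, start + delta] as applied, e.g. recorded to the WAL. */
    synchronized void update(long start, long delta) {
        if (start > lwm) {
            outOfOrder.put(start, delta); // a gap exists below this range: LWM stays put
            return;
        }

        lwm = Math.max(lwm, start + delta);

        // Close previously recorded out-of-order ranges that are now reachable.
        while (!outOfOrder.isEmpty() && outOfOrder.firstKey() <= lwm) {
            Map.Entry<Long, Long> e = outOfOrder.pollFirstEntry();
            lwm = Math.max(lwm, e.getKey() + e.getValue());
        }
    }

    synchronized long lwm() { return lwm; } // invariant: hwm() >= lwm()
    synchronized long hwm() { return hwm; }
}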

Consider the following example.

Suppose we have 3 server nodes and 1 client node in topology.

Two threads on the client start two transactions updating keys from the same partition.

Tx1 enlists and updates keys k1, k2, k3; Tx2 enlists and updates keys k4, k5:

Ignite client = ...

IgniteCache cache = client.cache(name);

Thread 1:

try (Transaction tx1 = client.transactions().txStart()) {
    cache.put(k1, v1);
    cache.put(k2, v2);
    cache.put(k3, v3);

    tx1.commit();
}

Thread 2:

try (Transaction tx2 = client.transactions().txStart()) {
    cache.put(k4, v4);
    cache.put(k5, v5);

    tx2.commit();
}

Let's assume all locks are acquired and both transactions start to prepare almost simultaneously.
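Using the simplified counter model sketched earlier (an illustration only; the real interleaving on the primary node may differ), the partition counters could evolve as follows if Tx1 prepares first but Tx2 commits first:

SimplePartitionUpdateCounter cntr = new SimplePartitionUpdateCounter();

long tx1Start = cntr.reserve(3); // Tx1 prepares: counters 1..3 reserved, HWM = 3
long tx2Start = cntr.reserve(2); // Tx2 prepares: counters 4..5 reserved, HWM = 5

cntr.update(tx2Start, 2);        // Tx2 commits first: range applied out of order, LWM still 0
cntr.update(tx1Start, 3);        // Tx1 commits: the gap closes, LWM jumps to 5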




Consider an example of counter status:

...

A transaction starts on some node of the grid; it can be a client node. At this moment the topology version is N, and the partition distribution over nodes is calculated for it.
The Apache Ignite project uses the following terminology for the transaction originating node: near node.
Ignite uses a two-phase locking (2PL) protocol to provide transactional guarantees.
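For example, with a pessimistic transaction the two phases of locking are directly visible at the API level: locks are acquired as keys are touched and released only on commit or rollback. The snippet below reuses the client, cache, and keys from the example above and is only an illustration of 2PL, not of the commit protocol itself.

import org.apache.ignite.transactions.Transaction;

import static org.apache.ignite.transactions.TransactionConcurrency.PESSIMISTIC;
import static org.apache.ignite.transactions.TransactionIsolation.REPEATABLE_READ;

try (Transaction tx = client.transactions().txStart(PESSIMISTIC, REPEATABLE_READ)) {
    cache.put(k1, v1); // growing phase: lock on k1 acquired here
    cache.put(k2, v2); // growing phase: lock on k2 acquired here

    tx.commit();       // shrinking phase: both locks released on commit
}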

...