...

[2] Counter-based rebalancing.
In the second case, which is relevant only for the persistence mode, rebalance is used to restore the consistency of the data if a node was absent from the topology for some time while the cluster was under load.
When the node re-joins the topology, the partition update counter is analyzed to determine how far the partition has “lagged” behind the other copies.
Depending on how far the partition is lagging behind, the second type of rebalance can use the write-ahead log as a data source. In this case, the rebalancing can be performed without a complete reload of the partition and is called “historical” rebalance.
Another scenario for [2] is the activation of the grid, where some nodes may have stale data.
Rebalance is an exchange of supply and demand messages between nodes. A demand message is sent from the node where the partition has stale data to a node with up-to-date data. Supply messages are sent in response to the demand message; after a supply message is processed, the next demand message is sent, and so on, until there is nothing more to supply.
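As an illustration of this flow, below is a minimal sketch of how a supplier node could choose between historical (WAL-based) and full rebalance based on the counter carried in a demand message. All names here (DemandMessage, earliestCounterInWal, etc.) are assumptions made for the sketch, not the actual Ignite rebalancing classes.

    // Illustrative sketch only; names and structure are hypothetical, not the Ignite implementation.
    class DemandMessage {
        final int partId;
        final long lwm; // demander's last applied update counter for the partition

        DemandMessage(int partId, long lwm) {
            this.partId = partId;
            this.lwm = lwm;
        }
    }

    class Supplier {
        /** Oldest update counter for which this node still has WAL history for the partition. */
        long earliestCounterInWal(int partId) { return 0; /* stub */ }

        /** Decide how to serve a demand message. */
        void onDemand(DemandMessage msg) {
            if (msg.lwm >= earliestCounterInWal(msg.partId))
                sendWalUpdatesSince(msg.partId, msg.lwm); // historical rebalance: replay only the missing updates
            else
                sendFullPartition(msg.partId);            // too far behind (or no history): full rebalance
        }

        void sendWalUpdatesSince(int partId, long fromCounter) { /* supply messages with WAL deltas */ }

        void sendFullPartition(int partId) { /* supply messages with all partition entries */ }
    }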

Update counter structure and flow

Each update to a partition receives a sequential number in order to ensure data consistency between copies. A special structure, the partition update counter, is used to assign and increment update counters.
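Below is a minimal sketch of such a structure, built around the two fields referred to in this document as LWM and HWM; the class and method names are illustrative assumptions, not the actual Ignite classes.

    // Illustrative sketch of a partition update counter; names are assumptions, not the Ignite implementation.
    class PartitionUpdateCounterSketch {
        /** Low watermark: the highest counter up to which ALL updates are applied (no gaps below it). */
        private long lwm;

        /** High watermark: the highest counter ever reserved; only grows, so hwm >= lwm always holds. */
        private long hwm;

        /** Called on the primary during tx prepare: reserves `size` counters for pending updates. */
        synchronized long reserve(long size) {
            long start = hwm;
            hwm += size; // HWM grows even if the transaction is later rolled back
            return start;
        }

        /** Called on commit, after the update is written to the WAL: advances LWM over the applied range. */
        synchronized void applyUpdate(long start, long delta) {
            // Simplified: assumes updates are applied in counter order; a real implementation
            // has to track out-of-order ranges (gaps) before it can advance LWM.
            if (start == lwm)
                lwm += delta;
        }

        synchronized long lwm() { return lwm; }

        synchronized long hwm() { return hwm; }
    }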

...

      cache.put(k5, v5);

      tx2.commit();

}

Let's assume all locks are acquired and both transactions start to prepare almost simultaneously.

The diagram shows counter state updates on the primary and backup during processing of the two transactions with reordered prepare/commit.

TODO pic

HWM is incremented during the tx prepare phase, and only on the primary node, to ensure the monotonicity of the counter and to keep the invariant HWM >= LWM holding.

Update counters are used to determine the state when one partition copy is behind the other(s).

It is only safe to compare LWMs, because all sequential updates up to the LWM are guaranteed to be present.

Let's assume partition N on node1 has LWM=m and partition N on node2 has LWM=k, with m > k. This means the partition on node2 is lagging behind node1, and the number of missing updates is m-k.

Thus, it is extremely important to track update counters correctly, since their values are used to decide when rebalancing must be launched to restore the integrity of partitions [2].
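As a tiny illustration of the comparison above, here is a hypothetical helper that, given the LWMs of all copies of a partition, reports which copies lag and by how many updates (java.util imports assumed):

    // Hypothetical helper: compare LWMs of all copies of one partition.
    // The copy with the largest LWM is the most up-to-date one; every other copy
    // misses (maxLwm - ownLwm) updates and has to be rebalanced.
    static Map<String, Long> missingUpdates(Map<String, Long> lwmByNode) {
        long maxLwm = Collections.max(lwmByNode.values());

        Map<String, Long> lag = new HashMap<>();

        for (Map.Entry<String, Long> e : lwmByNode.entrySet())
            if (e.getValue() < maxLwm)
                lag.put(e.getKey(), maxLwm - e.getValue()); // e.g. node2 misses m - k updates

        return lag;
    }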

Example
Consider how integrity is achieved using a transactional key insertion operation as an example. The topology has 4 nodes: three servers and one client.

Transaction tx = ignite.transactions().txStart();
cache.put(k, v); // lock
tx.commit();     // prepare, commit

A few lines about how transactions work.

TODO enrich

Transactions use a 2PC protocol. A transaction starts on some node of the grid (it can be a client node); at this point the current topology version N is fixed and the partition distribution across nodes is calculated for it.
The Apache Ignite project uses the term near node for the tx originating node.
Ignite uses a two-phase locking protocol to provide transactional guarantees.

The first step (near map) on the near node is to calculate the map of this transaction - the set of involved primary nodes. Since only one key is involved in the transaction, there will be only one node in this map. For the key's partition on this version of the topology, a Lock request is sent that executes the lock(k) operation, the semantics of which are similar to acquiring an exclusive lock on the key.
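Conceptually, the near map boils down to resolving, for each key, its primary node on the current topology version. For illustration only, this can be expressed via Ignite's public affinity API (the cache name and key are placeholders):

    // Sketch: resolve the primary node for a key, analogous to what the near node
    // computes when building the transaction map (near map).
    static ClusterNode primaryFor(Ignite ignite, String cacheName, Object key) {
        return ignite.affinity(cacheName).mapKeyToNode(key);
    }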

...

If the lock is successfully acquired, the prepare phase begins. On the primary node, the update counter of the partition owning the locked key is “reserved” by increasing the hwm field. There is no guarantee that this update will ever be completed: for example, if some nodes leave during the dht prepare phase, the transaction may be rolled back.

In this phase, the primary node builds a backup map (dht map) and sends prepare requests to the backup nodes, if any. These requests carry the update counter assigned to this key. Thus, update counters are synchronized between primaries and backups: they are assigned on the primary and then transferred to the backups.

If all the backup nodes answered positively, the transaction goes into the PREPARED state. In this state, it is ready to be committed (COMMIT). During commit, all transactional changes are applied to the WAL and the counter of applied updates is increased. At this point, lwm catches up with hwm.
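A condensed sketch of this counter flow on the primary node, reusing the PartitionUpdateCounterSketch from above; the method names and the message layout are assumptions, not the real Ignite protocol classes:

    // Hypothetical sketch of counter handling during 2PC on the primary node.
    class PrimaryPartitionSketch {
        final PartitionUpdateCounterSketch cntr = new PartitionUpdateCounterSketch();

        /** Prepare phase: reserve counters on the primary and ship them to backups in the prepare requests. */
        long onPrepare(int updatesInTx) {
            long reserved = cntr.reserve(updatesInTx); // moves HWM forward, LWM is untouched

            // send prepare requests to backups carrying `reserved`, so that backups use the same numbers

            return reserved;
        }

        /** Commit phase: write to the WAL first, then advance the applied counter (LWM catches up with HWM). */
        void onCommit(long reserved, int updatesInTx) {
            writeToWal(reserved, updatesInTx);

            cntr.applyUpdate(reserved, updatesInTx);
        }

        void writeToWal(long start, long delta) { /* log data records together with their update counters */ }
    }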




One more important observation: LWM may only be safely incremented if the corresponding update has been written to the WAL. Otherwise, a situation is possible where the counter has increased but there is no corresponding durable update.


TODO counter flow pic (fig.1).

Rebalancing

Let us consider in more detail how counters are used to restore consistency (rebalance [2]).

...

TODO rebalance flow pic (fig.2).

Crash recovery

There is no guarantee that a reserved update will ever be completed: for example, if some nodes leave during the dht prepare phase, the transaction may be rolled back.
When a primary node crashes under load, permanent gaps can appear in the update counters on the backup nodes if transactions were reordered.

...

In this case, the gaps are closed by a special kind of message, PartitionUpdateCountersRequest, on tx rollback, and are similarly closed in the WAL using a RollbackRecord if persistence is enabled.
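For illustration, with the simplified counter sketch from above, closing such a gap could look like this (a hypothetical method; the real mechanism is the PartitionUpdateCountersRequest / RollbackRecord handling described above):

    // Hypothetical sketch: close a counter gap left by a rolled-back transaction.
    // The reserved range [start, start + delta) will never carry real updates, so it is
    // marked as handled; otherwise LWM could never move past the gap.
    static void onRollback(PartitionUpdateCounterSketch cntr, long start, long delta) {
        // With the simplified counter above this only works when the gap starts exactly at LWM;
        // a real implementation tracks arbitrary out-of-order ranges.
        cntr.applyUpdate(start, delta);
    }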

TODO fig

Switching primaries

If the primary node fails, one of the backups becomes the new primary; due to the PME protocol, the switch to the new primary happens atomically, and its hwm becomes equal to the last known lwm for the partition (see fig. 2).
It is important to ensure that the primary for a partition is switched atomically in node join or leave scenarios, otherwise the HWM >= LWM invariant breaks; this invariant guarantees that if at least one consistent copy is available, all other copies can restore their integrity from it. Atomicity is achieved because transactions started on the previous topology are completed before the switch to the new one (as a phase of the PME protocol), so two primaries for the same partition can never exist in the grid at the same time.
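Continuing the PartitionUpdateCounterSketch from above, a hypothetical method the new primary could invoke on the switch:

    /** Hypothetical addition to PartitionUpdateCounterSketch: called when this copy becomes the new primary after PME. */
    synchronized void onBecomePrimary() {
        // Updates reserved by the failed primary but never delivered to this copy are unknown here,
        // so HWM is reset to the last applied counter; the HWM >= LWM invariant keeps holding.
        hwm = lwm;
    }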

...