Update counters are used to determine whether a backup has fallen behind the primary. A lag can occur, for example, if a node goes down under load and misses updates to one of the partitions for which it is an owner. If the number of applied updates is the same for every copy of a partition, the copies are identical.
If the counter on one of the copies lags behind by k, then applying the k missing updates is enough to synchronize the copies. This is relevant only for persistent mode.
For in-memory caches, the partition is always restored from scratch.
Thus, it is extremely important to track update counters correctly, since they determine whether a rebalance will be launched to restore the integrity of partitions [2].
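
As a rough illustration of the idea, here is a minimal, self-contained sketch (the names are purely illustrative and are not Ignite's actual API): two copies of a partition are compared by their applied-update counters, and the lag tells how many updates the backup copy is missing.

// Hypothetical sketch: comparing the update counters of two copies of one partition.
public class CounterLagExample {
    public static void main(String[] args) {
        long primaryApplied = 100; // updates applied on the primary copy
        long backupApplied = 97;   // updates applied on a lagging backup copy

        long lag = primaryApplied - backupApplied;

        if (lag == 0)
            System.out.println("Copies are identical, no rebalance is needed");
        else
            // In persistent mode only the 'lag' missing updates have to be sent
            // (historical rebalance); in-memory partitions are rebuilt from scratch.
            System.out.println("The backup is missing " + lag + " updates");
    }
}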

Example
Consider how integrity is achieved, using a transactional key insertion as an example. The topology has four nodes: three servers and one client.

Transaction tx = ignite.transactions().txStart();
cache.put(k, v); // lock
tx.commit();     // prepare, commit

The transaction starts on some node of the grid (it may be a client node), which currently sees topology version N and the partition distribution across nodes calculated for it.
In Ignite terminology, this is the near, or originating, node.
Ignite uses a two-phase locking protocol to provide transactional guarantees.

The first step (near map) on the near node is to calculate the transaction map: the set of involved primary nodes. Since only one key is involved in the transaction, the map will contain a single node. A Lock request is sent to that node for this partition on this topology version; it executes the lock(k) operation, whose semantics are similar to acquiring an exclusive lock on the key.
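
The key-to-primary mapping that the near node performs while building the transaction map can be observed through Ignite's public Affinity API. The sketch below assumes a locally started node and a cache named "tx-cache"; both names are assumptions made only for this example.

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.affinity.Affinity;
import org.apache.ignite.cluster.ClusterNode;

// Sketch of resolving the primary node for a key via the public Affinity API.
public class NearMapExample {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            ignite.getOrCreateCache("tx-cache");

            Affinity<Integer> aff = ignite.affinity("tx-cache");

            int key = 42;

            // The near node maps each key of the transaction to its primary node;
            // with a single key the transaction map contains a single node.
            ClusterNode primary = aff.mapKeyToNode(key);

            System.out.println("Partition: " + aff.partition(key)
                + ", primary: " + (primary == null ? "none" : primary.id()));
        }
    }
}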

It may happen that, by the time the request arrives at the primary node, the distribution for topology version N + 1 is already in effect. In this case, a response is sent to the near node indicating that a newer topology version must be used. This process is called remap.

If the lock is successfully acquired, the prepare phase begins. On the primary node, an update counter value for the partition is “reserved” for the locked key by incrementing the hwm field.
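
A minimal sketch of this reservation step, assuming a simplified counter model rather than Ignite's actual internals: hwm is just a monotonically growing number handed out to each new update.

import java.util.concurrent.atomic.AtomicLong;

// Hypothetical model of the "reservation": the primary hands out the next update
// number by bumping hwm. Names are illustrative, not Ignite internals.
public class ReserveExample {
    private final AtomicLong hwm = new AtomicLong(); // highest reserved counter

    long reserve() {
        return hwm.incrementAndGet(); // update number assigned to the locked key
    }

    public static void main(String[] args) {
        ReserveExample cntr = new ReserveExample();
        System.out.println("Reserved update number: " + cntr.reserve()); // 1
        System.out.println("Reserved update number: " + cntr.reserve()); // 2
    }
}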

There is no guarantee that this update will be completed: for example, if some nodes leave during the dht prepare phase, the transaction may be rolled back.

In this phase, the primary node builds the backup map (dht map) and sends prepare requests to the backup nodes, if any. These requests carry the update number assigned to this key. Thus, update numbers are kept in sync between primaries and backups: they are assigned on the primary and then transferred to the backups.

If all backup nodes respond positively, the transaction moves to the PREPARED state. In this state, it is ready to be committed (COMMIT). During commit, all transactional changes are applied to the WAL and the counter of applied updates is increased. At this point, lwm catches up with hwm.

Important: the counter of applied updates must be increased only after the update is in the WAL. Otherwise, a situation is possible where the counter has increased but the corresponding update does not exist.
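
The following simplified sketch models this rule (it is not Ignite's actual counter implementation): an update number is reported only after its WAL record is written, and the applied counter (lwm) advances only over a contiguous prefix, so out-of-order commits leave temporary gaps.

import java.util.TreeSet;

// Hypothetical model of the applied-update counter.
public class AppliedCounterExample {
    private long lwm;                                         // highest applied update with no gaps below
    private final TreeSet<Long> outOfOrder = new TreeSet<>(); // updates applied above lwm

    void onWalWritten(long updateNumber) {
        // The counter is advanced strictly after the corresponding WAL record exists.
        outOfOrder.add(updateNumber);
        while (outOfOrder.remove(lwm + 1))
            lwm++;                                            // lwm catches up with hwm
    }

    public static void main(String[] args) {
        AppliedCounterExample cntr = new AppliedCounterExample();
        cntr.onWalWritten(1);
        cntr.onWalWritten(3);                    // update 2 is still missing: a gap, lwm stays at 1
        System.out.println("lwm = " + cntr.lwm); // 1
        cntr.onWalWritten(2);                    // gap closed
        System.out.println("lwm = " + cntr.lwm); // 3
    }
}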

TODO counter flow pic (fig.1).

Let us consider in more detail how counters are used to restore consistency (rebalance [2]).

When a node joins, during PME it sends, for each partition, its lwm as well as the depth of the history it has for that partition, which is used to determine whether historical rebalance is possible. The history is defined by a pair of counters (earliest_checkpoint_counter, lwm). The counter guarantees that there are no missed updates within this history.
At the same time, the node may have missed updates with higher counter values.
The coordinator compares the lwm values and selects the node (or nodes) with the highest lwm; this information is sent to the joining node, which starts the rebalance procedure if it is behind. The rebalance is either full, with a preliminary clearing of the partition, or historical, if some node holds the necessary update history in its WAL.
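
A hypothetical sketch of this decision logic (all names are illustrative, not Ignite's actual classes): the coordinator picks the owner with the highest lwm and chooses historical rebalance only if that owner's WAL history reaches back to the joining node's lwm.

import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class RebalanceDecisionExample {
    static class PartitionState {
        final String nodeId;
        final long lwm;                       // applied-update counter
        final long earliestCheckpointCounter; // start of the WAL history kept on the node

        PartitionState(String nodeId, long lwm, long earliestCheckpointCounter) {
            this.nodeId = nodeId;
            this.lwm = lwm;
            this.earliestCheckpointCounter = earliestCheckpointCounter;
        }
    }

    public static void main(String[] args) {
        long demanderLwm = 95; // lwm reported by the joining node for this partition

        List<PartitionState> owners = Arrays.asList(
            new PartitionState("A", 120, 90),
            new PartitionState("B", 118, 100));

        // The node with the highest lwm holds the most up-to-date copy.
        PartitionState supplier = owners.stream()
            .max(Comparator.comparingLong((PartitionState s) -> s.lwm))
            .get();

        // Historical rebalance is possible only if the supplier's WAL history
        // (earliestCheckpointCounter .. lwm) covers the demander's lwm.
        boolean historical = supplier.earliestCheckpointCounter <= demanderLwm;

        System.out.println("Supplier: " + supplier.nodeId + ", rebalance: "
            + (historical ? "historical" : "full (clear partition first)"));
    }
}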

TODO rebalance flow pic (fig.2).

Crash recovery
When a primary node crashes, permanent gaps can remain in the update counters on the backup nodes if transactions were reordered: transactions that started later and received higher counter values completed before a transaction with a lower counter value was committed, and then the primary failed without sending the missing updates.
In this case, the gap is closed during the node-left PME, and is likewise closed in the WAL using a RollbackRecord if persistence is enabled.
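
A simplified sketch of closing such a gap (again an illustrative model, not the actual implementation): the lost range is marked as resolved, which lets lwm advance past it; with persistence the same fact is recorded in the WAL as a RollbackRecord.

import java.util.Arrays;
import java.util.TreeSet;

public class GapCloseExample {
    private long lwm = 3;                                                           // contiguous prefix of applied updates
    private final TreeSet<Long> applied = new TreeSet<>(Arrays.asList(5L, 6L));    // applied out of order

    void closeGap(long from, long to) {
        for (long c = from; c <= to; c++)
            applied.add(c);            // treat the lost updates as resolved (rolled back)
        while (applied.remove(lwm + 1))
            lwm++;                     // lwm can now advance past the former gap
    }

    public static void main(String[] args) {
        GapCloseExample cntr = new GapCloseExample();
        cntr.closeGap(4, 4);           // update #4 was lost together with the failed primary
        System.out.println("lwm = " + cntr.lwm); // prints: lwm = 6
    }
}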

In the case of tx recovery, if it is decided to roll back the transaction, a situation is possible where one of the dht nodes did not receive information about the start of the transaction while the others did.
This would leave the counters out of sync. To prevent this, the dht nodes exchange counter information with each other in a neighborcast request, so that the counters in the partition copies on all owning nodes are aligned.

If the primary node fails, one of the backups becomes the new primary, and thanks to the PME protocol the switch to the new primary happens atomically; on the new primary, hwm becomes equal to the last known lwm for the partition (see fig.2).
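
A minimal sketch of this switch under the same simplified counter model (illustrative names only): the former backup resets hwm to its lwm, so new reservations continue from the last applied update.

public class PrimarySwitchExample {
    long lwm = 97;   // updates applied on the former backup
    long hwm = 100;  // reservations made by the failed primary that never arrived

    void onBecomePrimary() {
        hwm = lwm;   // new reservations continue from the last known applied update
    }

    public static void main(String[] args) {
        PrimarySwitchExample cntr = new PrimarySwitchExample();
        cntr.onBecomePrimary();
        System.out.println("hwm = " + cntr.hwm); // prints: hwm = 97
    }
}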

Switching primaries
It is important to ensure that the primary for a partition is switched atomically when a node joins or leaves; otherwise the lwm <= hwm invariant breaks. This invariant guarantees that, as long as at least one consistent copy is available, every other copy can restore its integrity from it. Atomicity is achieved because transactions running on the previous topology are completed before the switch to the new one (as a phase of the PME protocol), so two primaries for the same partition can never exist in the grid at the same time.

If a node crashed under load, local recovery from the WAL is started after the restart. Some updates could have been missed at this point. During WAL recovery, the correct lwm is determined because the gaps in the counter are stored, and this lwm is then used for rebalance. The rebalance closes all gaps, and once the node has joined, they can be cleared.