This document describes the principles used to achieve consistency between data copies for transactional caches with persistence enabled.

Cluster topology

Cluster topology is an ordered set of client and server nodes at a certain point in time. Each stable topology is assigned a topology version number - a monotonically increasing pair of counters (majorVer, minorVer). Each node maintains a list of nodes in the topology, and the list is the same on all nodes for a given version. It is important to note that each node receives information about a topology change eventually, that is, independently of the other nodes and possibly at a different time.
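For illustration, the version pair can be modeled as a comparable value, so that topology versions are totally ordered. This is a minimal Java sketch under assumed names (Ignite's own class for this purpose is AffinityTopologyVersion):

/** Minimal sketch of a topology version as an ordered pair of counters (assumed names). */
record TopologyVersion(long majorVer, int minorVer) implements Comparable<TopologyVersion> {
    @Override public int compareTo(TopologyVersion other) {
        int cmp = Long.compare(majorVer, other.majorVer);
        return cmp != 0 ? cmp : Integer.compare(minorVer, other.minorVer);
    }
}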

...

      tx2.commit();

}

Figure 1 below shows the flow for the case of backup reordering, where a transaction with higher update counters is applied on the backup before a transaction with lower update counters.

...

If the coordinator detects a lagging node, it sends a partition map with the partition state changed to MOVING, triggering a rebalance on the joining node. The number of missing updates for each partition is calculated as the difference between the highest known LWM and the LWM from the joining node.
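In code this calculation is trivial; the sketch below uses assumed names and is not the actual Ignite API:

// Difference between the highest LWM known in the cluster and the LWM reported
// by the joining node = number of updates to rebalance for the partition.
static long missingUpdates(long highestKnownLwm, long joiningNodeLwm) {
    return Math.max(0L, highestKnownLwm - joiningNodeLwm);
}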

The simplified rebalancing flow for a joining node having an outdated partition with id=0 is illustrated in figure 2:

Update counters are used to determine the state when one partition copy is behind the other(s).

It is only safe to compare LWMs, because all updates with counters below the LWM are guaranteed to exist.

Thus, it is very important to track LWMs correctly, since their values are used to determine when rebalancing must be triggered to restore the integrity of partitions [2].

As already mentioned, the LWM can safely be incremented only after the update is written to the WAL. Otherwise, a situation is possible where the counter has increased but there is no corresponding durable update.
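To make the LWM/HWM bookkeeping concrete, here is a minimal Java sketch of a partition update counter that tracks gaps. It only illustrates the idea described above and is not the actual org.apache.ignite.internal.processors.cache.PartitionUpdateCounter implementation; all names are assumptions.

import java.util.TreeMap;

/** Simplified partition update counter: LWM, HWM and out-of-order gaps (sketch only). */
class PartitionCounterSketch {
    private long lwm;                          // all updates with counter <= lwm are applied and durable
    private long hwm;                          // counters up to hwm have been reserved by transactions
    private final TreeMap<Long, Long> gaps = new TreeMap<>(); // ranges applied above lwm: start -> size

    /** Primary reserves a range of counters during tx prepare; returns the range start. */
    synchronized long reserve(long size) {
        long start = hwm;
        hwm += size;
        return start;
    }

    /** Called only after the update is written to the WAL; only then may the LWM advance. */
    synchronized void updateApplied(long start, long size) {
        if (start > lwm) {
            gaps.put(start, size);             // applied out of order: remember the range, keep lwm
            return;
        }
        lwm = Math.max(lwm, start + size);     // contiguous: advance lwm ...
        // ... and absorb previously applied ranges that have become contiguous
        for (var e = gaps.firstEntry(); e != null && e.getKey() <= lwm; e = gaps.firstEntry()) {
            lwm = Math.max(lwm, e.getKey() + e.getValue());
            gaps.pollFirstEntry();
        }
    }

    synchronized long lwm() { return lwm; }
    synchronized long hwm() { return hwm; }
}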

Crash recovery

There are no guarantees that updates with already assigned counters will ever be applied.

For example, if some mapped nodes exit during the DHT prepare phase, the transaction will be rolled back, or a node can crash during commit. These are the reasons why gaps appear in the update sequence. Several additional mechanisms are introduced to close such gaps.

When a primary node fails under load, permanent gaps may remain in the updates on backup nodes due to transaction reordering. Two cases are possible:

1.

Transactions that started later, with a higher counter value, were prepared on at least one backup before the transaction with a lower counter value could be prepared.

In this case, the gaps will be closed during the node-left PME (TODO add link), and similarly closed in the WAL using a RollbackRecord if persistence is enabled.

Consider figure 3:

[Figure 3]

In the scenario in fig. 3, Tx2 is committed before Tx1 has had a chance to prepare, and then the primary node fails.

All gaps remaining after all transactions are finished are closed during the PME on node left. This scenario is handled by org.apache.ignite.internal.processors.cache.PartitionUpdateCounter#finalizeUpdateCounters.
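In terms of the counter sketch shown earlier, finalization simply closes everything up to the HWM once it is known that no further updates for the old topology can arrive. This is an illustration only, not the real method body:

// Addition to PartitionCounterSketch: close all remaining gaps after all
// transactions have finished (e.g. during the node-left PME).
synchronized void finalizeUpdateCounters() {
    lwm = hwm;      // every reserved counter is now either applied or rolled back
    gaps.clear();   // out-of-order markers are no longer needed
}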

2.

Transactions that started later, with a higher counter value, were prepared only on some backups, but not all, before the transaction with a lower counter value could be prepared. This means some backups have never seen the counter update.

In this case, the gaps will be closed by a special kind of message, PartitionUpdateCountersRequest, on transaction rollback, and similarly closed in the WAL using a RollbackRecord if persistence is enabled.


Consider figure 4:

[Figure 4]

In the scenario in fig. 4, the transaction is prepared only on one backup and the primary fails. A special partition counters update request is used to sync counters with the remaining backup node.
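A hedged sketch of that request is shown below; the class and field names are assumptions, not the real PartitionUpdateCountersRequest, but they convey the idea: the rolled-back counter range is shipped to the backups that never saw it, and each backup closes the range as if it had been applied.

/** Simplified stand-in for a counters sync message sent on tx rollback (assumed names). */
class CountersSyncRequest {
    final int partId;   // partition whose counter must be adjusted
    final long start;   // first counter of the range reserved by the rolled-back tx
    final long size;    // length of the range to close

    CountersSyncRequest(int partId, long start, long size) {
        this.partId = partId;
        this.start = start;
        this.size = size;
    }
}

/** On the receiving backup: close the range, reusing the PartitionCounterSketch above. */
void onCountersSync(PartitionCounterSketch cntr, CountersSyncRequest req) {
    cntr.updateApplied(req.start, req.size);
}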

Importance of atomic primaries switching

As shown in fig. 4, if the primary node fails, one of the backups must become the new primary, according to the affinity assignment for the next topology version. This happens during the counters exchange on PME.

The new primary adjusts its HWM to the maximum known LWM value and starts generating the next counters from it; due to the PME protocol, the switch to the new primary is atomic.
It is important to ensure the atomicity of switching a partition's primary in node join or leave scenarios; otherwise the invariant HWM >= LWM would break. This invariant guarantees that if at least one stable copy is available, every other copy can restore its integrity from it.

This is achieved by the fact that transactions on the previous topology are completed before switching to the new one (as a phase of the PME protocol), so a situation where two primaries for the same partition exist in the grid simultaneously is impossible.

When a node crashes under load and is restarted, local recovery from the WAL is started, and some of the updates could have been skipped. During WAL recovery the correct LWM is determined, because gaps are stored as part of the counter state, and this LWM is then used for rebalancing. Rebalancing closes all gaps, and once the node has joined the topology they can be cleared.
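Tying this together, a minimal sketch of the counter adjustment a new primary performs during the counters exchange (assumed names; the real logic lives in the PME and partition counter code):

// The new primary adopts the highest LWM among the remaining owners as the
// starting point for generating new update counters, preserving HWM >= LWM.
static long newPrimaryStartCounter(java.util.Collection<Long> ownerLwms) {
    return ownerLwms.stream().mapToLong(Long::longValue).max().orElse(0L);
}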