Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

/** Queue of applied out of order counter updates. */
private TreeSet<Item> queue oomQueue = new TreeSet<>();

Low watermark, or update counter is used to track sequential updates. It is only incremented when corresponding update is recorded to durable storage and no missed updates are exist with lesser counter value.

...

Out of order update queue is used, literally, for recording updates applied out of order. This is possible due to a fact what updates this higher reservation counter could be applied before updates with lesser reservation counter, causing gaps in update sequence.

Consider the following To explain how this is applied to achive consistency let us consider the example.

Suppose we have 3 server nodes and 1 client node in topology.

...

      cache.put (k5, v5);

      tx2.commit();

}

A few lines about how transactions workTo deeper understand counter flow it is necessary to explain how ditributed transactions are working.

Txs use 2PC protocol to consistently update all participating nodes.

The first step (near map) on the near(originating) node is to calculate the map of this transaction - for each partition p a set of involved primary nodes . Since one key is involved in the transaction, there will be only one node in the card. using the affinity function F(p) .

For this partition on this topology version of the topology, a Lock lock request is sent that executes the lock (k) operation, the semantics of which are is similar to acquiring exclusuve key lockslock on given key (footnote: this is true for pessimistic transactions).

It may happen that on the primary node at the time of receipt of the request, the distribution on the version of the topology N + 1 is already relevant.

In this case, a response is sent to the near node indicating that a more current version of the topology is used. This process is called remap.

If the lock is successfully acquired , the prepare phase begins. At the primary node, there is a “reservation” of the update counter for the partition for the locked key — an increase in the hwm field. 

First, each enlisted key is assigned on update sequence number using HWM counter field.

Then In this phase, the primary node builds a backup backups map (dht map) and sends prepare requests to the backup nodes, if any, containing enlisted keys and their update numbers.

. These requests contain information about the update number for this key. Thus, update numbers are synchronized between primaries and backups - assigned to primaries, then transferred to backups.

If all the backup nodes answered positively, then the transaction goes into the PREPARED state. In this state, it is ready for commit (COMMIT).

During commit, all transactional changes are applied to the WAL and the counter of applied updates ONLY AFTER WHAT the LWM is increased. At this point, lwm LWM catches up with hwm.HWM.

Because transaction commits can reorder it's possible to have some updates applied out of order. The field oomQueue is used for tracking such updates.

The figure 3 shows transaction map for tx1:

TODO figure 3

Lets return to our example and Lets assume all locks are aquired and transactions are started to prepare almost simultaneously.

...

HWM is increment during tx prepare phase on the primary node to ensure the monotonicity of the counter and invariant HWM >= LWM holding.

Update counters are used to determine the state when the one partition copy is behind the other(s).

It is only safe to compare LWMs because of guarantee of presence of sequential updates up to LWM.

Lets assume partition N on node1 have LWM=m and partition N on node2 have LWM=k, m > k. This means partition on node2 is lagging behind node1 and number of missing updates is m-k.

Thus, it is very important to correctly track update counters, since thier values are used to calculate when rebalancing is necessary to launch to restore the integrity of partitions [2].

One more imporant overvation: LWM could safely be incremented if the update is written to the WAL. Otherwise, a situation is possible when the counter has increased, but there is no corresponding update.

Let us consider in more detail how counters are used to restore consistency (counters rebalance [2]) when outdated node joins topology.

...

The simplified rebalancing flow is illustratred on fugure 3.

TODO fig. 3

Update counters are used to determine the state when the one partition copy is behind the other(s).

It is only safe to compare LWMs because of guarantee of presence of sequential updates up to LWM.

Thus, it is very important to correctly track update counters, since thier values are used to calculate when rebalancing is necessary to launch to restore the integrity of partitions [2].

One more imporant overvation: LWM could safely be incremented if the update is written to the WAL. Otherwise, a situation is possible when the counter has increased, but there is no corresponding update.

Crash recovery

There is no guarantee that this update will be completed, for example, if during the dht prepare transaction some nodes will exit, the transaction may be rolled back.
When a node’s primary collapse under load situations may arise where may be permanent gaps in the updates on backup nodes in the event of transaction reordering.

...