...

  • sent and received might be wider: collect transactions that are not only PREPARED but also PREPARING and later states. Do not track ACTIVE transactions, as the algorithm guarantees they will be excluded from the snapshot (ACTIVE < PREPARED, so such transactions are part of neither ChannelState nor LocalState).
  • auto-shrink must be disabled during the snapshot process (see the sketch below).
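
A minimal sketch of such a tracker, assuming the simplified names used on this page (committingTxs, auto-shrink) and a plain long in place of GridCacheVersion; this is an illustration, not Ignite's actual code:

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative tracker for COMMITTING+ transactions (not Ignite's actual code). */
class CommittingTxsTracker {
    /** Versions of transactions in COMMITTING or later states. */
    private final Set<Long> committingTxs = ConcurrentHashMap.newKeySet();

    /** While true, finished transactions are kept for the running snapshot. */
    private volatile boolean autoShrinkDisabled;

    /** Transaction left IgniteTxManager#activeTx and entered COMMITTING+. */
    void onTxCommitting(long xidVer) {
        committingTxs.add(xidVer);
    }

    /** Auto-shrink: forget a transaction as soon as it has committed. */
    void onTxCommitted(long xidVer) {
        if (!autoShrinkDisabled)
            committingTxs.remove(xidVer);
    }

    /** Local snapshot started: keep entries until the snapshot collects them. */
    void onSnapshotStart() {
        autoShrinkDisabled = true;
    }

    /** Local snapshot finished: re-enable auto-shrink and drop retained entries
     * (a real implementation would remove only the already-finished ones). */
    void onSnapshotEnd() {
        autoShrinkDisabled = false;
        committingTxs.clear();
    }
}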

...

  • Introduce a new collection, excluded: messages that changed LocalState concurrently with the local snapshot but are not part of the snapshot's ChannelState (they are part of the next ChannelState, after the snapshot).
  • Ignite then needs a rule to distinguish the different ChannelStates (before and after the snapshot). Possible solutions:

...

  • Add an additional flag to FinishMessage that shows which side of the snapshot the transaction belongs to on the near node (see the sketch below).
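
A sketch of that flag on a simplified stand-in for the finish message (in Ignite the real classes are GridNearTxFinishRequest / GridDhtTxFinishRequest); the field name is hypothetical:

/** Illustrative finish message carrying the proposed snapshot-side flag. */
class FinishMessage {
    /** Near transaction version. */
    final long nearXidVersion;

    /**
     * Proposed flag, set by the near node: true if the transaction belongs to
     * the ChannelState after the snapshot, false if to the one before it.
     */
    final boolean afterSnapshot;

    FinishMessage(long nearXidVersion, boolean afterSnapshot) {
        this.nearXidVersion = nearXidVersion;
        this.afterSnapshot = afterSnapshot;
    }
}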

...

  1. But the order of transactions is not guaranteed, so the snapshot may include a transaction whose version is greater than the version of a transaction after the snapshot.
  2. This can be overcome with an additional step: locally collecting transactions for an extra exclude list after all transactions have committed.

Also, there is no need to send the collection with FinishMessage. The whole lock is then reduced to a single volatile variable (see the sketch after the algorithm steps below):

...

  1. Initial state:
    1. Ignite nodes started from snapshot, or other consistent state (after graceful cluster stop / deactivation).
    2. Every Ignite node holds a color (RED / WHITE / null), initially null.
    3. An empty collection committingTxs (Set<GridCacheVersion>) whose goal is to track COMMITTING+ transactions that aren't part of IgniteTxManager#activeTx. It automatically shrinks after a transaction commits.
  2. After some time, Ignite nodes might have a non-empty committingTxs.
  3. An Ignite node initiates a global snapshot by starting DistributedProcess (by discovery IO):
    1. switches the color (null → WHITE → RED → WHITE → ...)
    2. creates a new timestamp and checks that it is greater than the previous one (incremental snapshot).
    3. prepares a marker message that contains the color and the timestamp, and transmits this message to other nodes.
  4. Every node starts a local snapshot process after receiving the marker message (whether by discovery, or by communication within a transaction message):
    1. Atomically: updates the local color, disables auto-shrink of committingTxs, prepares the ConsistentCut future, and prepares two empty collections - before [sent - received] and after [exclude] the cut.
    2. Writes a snapshot record to WAL with the received timestamp (commits LocalState).
    3. Collects active transactions: the concatenation of IgniteTxManager#activeTx and committingTxs.
  5. While the global Consistent Cut is running, every node signs outgoing transaction messages:
    1. The marker message.
    2. Finish messages are signed on the node that commits first (the near node for 2PC, backup or primary for 1PC) with the color, to notify other nodes which side of the cut the transaction belongs to.
  6. Other nodes, on receiving a message signed with the new color, start a local snapshot themselves (see the steps in point 4).
  7. For every collected active transaction, the node waits for the Finish message to extract the color and fill the before and after collections:
    1. if the received color is null or differs from the local one, the transaction is on the before side;
    2. if the received color equals the local one, the transaction is on the after side.
  8. After all transactions finish, every node:
    1. writes a WAL record with ChannelState (before, after);
    2. clears committingTxs and enables auto-shrink again;
    3. completes the ConsistentCut future and notifies the initiator node about finishing the local procedure (via the DistributedProcess protocol).
  9. After all nodes finish the Consistent Cut, every node stops signing outgoing transaction messages.
  10. Every node now holds the updated non-null color (RED for this iteration); the next snapshot iteration will start by switching the color to WHITE.
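
A sketch of that single-variable coordination, using the simplified names from the steps above (color, committingTxs); the stubs stand in for the real node-local operations, and this is an illustration rather than Ignite's implementation:

import java.util.concurrent.atomic.AtomicReference;

/** Illustrative local snapshot trigger driven by a single atomic color. */
class ConsistentCutTrigger {
    enum Color { WHITE, RED }

    /** The single volatile variable the lock is reduced to; null initially. */
    private final AtomicReference<Color> color = new AtomicReference<>();

    /**
     * Invoked both for the marker message and for signed transaction messages:
     * whichever arrives first starts the local snapshot, and the atomic swap
     * guarantees it starts exactly once per cut. (The real algorithm performs
     * the whole of step 4.1 atomically; this sketch only swaps the color.)
     */
    void onColorObserved(Color observed, long timestamp) {
        Color prev = color.getAndSet(observed);

        if (prev == observed)
            return; // Local snapshot for this cut has already started.

        disableCommittingTxsAutoShrink();  // step 4.1
        prepareConsistentCutFuture();      // step 4.1
        writeSnapshotWalRecord(timestamp); // step 4.2: commits LocalState
        collectActiveAndCommittingTxs();   // step 4.3
    }

    private void disableCommittingTxsAutoShrink() { /* ... */ }
    private void prepareConsistentCutFuture() { /* ... */ }
    private void writeSnapshotWalRecord(long timestamp) { /* ... */ }
    private void collectActiveAndCommittingTxs() { /* ... */ }
}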

Use node-local xid (GridCacheVersion) as snapshot version threshold

To avoid using the marker message (the color field), we can try to rely on a fixed GridCacheVersion. The algorithm is as follows:

  1. Initial state:
    1. Ignite nodes started from snapshot, or other consistent state (after graceful cluster stop / deactivation).
  2. An Ignite node initiates a global snapshot by starting DistributedProcess (by discovery IO).
  3. Every node (including client and non-baseline nodes) starts a local snapshot process after receiving a message from DistributedProcess.
  4. Phase 1: 
    1. Fix snpVersion = GridCacheVersion#next(topVer).
    2. Collect all active transactions originated by the near node whose nearXidVersion is less than snpVersion.
    3. Note: keep collecting transactions that are less than snpVersion and for which the local node is near (to exclude them later).
    4. After all collected transactions finish, notify the other Ignite nodes of snpVersion (via the DistributedProcess protocol).
  5. After all nodes finish the first phase, each of them has received a Map<UUID, GridCacheVersion> from the other nodes.
  6. Phase 2: only server baseline nodes continue working here:
    1. Collect all active transactions, find their near nodes (by GridCacheVersion#nodeOrderId), and filter them against the known GridCacheVersions.
    2. Await completion of all such transactions.
    3. Write a WAL record with the received map.
  7. Phase 3: 
    1. Stop collecting near transactions that are less than the local snpVersion and send them to the other nodes.
    2. On receiving such a map, write a new WAL record that contains the additional skip collection.
  8. After Phase 3 completes, the snapshot process is finished (see the filter sketch below).
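
A sketch of the version-threshold rule from Phases 1 and 2, with GridCacheVersion simplified to a long that orders the same way; the class and method names are illustrative:

import java.util.Map;
import java.util.UUID;

/** Illustrative filter: a tx belongs to the snapshot iff its near version is below its near node's threshold. */
class SnapshotVersionFilter {
    /** snpVersion fixed by every near node in Phase 1 (the map all nodes receive). */
    private final Map<UUID, Long> snpVersions;

    SnapshotVersionFilter(Map<UUID, Long> snpVersions) {
        this.snpVersions = snpVersions;
    }

    /** @return true if the transaction is part of the snapshot and must be awaited in Phase 2. */
    boolean includedInSnapshot(UUID nearNodeId, long nearXidVersion) {
        Long threshold = snpVersions.get(nearNodeId);

        // Unknown near node (e.g. a short-lived client that already left):
        // the decision cannot be made here - one of the disadvantages below.
        return threshold != null && nearXidVersion < threshold;
    }
}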

Restoring process:

  1. Find the WAL records from Phases 2 and 3: the map of GridCacheVersions used to filter transactions, and the additional transaction xids to exclude (from Phase 3).
  2. Apply all records, with those filters, up to the record from Phase 2 (see the sketch below).
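
A sketch of that replay loop, reusing the SnapshotVersionFilter above plus the Phase 3 exclude set; WalRecord is a hypothetical stand-in for the real WAL record types:

import java.util.Iterator;
import java.util.Set;
import java.util.UUID;

/** Illustrative restore loop: replay filtered records up to the Phase 2 cut record. */
class SnapshotRestorer {
    /** Hypothetical view of a WAL record; real Ignite types differ. */
    interface WalRecord {
        boolean isPhase2CutRecord();
        boolean isTxRecord();
        UUID nearNodeId();
        long nearXidVersion();
    }

    private final SnapshotVersionFilter filter; // built from the Phase 2 record
    private final Set<Long> phase3Excluded;     // xids from the Phase 3 record

    SnapshotRestorer(SnapshotVersionFilter filter, Set<Long> phase3Excluded) {
        this.filter = filter;
        this.phase3Excluded = phase3Excluded;
    }

    void restore(Iterator<WalRecord> wal) {
        while (wal.hasNext()) {
            WalRecord rec = wal.next();

            if (rec.isPhase2CutRecord())
                break; // The cut point: nothing after it belongs to the snapshot.

            if (rec.isTxRecord()
                && filter.includedInSnapshot(rec.nearNodeId(), rec.nearXidVersion())
                && !phase3Excluded.contains(rec.nearXidVersion()))
                applyToStorage(rec);
        }
    }

    private void applyToStorage(WalRecord rec) { /* apply the tx changes */ }
}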

Disadvantages:

  1. Increments of GridCacheVersion are CAS operations from different threads, but the version is assigned to a transaction in a non-atomic way. There is no guarantee that snpVersion is greater than the version of every transaction created after snpVersion was fixed (see the sketch after this list). Ignite would have to track such transactions:
    1. with fair locking while creating a transaction and assigning its version - possible performance degradation;
    2. with an additional filter after preparing the snapshot (step 4.4) - making it 3 steps to prepare a snapshot.
  2. For the OPTIMISTIC + PRIMARY_SYNC case we can miss a backup transaction - it looks like we also need to restore by PREPARED messages in WAL? TBD (or handle it during restore?)
  3. Client and non-baseline nodes have a job to do: collecting transactions, awaiting their completion, sending a response. This could be unreliable, as client nodes can be short-lived:
    1. Special cases must also be handled where a transaction commits after the client node is gone and there is no info about its actual version.
  4. There is no safe previous record to restore from if several incremental snapshots have been created; the whole history has to be filtered.
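
A sketch of the race from item 1, with the version counter simplified to an AtomicLong; the interleaving in the comments shows why snpVersion is not a strict upper bound for transactions created before it was fixed:

import java.util.concurrent.atomic.AtomicLong;

/** Illustration of the non-atomic version assignment race (item 1 above). */
class VersionAssignmentRace {
    static final AtomicLong versionCounter = new AtomicLong();

    /** Simplified tx start: the increment is atomic, the registration is not. */
    static void startTx() {
        long ver = versionCounter.incrementAndGet(); // CAS increment, say ver = 10

        // The thread may be preempted here. Meanwhile the snapshot fixes
        // snpVersion = versionCounter.incrementAndGet() = 11 and collects the
        // currently visible transactions - this one is not registered yet.

        registerTx(ver); // A tx with ver = 10 < snpVersion becomes visible only
                         // now, after collection has finished, and is missed.
    }

    static void registerTx(long ver) { /* make the tx visible to activeTx */ }
}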

Consistent and inconsistent Cuts

...

  1. any errors that appear while processing the local Cut;
  2. a transaction is recovered with the transaction recovery protocol (tx.finalizationStatus == RECOVERY_FINISH);
  3. a transaction finished in the UNKNOWN state;
  4. a baseline topology change: Ignite nodes finish the local Cuts running at that moment, making them inconsistent (see the sketch below).
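
The conditions above, combined into a single predicate; all names here are illustrative except RECOVERY_FINISH, which mirrors the finalization status checked in item 2:

/** Illustrative predicate over the inconsistency conditions listed above. */
class CutValidator {
    enum FinalizationStatus { NONE, USER_FINISH, RECOVERY_FINISH }
    enum TxState { COMMITTED, ROLLED_BACK, UNKNOWN }

    boolean cutInconsistent(boolean localCutFailed,
                            FinalizationStatus finalization,
                            TxState finalTxState,
                            boolean baselineChangedDuringCut) {
        return localCutFailed                                     // 1. errors during the local Cut
            || finalization == FinalizationStatus.RECOVERY_FINISH // 2. tx recovery protocol
            || finalTxState == TxState.UNKNOWN                    // 3. UNKNOWN final state
            || baselineChangedDuringCut;                          // 4. baseline topology change
    }
}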

...