...

  1. User specifies the full snapshot name.
  2. Ignite parses the snapshot name and extracts the base and incremental snapshots.
  3. In addition to the full snapshot check (which already exists in SnapshotRestoreProcess), it validates the incremental snapshots:
    1. Checks that all WAL segments are present (from ClusterSnapshotRecord to the requested IncrementalSnapshotFinishRecord).
  4. After the full snapshot restore phases (prepare, preload, cacheStart) have finished, it starts another DistributedProcess - `walRecoveryProc`:
    1. Every node applies WAL segments since the base snapshot until it reaches the requested IncrementalSnapshotFinishRecord.
    2. Ignite should forbid concurrent operations (both reads and writes) on the restored cache groups during WAL recovery.
    3. The process of applying data to the snapshot cache groups (from the base snapshot) is similar to the GridCacheDatabaseSharedManager logical restore (see the sketch after this list):
      1. disable WAL for the specified cache group
      2. find the `ClusterSnapshotRecord` related to the base snapshot
      3. start applying WAL updates with a striped executor, striped by (cacheGrpId, partId), applying the version filter from ConsistentCutFinishRecord
      4. enable WAL for the restored cache groups
      5. force a checkpoint and check the restore state (checkpoint status, etc.)
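
Below is a minimal sketch of the WAL replay step, assuming Ignite's standalone WAL iterator (IgniteWalIteratorFactory) as the reading mechanism; the finish-record check and the applyDataEntry helper are hypothetical placeholders, and the real logic runs inside walRecoveryProc with the version filter taken from ConsistentCutFinishRecord.

Code Block
languagejava
import java.io.File;

import org.apache.ignite.internal.pagemem.wal.WALIterator;
import org.apache.ignite.internal.pagemem.wal.record.DataEntry;
import org.apache.ignite.internal.pagemem.wal.record.DataRecord;
import org.apache.ignite.internal.pagemem.wal.record.WALRecord;
import org.apache.ignite.internal.processors.cache.persistence.wal.reader.IgniteWalIteratorFactory;
import org.apache.ignite.internal.processors.cache.persistence.wal.reader.IgniteWalIteratorFactory.IteratorParametersBuilder;

/** Illustrative replay of WAL data records up to the requested finish record. */
public class WalReplaySketch {
    public static void replay(File walDir) throws Exception {
        IgniteWalIteratorFactory factory = new IgniteWalIteratorFactory();

        IteratorParametersBuilder params = new IteratorParametersBuilder().filesOrDirs(walDir);

        try (WALIterator it = factory.iterator(params)) {
            while (it.hasNext()) {
                WALRecord rec = it.next().get2();

                // Hypothetical check: stop once the requested
                // IncrementalSnapshotFinishRecord is reached.
                if (isRequestedFinishRecord(rec))
                    break;

                if (rec instanceof DataRecord) {
                    for (DataEntry e : ((DataRecord)rec).writeEntries()) {
                        // Hypothetical apply: filter entry versions against
                        // ConsistentCutFinishRecord, then write the update to
                        // the restored cache group's partition.
                        applyDataEntry(e);
                    }
                }
            }
        }
    }

    private static boolean isRequestedFinishRecord(WALRecord rec) {
        return false; // Placeholder for the marker comparison.
    }

    private static void applyDataEntry(DataEntry e) {
        // Placeholder for the actual partition update.
    }
}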

Checking snapshot

Code Block
languagebash
# Check a specific incremental snapshot
$ control.sh --snapshot check SNP --increment 1

The control.sh --snapshot check command performs the following steps on every baseline node:

  1. Check that the snapshot files are consistent:
    1. The snapshot structure is valid and the metadata matches the actual snapshot files.
    2. All WAL segments are present (from ClusterSnapshotRecord to the requested IncrementalSnapshotFinishRecord).
  2. Check the incremental snapshot data integrity:
    1. It parses WAL segments from the first incremental snapshot up to the specified one (set with the --increment param).
    2. For every partition it calculates hashes of entries and of entry versions.
      1. On the reduce phase it compares partition hashes between primary and backup copies.
    3. For every pair of nodes that participated as primary nodes, it calculates a hash of the committed transactions (see the sketch after this list). For example:
      1. There are two transactions:
        1. TX1, with 2 nodes participating in it as primary nodes: A and B
        2. TX2, with 2 nodes: A and C
      2. On node A it prepares 2 collections: TxHashAB = [hash(TX1)], TxHashAC = [hash(TX2)]
      3. On node B it prepares 1 collection: TxHashBA = [hash(TX1)]
      4. On node C it prepares 1 collection: TxHashCA = [hash(TX2)]
      5. On the reduce phase of the check it compares collections from all nodes and expects that:
        1. TxHashAB equals TxHashBA
        2. TxHashAC equals TxHashCA
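
The pairwise comparison can be sketched as a simple map/reduce, shown below; the Tx stub and the mapPhase/reducePhase names are illustrative assumptions, not the actual Ignite implementation.

Code Block
languagejava
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative map/reduce check of committed transactions between pairs of primary nodes. */
public class TxHashCheckSketch {
    /** Minimal transaction stub: id plus the primary nodes participating in it. */
    static class Tx {
        final String id;
        final List<String> primaries;

        Tx(String id, List<String> primaries) {
            this.id = id;
            this.primaries = primaries;
        }

        int hash() {
            return id.hashCode();
        }
    }

    /** Map phase, run on every node: group local tx hashes by peer primary node. */
    static Map<String, List<Integer>> mapPhase(String localNode, List<Tx> committedTxs) {
        Map<String, List<Integer>> txHashByPeer = new HashMap<>();

        for (Tx tx : committedTxs) {
            for (String peer : tx.primaries) {
                if (!peer.equals(localNode))
                    txHashByPeer.computeIfAbsent(peer, p -> new ArrayList<>()).add(tx.hash());
            }
        }

        return txHashByPeer; // E.g. on node A: {B=[hash(TX1)], C=[hash(TX2)]}.
    }

    /** Reduce phase: the collection node X built about Y must match the one Y built about X. */
    static boolean reducePhase(Map<String, Map<String, List<Integer>>> collected) {
        for (Map.Entry<String, Map<String, List<Integer>>> node : collected.entrySet()) {
            for (Map.Entry<String, List<Integer>> peer : node.getValue().entrySet()) {
                List<Integer> direct = new ArrayList<>(peer.getValue());
                List<Integer> mirror = new ArrayList<>(collected
                    .getOrDefault(peer.getKey(), Map.of())
                    .getOrDefault(node.getKey(), List.of()));

                // Compare as multisets: commit order may differ between nodes.
                direct.sort(null);
                mirror.sort(null);

                if (!direct.equals(mirror))
                    return false; // TxHashXY != TxHashYX: snapshot is inconsistent.
            }
        }

        return true;
    }
}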

Note that an incremental snapshot check doesn't verify the data of the related full snapshot. A full check of a snapshot therefore consists of two steps (example commands below):

  1. Check full snapshot
  2. Check incremental snapshot
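
For example, a complete verification of a snapshot named SNP with a single increment might look like this:

Code Block
languagebash
# Step 1: check the base full snapshot
$ control.sh --snapshot check SNP

# Step 2: check the incremental snapshot on top of it
$ control.sh --snapshot check SNP --increment 1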

Atomic caches

For Atomic caches it's required to restore data consistency (between primary and backup nodes) differently, with the ReadRepair feature. Consistent Cut relies on the transaction protocol's messages (Prepare, Finish), while the Atomic caches protocol doesn't have enough messages to synchronize different nodes.
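
For illustration, a read that repairs primary/backup divergence through the public Read Repair API might look as follows; the cache name and the LWW strategy are assumptions, and the call shape assumes a recent Ignite version where withReadRepair accepts a ReadRepairStrategy.

Code Block
languagejava
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.ReadRepairStrategy;

public class AtomicReadRepairSketch {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            // Hypothetical cache name; assumed to be an ATOMIC cache restored from a snapshot.
            IgniteCache<Integer, String> cache = ignite.getOrCreateCache("atomicCache");

            // Reading through the Read Repair proxy compares the value across primary
            // and backup copies and repairs divergent copies per the chosen strategy
            // (here: last-write-wins).
            String val = cache.withReadRepair(ReadRepairStrategy.LWW).get(42);

            System.out.println("Repaired read: " + val);
        }
    }
}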

...