...

  1. Performs checks (a sketch of these checks follows this list):
    1. The base snapshot exists (at least its metafile), and metafiles exist for all previous incremental snapshots.
    2. Validates that there are no missing WAL segments since the previous snapshot. SnapshotMetadata should contain info about the last WAL segment that contains snapshot data:
      1. If the snapshot is full: ClusterSnapshotRecord must be written with rolloverType=NEXT_SEGMENT, so the record lands in the segment before the rollover; the segment number of this record is stored within the existing SnapshotMetadata structure.
      2. If the snapshot is incremental: the stored segment number is the segment that contains ConsistentCutFinishRecord.
    3. Checks that the new ConsistentCutVersion is greater than the versions of previous snapshots. Checks that the baseline topology is the same (relative to the base snapshot).
    4. Checks that the WAL is consistent (WAL was not disabled since the previous snapshot); this info is stored in MetaStorage.
  2. Starts a new Consistent Cut.
  3. On finishing the Consistent Cut:
    1. if the cut is consistent: logs ConsistentCutFinishRecord with rolloverType=CURRENT_SEGMENT to enforce archiving the segment right after the record is logged.
    2. if the cut is inconsistent: skips logging ConsistentCutFinishRecord and retries from step 1.
    3. fails if the number of retry attempts is exceeded.
  4. Awaits until the segment with ConsistentCutFinishRecord has been archived and compacted.
  5. Collects WAL segments from the segment following the last segment of the previous snapshot (`prev + 1`) up to the segment that contains ConsistentCutFinishRecord of the current incremental snapshot.
  6. Creates hard links to the compressed segments in the target directory.
  7. Writes meta files with a description of the new incremental snapshot:
    1. meta.smf: 
      1. ConsistentCutVersion and a pointer to ConsistentCutFinishRecord.
      2. WAL segments that participated in this snapshot.
    2. binary_meta and marshaller_data, if they changed since the previous snapshot.
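
Below is a minimal Java sketch of the checks from step 1. The class, method, parameters, and exceptions are hypothetical and only illustrate the intent; this is not the actual Ignite snapshot manager API. The baseline topology check is omitted for brevity.

Code Block
languagejava
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical sketch of the pre-flight checks (step 1) before an incremental snapshot starts. */
public class IncrementalSnapshotChecks {
    /**
     * @param baseMeta            Path to the base snapshot metafile.
     * @param incrementalMetas    Metafiles of the already existing incremental snapshots.
     * @param lastSnapshotSegment Segment number stored in the previous SnapshotMetadata.
     * @param archivedSegments    WAL segment numbers currently available in the archive.
     * @param prevCutVersion      ConsistentCutVersion of the previous snapshot.
     * @param newCutVersion       ConsistentCutVersion assigned to the new snapshot.
     * @param walWasDisabled      Flag read from MetaStorage: was WAL disabled since the previous snapshot?
     */
    public static void validate(Path baseMeta, List<Path> incrementalMetas,
        long lastSnapshotSegment, List<Long> archivedSegments,
        long prevCutVersion, long newCutVersion, boolean walWasDisabled) {
        // 1.1. The base snapshot metafile and all incremental metafiles must exist.
        if (!Files.exists(baseMeta))
            throw new IllegalStateException("Base snapshot metafile is missing: " + baseMeta);

        for (Path meta : incrementalMetas)
            if (!Files.exists(meta))
                throw new IllegalStateException("Incremental snapshot metafile is missing: " + meta);

        // 1.2. No gaps in WAL segments since the segment stored in the previous SnapshotMetadata.
        long expected = lastSnapshotSegment + 1;

        for (long seg : archivedSegments.stream().sorted().toList()) {
            if (seg <= lastSnapshotSegment)
                continue;

            if (seg != expected)
                throw new IllegalStateException("Missing WAL segment: " + expected);

            expected++;
        }

        // 1.3. The new ConsistentCutVersion must be greater than the previous one.
        if (newCutVersion <= prevCutVersion)
            throw new IllegalStateException("ConsistentCutVersion did not grow: " + newCutVersion);

        // 1.4. WAL must not have been disabled since the previous snapshot (flag from MetaStorage).
        if (walWasDisabled)
            throw new IllegalStateException("WAL was disabled since the previous snapshot");
    }
}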

...

Code Block
languagebash
# Restore the cluster from a specific incremental snapshot
$ control.sh --snapshot restore SNP_1640984400 --incremental 1

With the control.sh --snapshot restore command:

  1. The user specifies the incremental snapshot name.
  2. Parses the snapshot name and extracts the base and incremental snapshots.
  3. In addition to the full snapshot check (which already exists in SnapshotRestoreProcess), it checks the incremental snapshots:
    1. For every incremental snapshot: extracts the segment list from the meta file and checks that the WAL segments are present.
    2. Orders the incremental snapshots by ConsistentCutVersion (from the meta file) and checks that there are no missing WAL segments since the base snapshot.
    3. On the reducer it checks that all nodes have the same ConsistentCutVersion for the specified incremental snapshot, and that all WAL segments are present (from ClusterSnapshotRecord to the requested ConsistentCutFinishRecord).
  4. After the full snapshot restore processes (prepare, preload, cacheStart) have finished, the reducer sends the common ConsistentCutVersion to all nodes and starts another DistributedProcess - `walRecoveryProc`:
    1. Every node applies WAL segments since the base snapshot until it reaches the requested ConsistentCutFinishRecord for the specified ConsistentCutVersion.
    2. Ignite should forbid concurrent operations (both read and write) for restored cache groups during WAL recovery.
      1. TBD: just notify the user about it? Set a barrier for operations? Use a partition state other than OWNING?
    3. The process of applying data for the snapshot cache groups (from the base snapshot) is similar to the GridCacheDatabaseSharedManager logical restore (a sketch follows this list):
      1. disable WAL for the specified cache group
      2. find `ClusterSnapshotRecord` related to the base snapshot
      3. start applying WAL updates discretely (cut by cut) with the striped executor keyed by (cacheGrpId, partId); apply the filter for versions from ConsistentCutFinishRecord
      4. enable WAL for the restored cache groups
      5. force a checkpoint, check the restore state (checkpoint status, etc.), and update the local ConsistentCutVersion (in ConsistentCutManager) with the restored one.
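
The per-node WAL recovery loop could look roughly like the sketch below. All types here (WalRecord, ClusterSnapshotRecord, ConsistentCutFinishRecord, DataRecord, WalApplier) are simplified stand-ins for Ignite's internal WAL record classes, and the version filter from ConsistentCutFinishRecord is omitted; the sketch only shows the overall flow: skip to the base snapshot record, apply data records, stop at the requested cut.

Code Block
languagejava
import java.util.Iterator;

/** Simplified sketch of the per-node WAL recovery loop in walRecoveryProc; names are illustrative only. */
public class WalRecoverySketch {
    /** Minimal stand-ins for the WAL record types used below. */
    interface WalRecord { }
    static class ClusterSnapshotRecord implements WalRecord { String snapshotName; }
    static class ConsistentCutFinishRecord implements WalRecord { }
    static class DataRecord implements WalRecord { int cacheGrpId; int partId; }

    /** Callback that applies a single data record (in the real code: a striped executor keyed by (cacheGrpId, partId)). */
    interface WalApplier { void apply(DataRecord rec); }

    /**
     * Applies logical WAL updates between the ClusterSnapshotRecord of the base snapshot
     * and the requested ConsistentCutFinishRecord. WAL for the restored cache groups is assumed
     * to be disabled before this call and re-enabled (followed by a forced checkpoint) after it.
     */
    static void recover(Iterator<WalRecord> walIter, String baseSnapshotName, WalApplier applier) {
        boolean reachedBaseSnapshot = false;

        while (walIter.hasNext()) {
            WalRecord rec = walIter.next();

            // Skip everything before the ClusterSnapshotRecord of the base snapshot.
            if (!reachedBaseSnapshot) {
                reachedBaseSnapshot = rec instanceof ClusterSnapshotRecord
                    && baseSnapshotName.equals(((ClusterSnapshotRecord)rec).snapshotName);

                continue;
            }

            // Stop at the requested cut (the version filter from ConsistentCutFinishRecord is omitted here).
            if (rec instanceof ConsistentCutFinishRecord)
                break;

            // Apply data updates for the restored cache groups.
            if (rec instanceof DataRecord)
                applier.apply((DataRecord)rec);
        }
    }
}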

Atomic caches

For atomic caches it's required to restore data consistency (between primary and backup nodes) differently, with the ReadRepair feature. Consistent Cut relies on the transaction protocol's messages (Prepare, Finish), while the atomic cache protocol doesn't have enough messages to synchronize different nodes.
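
As an illustration, such a repair could be triggered through the public ReadRepair API, for example as below. This is only a hedged sketch: the cache name and configuration path are made up, and the exact withReadRepair signature (with or without a ReadRepairStrategy argument) depends on the Ignite version.

Code Block
languagejava
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.ReadRepairStrategy;

/**
 * Sketch: repairing an atomic cache after restore via the ReadRepair proxy.
 * The cache name and config path are hypothetical.
 */
public class AtomicCacheRepair {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start("config/ignite-client.xml")) {
            IgniteCache<Integer, String> cache = ignite.cache("atomic-cache");

            // Reads through the ReadRepair proxy compare the value on primary and backups
            // and fix inconsistencies according to the chosen strategy (here: last-write-wins).
            IgniteCache<Integer, String> repairCache = cache.withReadRepair(ReadRepairStrategy.LWW);

            for (int key = 0; key < 1_000; key++)
                repairCache.get(key); // Triggers the consistency check and repair for this key.
        }
    }
}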

...