...

Code Block
languagebash
# Proposed directory structure
$ ls $IGNITE_HOME
db/
snapshots/
|-- SNP/
|---- db/
|---- increments/
|------ 0000000000000001/
|-------- meta.smf
|-------- binary_meta/
|-------- marshaller/
|-------- wal/
|---------- 0000000000000000.wal.zip

Restore process

Code Block
languagebash
# Restore the cluster from a specific incremental snapshot
$ control.sh --snapshot restore SNP --incremental 1
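
The same restore could also be invoked programmatically. The sketch below is an assumption, not part of this proposal: it supposes an overload of IgniteSnapshot#restoreSnapshot() that accepts an increment index, and the configuration path and cache group name are chosen only for illustration.

Code Block
languagejava
import java.util.Collection;
import java.util.Collections;
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;

public class RestoreIncrementExample {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start("config/ignite-config.xml")) {
            // Cache groups to restore; null would mean "all groups in the snapshot".
            Collection<String> groups = Collections.singletonList("SQL_PUBLIC_T");

            // Hypothetical API: restore the base snapshot SNP and apply increment #1,
            // mirroring `control.sh --snapshot restore SNP --incremental 1`.
            ignite.snapshot().restoreSnapshot("SNP", groups, 1).get();
        }
    }
}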

...

  1. The user specifies the incremental snapshot name.
  2. Ignite parses the snapshot name and resolves the base snapshot and the requested increment.
  3. In addition to the full snapshot checks (already present in SnapshotRestoreProcess), it validates the incremental snapshots:
    1. Checks that all WAL segments are present (from the ClusterSnapshotRecord to the requested ConsistentCutFinishRecord).
  4. After the full snapshot restore phases (prepare, preload, cacheStart) have finished, it starts another DistributedProcess, `walRecoveryProc`:
    1. Every node applies WAL segments written since the base snapshot until it reaches the requested ConsistentCutFinishRecord.
    2. Ignite should forbid concurrent operations (both read and write) on the restored cache groups during WAL recovery.
    3. TBD: just notify the user about it? Set a barrier for operations? Use a partition state other than OWNING?
    4. Applying data for the snapshot cache groups (from the base snapshot) is similar to the GridCacheDatabaseSharedManager logical restore (see the sketch after this list):
      1. disable WAL for the restored cache groups;
      2. find the `ClusterSnapshotRecord` related to the base snapshot;
      3. start applying WAL updates with the striped executor, keyed by (cacheGrpId, partId), filtering entries by the versions listed in the ConsistentCutFinishRecord;
      4. enable WAL for the restored cache groups;
      5. force a checkpoint and check the restore state (checkpoint status, etc.).
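
To make the version filter applied during WAL replay (step 4 above) more concrete, below is a self-contained toy sketch: only updates whose transaction version is included in the ConsistentCutFinishRecord are replayed. WalUpdate and the plain Set of versions are simplified stand-ins chosen for illustration, not Ignite classes.

Code Block
languagejava
import java.util.List;
import java.util.Set;

public class ConsistentCutFilterSketch {
    /** Simplified stand-in for a WAL data record; not an Ignite class. */
    record WalUpdate(long txVersion, int cacheGrpId, int partId, String op) {}

    /** Replays only the updates committed before the consistent cut. */
    static void replay(List<WalUpdate> walSinceBaseSnapshot, Set<Long> versionsBeforeCut) {
        for (WalUpdate u : walSinceBaseSnapshot) {
            if (!versionsBeforeCut.contains(u.txVersion()))
                continue; // the transaction is after the cut - skip it

            // In the real process the update would be dispatched to the striped
            // executor keyed by (cacheGrpId, partId) and applied to page memory.
            System.out.printf("apply grp=%d part=%d %s%n", u.cacheGrpId(), u.partId(), u.op());
        }
    }

    public static void main(String[] args) {
        replay(
            List.of(
                new WalUpdate(1L, 100, 0, "put k1=v1"),
                new WalUpdate(7L, 100, 1, "put k2=v2")), // tx 7 is after the cut
            Set.of(1L));
    }
}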

...

For Atomic caches it is required to restore data consistency between primary and backup nodes differently, using the ReadRepair feature. Consistent Cut relies on the transaction protocol messages (Prepare, Finish), while the atomic cache protocol does not have enough messages to synchronize the different nodes.

TBD: the restore process should suggest that the user perform additional steps if ATOMIC caches are restored:

  1. Check the partitions state with the `idle_verify` command;
  2. Start read-repair for inconsistent keys in lazy mode: repair is triggered by user get() operations on the broken cache keys (see the sketch below).
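
For step 2, the lazy repair maps to the existing ReadRepair API: reads through a read-repair cache proxy verify the key on all owners and fix divergent copies. A minimal sketch, assuming Ignite 2.13+ where withReadRepair(ReadRepairStrategy) is available; the cache name and the LWW strategy are chosen only for illustration.

Code Block
languagejava
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.ReadRepairStrategy;

public class LazyReadRepairExample {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start("config/ignite-config.xml")) {
            // "atomic-cache" is an assumed cache name used only for illustration.
            IgniteCache<Integer, String> cache = ignite.cache("atomic-cache");

            // Reads through this proxy compare primary and backup copies and
            // repair inconsistencies using the last-write-wins strategy.
            IgniteCache<Integer, String> repairable = cache.withReadRepair(ReadRepairStrategy.LWW);

            // Each get() lazily repairs the requested key if the copies diverge.
            System.out.println("value after repair: " + repairable.get(42));
        }
    }
}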

...