...
# Proposed directory structure
$ ls $IGNITE_HOME
db/
snapshots/
|-- SNP/
|---- db/
|---- increments/
|------ SNP_16409844000000000000000001/
|-------- meta.smf
|-------- wal/
|---------- 0000000000000000.wal.zip
|------ SNP_1640985000.../
|-------- meta.smf
|-------- binary_meta/
|-------- marshaller/
|-------- wal/
|---------- 0000000000000000.wal.zip
Restore process
# Restore cluster on specific incremental snapshot
$ control.sh --snapshot restore SNP --incremental 1
...
- The user specifies the incremental snapshot name.
- Ignite parses the snapshot name and extracts the base and incremental snapshots.
- In addition to the full snapshot check (which already exists in SnapshotRestoreProcess), it validates the incremental snapshots:
- Checks that all WAL segments are present (from the ClusterSnapshotRecord up to the requested ConsistentCutFinishRecord).
- After the full snapshot restore phases (prepare, preload, cacheStart) have finished, it starts another DistributedProcess, `walRecoveryProc`:
- Every node applies WAL segments since the base snapshot until it reaches the requested ConsistentCutFinishRecord.
- Ignite should forbid concurrent operations (both reads and writes) on the restored cache groups during WAL recovery.
- TBD: just notify the user about it? Set a barrier for operations? Use a partition state other than OWNING?
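The WAL recovery loop above can be sketched in simplified form. All types below (`WalRecord`, `DataRecord`, the shape of `ConsistentCutFinishRecord`, and the `replay` method) are hypothetical stand-ins for Ignite's internal WAL record classes, used only to illustrate stopping at the requested cut and filtering out transaction versions excluded by it:

```java
import java.util.*;

public class WalRecoverySketch {
    interface WalRecord {}

    // A data update tagged with the version of the transaction that produced it.
    record DataRecord(long txVersion, String key, String value) implements WalRecord {}

    // Marks the consistent cut; 'after' holds tx versions excluded from the cut.
    record ConsistentCutFinishRecord(Set<Long> after) implements WalRecord {}

    /** Replays WAL updates up to the requested cut, skipping excluded versions. */
    static Map<String, String> replay(List<WalRecord> wal) {
        Map<String, String> store = new HashMap<>();
        List<DataRecord> buffered = new ArrayList<>();

        for (WalRecord rec : wal) {
            if (rec instanceof DataRecord d)
                buffered.add(d);
            else if (rec instanceof ConsistentCutFinishRecord cut) {
                // Apply only updates whose transaction is included into the cut.
                for (DataRecord d : buffered)
                    if (!cut.after().contains(d.txVersion()))
                        store.put(d.key(), d.value());

                break; // Recovery stops once the requested cut is reached.
            }
        }

        return store;
    }

    public static void main(String[] args) {
        List<WalRecord> wal = List.of(
            new DataRecord(1, "k1", "v1"),
            new DataRecord(2, "k2", "v2"),              // tx 2 is after the cut
            new ConsistentCutFinishRecord(Set.of(2L)),
            new DataRecord(3, "k3", "v3"));             // beyond the cut: ignored

        System.out.println(replay(wal)); // prints {k1=v1}
    }
}
```

The real recovery applies updates while streaming WAL segments instead of buffering them; buffering just keeps the sketch short.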
The process of applying data to the snapshot cache groups (from the base snapshot) is similar to the GridCacheDatabaseSharedManager logical restore:
- Disable WAL for the specified cache groups.
- Find the `ClusterSnapshotRecord` related to the base snapshot.
- Start applying WAL updates with the striped executor, striped by (cacheGrpId, partId). Filter updates by the versions listed in the ConsistentCutFinishRecord.
- Enable WAL for the restored cache groups.
- Force a checkpoint and verify the restore state (checkpoint status, etc.).
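The striped apply step can be sketched as follows, assuming a hypothetical `Update` record and plain single-threaded executors as stripes: every update of one (cacheGrpId, partId) pair is routed to the same stripe, so updates of a partition apply in WAL order while different partitions proceed in parallel. This is illustrative only, not Ignite's internal StripedExecutor API:

```java
import java.util.*;
import java.util.concurrent.*;

public class StripedApplySketch {
    // Hypothetical shape of a WAL data update.
    record Update(int cacheGrpId, int partId, String key, String val) {}

    static Map<String, String> apply(List<Update> updates, int stripes) throws Exception {
        ExecutorService[] pool = new ExecutorService[stripes];
        for (int i = 0; i < stripes; i++)
            pool[i] = Executors.newSingleThreadExecutor();

        Map<String, String> store = new ConcurrentHashMap<>();

        for (Update u : updates) {
            // Route by (cacheGrpId, partId): one partition always hits one stripe,
            // so its updates are applied in submission (i.e. WAL) order.
            int stripe = Math.floorMod(Objects.hash(u.cacheGrpId(), u.partId()), stripes);

            pool[stripe].submit(() -> store.put(u.key(), u.val()));
        }

        for (ExecutorService e : pool) {
            e.shutdown();
            e.awaitTermination(10, TimeUnit.SECONDS);
        }

        return store;
    }

    public static void main(String[] args) throws Exception {
        List<Update> wal = List.of(
            new Update(1, 0, "a", "v1"),
            new Update(1, 0, "a", "v2"), // same partition: applied strictly after v1
            new Update(1, 1, "b", "v1"));

        System.out.println(apply(wal, 4).get("a")); // prints v2
    }
}
```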
...
For atomic caches it's required to restore data consistency (between primary and backup nodes) differently, with the ReadRepair feature: Consistent Cut relies on transaction protocol messages (Prepare, Finish), and the atomic cache protocol doesn't have enough messages to synchronize nodes.
TBD: the restore process should suggest that the user perform additional steps if ATOMIC caches are restored:
- Check partition states with the `idle_verify` command;
- Start read-repair for inconsistent keys in lazy mode: on user get() operations that touch broken cache keys.
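The lazy repair idea can be modeled with a toy sketch, assuming replica copies are plain maps with index 0 as the primary (illustrative only; in Ignite this would go through the ReadRepair feature rather than hand-written code):

```java
import java.util.*;

public class LazyReadRepairSketch {
    /**
     * One copy of the data per node; index 0 is the primary. On a get(), backup
     * copies that disagree with the primary are repaired from the primary value.
     */
    static String readWithRepair(List<Map<String, String>> nodes, String key) {
        String primary = nodes.get(0).get(key);

        for (int i = 1; i < nodes.size(); i++) {
            Map<String, String> backup = nodes.get(i);

            if (!Objects.equals(primary, backup.get(key)))
                backup.put(key, primary); // repair the inconsistent backup copy
        }

        return primary;
    }

    public static void main(String[] args) {
        Map<String, String> primary = new HashMap<>(Map.of("k", "v2"));
        Map<String, String> backup  = new HashMap<>(Map.of("k", "v1")); // stale copy

        String v = readWithRepair(List.of(primary, backup), "k");

        System.out.println(v);               // prints v2
        System.out.println(backup.get("k")); // prints v2 (repaired on read)
    }
}
```

A real read-repair also has to pick a winning value (e.g. by version or majority) rather than blindly trusting the primary; the sketch sidesteps that choice for brevity.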
...