...
- User specifies the incremental snapshot name.
- Ignite parses the snapshot name and extracts the base and incremental snapshot names.
- In addition to the full snapshot check (which already exists in SnapshotRestoreProcess), it checks the incremental snapshots:
- For every incremental snapshot: extract the segment list from the meta file and check that the WAL segments are present.
- Order incremental snapshots by ConsistentCutVersion (from the meta file) and check that there are no gaps in the WAL segments since the base snapshot.
- On the reducer, check that all nodes report the same ConsistentCutVersion for the specified incremental snapshot.
- After the full snapshot restore processes (prepare, preload, cacheStart) have finished, it starts another DistributedProcess - `walRecoveryProc` (see the sketch after this list):
- The reducer sends the common ConsistentCutVersion to all nodes;
- Every node applies WAL segments since the base snapshot until it reaches the ConsistentCutFinishRecord for the specified ConsistentCutVersion.
- Ignite should forbid concurrent operations (both read and write) on restored cache groups during WAL recovery.
- TBD: Just notify the user about it? Set a barrier for operations? Use a partition state other than OWNING?
- The process of applying data to the snapshot cache groups (from the base snapshot) is similar to the GridCacheDatabaseSharedManager logical restore:
- disable WAL for the specified cache groups
- find the `ClusterSnapshotRecord` related to the base snapshot
- start applying WAL updates discretely (cut by cut) with the striped executor, keyed by (cacheGrpId, partId)
- enable WAL for the restored cache groups
- force a checkpoint and check the restore state (checkpoint status, etc.)
- update the local ConsistentCutVersion (in ConsistentCutManager) with the restored one.
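A minimal sketch of the per-node `walRecoveryProc` loop described above. The record types below are local stand-ins for the ones this IEP proposes (`ClusterSnapshotRecord`, `ConsistentCutFinishRecord`, `ConsistentCutVersion`), not the real Ignite classes, and the striped-executor submission is elided:

```java
/** Sketch of the per-node WAL recovery loop (walRecoveryProc). */
public class WalRecoverySketch {
    /** Stand-in for the IEP's ConsistentCutVersion (assumed to be a monotonic counter). */
    record ConsistentCutVersion(long ver) {}

    /** Stand-ins for the WAL record types referenced above. */
    sealed interface WalRecord permits ClusterSnapshotRecord, ConsistentCutFinishRecord, DataRecord {}
    record ClusterSnapshotRecord(String snpName) implements WalRecord {}
    record ConsistentCutFinishRecord(ConsistentCutVersion cutVer) implements WalRecord {}
    record DataRecord(int cacheGrpId, int partId) implements WalRecord {}

    /**
     * Applies updates between the base snapshot marker and the cut version
     * agreed by the reducer, then stops.
     */
    static void recover(Iterable<WalRecord> wal, String baseSnp, ConsistentCutVersion target) {
        boolean inRange = false;

        for (WalRecord rec : wal) {
            if (!inRange) {
                // Skip everything written before the base snapshot marker.
                inRange = rec instanceof ClusterSnapshotRecord s && s.snpName().equals(baseSnp);
                continue;
            }

            // Reached the agreed cut: recovery on this node is complete.
            if (rec instanceof ConsistentCutFinishRecord f && f.cutVer().ver() == target.ver())
                return;

            // Real code would submit this to the striped executor keyed by (cacheGrpId, partId).
            if (rec instanceof DataRecord d)
                applyUpdate(d);
        }

        throw new IllegalStateException("WAL ended before reaching the cut " + target);
    }

    static void applyUpdate(DataRecord rec) { /* apply entry updates to the partition */ }
}
```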
...
- Command that creates an incremental snapshot based on the specified base (full) snapshot (see the usage sketch below).
Limitations:
- IS creation fails if the cache schema has changed since the base snapshot. Schemas are restored from the full snapshot, while an incremental snapshot (IS) restores only data changes.
- Compare the base snapshot's cache_data.dat with the current cache info; fail if it has changed.
- IS creation fails if a baseline node was rebalanced since the base snapshot.
- Check the rebalance fact for every cache group with `RebalanceFuture#isInitial()` on node start – it is null if the joining node doesn't need to be rebalanced.
- This fact should be written to the MetaStorage and checked before an incremental snapshot is created (by analogy with GridCacheDatabaseSharedManager#isCheckpointInapplicableForWalRebalance).
- IS creation fails if the baseline topology has changed since the base snapshot.
- The baseline topology is checked relative to the base snapshot.
- IS creation fails if the user tries to create it after restoring the cluster from a previously created incremental snapshot.
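To make the intended usage concrete, here is a minimal sketch against an `IgniteSnapshot`-style API. `createIncrementalSnapshot` is the command this section proposes, so the method name and exact signature are assumptions of this sketch, not a finalized API:

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteSnapshot;
import org.apache.ignite.Ignition;

public class CreateIncrementalSnapshotExample {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start("ignite-config.xml");

        IgniteSnapshot snp = ignite.snapshot();

        // Existing command: create the full (base) snapshot.
        snp.createSnapshot("base").get();

        // Proposed command (hypothetical signature): create an increment on top
        // of "base". Fails on the schema / rebalance / baseline-topology
        // conditions listed above.
        snp.createIncrementalSnapshot("base").get();
    }
}
```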
- Command that restores the specified incremental snapshot (see the usage sketch below).
Limitations:
- Restoring on a different topology is not allowed.
- IS guarantees consistency for Transactional caches only. Ignite should write a WARN to the log suggesting to run the idle_verify check for Atomic caches and restore them with ReadRepair if needed.
- Does not protect cache groups from concurrent operations (both read and write); it only WARNs in the log that the restored cache groups MUST be idle until the operation finishes.
- Snapshot SystemView contains info about incremental snapshots.
- Log messages with the metrics info.
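Correspondingly, a restore call might look as follows. The increment-index parameter of `restoreSnapshot` is an assumption of this sketch, mirroring the existing full-snapshot restore API:

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;

public class RestoreIncrementalSnapshotExample {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start("ignite-config.xml");

        // Proposed command (hypothetical signature): restore the base snapshot
        // plus its first increment (null = all snapshot cache groups). The
        // cluster must have the same topology as when the snapshot was taken,
        // and the restored cache groups must stay idle until the restore finishes.
        ignite.snapshot().restoreSnapshot("base", null, 1).get();
    }
}
```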
...
- Restore of an incremental snapshot should be able to overcome WAL inconsistency caused by rebalance.
- Improve the transaction recovery mechanism: recovery messages are now packed with the ConsistentCutVersion, if it was set (see the sketch below).
- Strictly forbid concurrent operations while restoring.
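A rough illustration of the recovery-message change; `TxRecoveryRequest` and its fields are hypothetical names for this sketch, not the actual Ignite message class:

```java
/**
 * Hypothetical shape of a tx recovery message extended with the cut version,
 * so that nodes recovering a prepared transaction can agree on which side of
 * the consistent cut it belongs to.
 */
record TxRecoveryRequest(
    long nearXidVersion,          // id of the transaction being recovered
    ConsistentCutVersion cutVer   // null if no consistent cut was set
) {}

/** Stand-in for the IEP's ConsistentCutVersion. */
record ConsistentCutVersion(long ver) {}
```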
Phase 3
- Restore of an incremental snapshot should handle inconsistency of Atomic caches.
...
{"serverDuration": 168, "requestCorrelationId": "d152d6b1cfdd7303"}