...
- User specifies the incremental snapshot name.
- Ignite parses the snapshot name and extracts the base and incremental snapshot names.
- In addition to the full snapshot check (which already exists in SnapshotRestoreProcess), it checks the incremental snapshots:
- For every incremental snapshot: extract the segment list from the meta file and check that the WAL segments are present.
- Order incremental snapshots by ConsistentCutVersion (from the meta file) and check that there are no gaps in the WAL segments since the base snapshot.
- On the reducer, check that all nodes report the same ConsistentCutVersion for the specified incremental snapshot.
- After the full snapshot restore processes (prepare, preload, cacheStart) have finished, it starts another DistributedProcess - `walRecoveryProc` (see the sketch after this list):
- The reducer sends the common ConsistentCutVersion to all nodes;
- Every node applies WAL segments since the base snapshot until it reaches the ConsistentCutFinishRecord for the specified ConsistentCutVersion.
- Ignite should forbid concurrent operations (both read and write) on restored cache groups during WAL recovery.
- TBD: Just notify the user about it? Set a barrier for operations? Use a partition state other than OWNING?
- The process of applying data to the snapshot cache groups (from the base snapshot) is similar to the GridCacheDatabaseSharedManager logical restore:
- disable WAL for the specified cache groups
- find the `ClusterSnapshotRecord` related to the base snapshot
- start applying WAL updates discretely (cut by cut) with the striped executor, keyed by (cacheGrpId, partId)
- enable WAL for the restored cache groups
- force a checkpoint and check the restore state (checkpoint status, etc.)
- update the local ConsistentCutVersion (in ConsistentCutManager) with the restored one.
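A minimal sketch of the per-node `walRecoveryProc` loop described above. The record types below are local stand-ins for the ones this IEP proposes (`ClusterSnapshotRecord`, `ConsistentCutFinishRecord`, `ConsistentCutVersion`), not the real Ignite classes, and the striped-executor submission is elided:

```java
/** Sketch of the per-node WAL recovery loop (walRecoveryProc). */
public class WalRecoverySketch {
    /** Stand-in for the IEP's ConsistentCutVersion (assumed to be a monotonic counter). */
    record ConsistentCutVersion(long ver) {}

    /** Stand-ins for the WAL record types referenced above. */
    sealed interface WalRecord permits ClusterSnapshotRecord, ConsistentCutFinishRecord, DataRecord {}
    record ClusterSnapshotRecord(String snpName) implements WalRecord {}
    record ConsistentCutFinishRecord(ConsistentCutVersion cutVer) implements WalRecord {}
    record DataRecord(int cacheGrpId, int partId) implements WalRecord {}

    /**
     * Applies updates between the base snapshot marker and the cut version
     * agreed by the reducer, then stops.
     */
    static void recover(Iterable<WalRecord> wal, String baseSnp, ConsistentCutVersion target) {
        boolean inRange = false;

        for (WalRecord rec : wal) {
            if (!inRange) {
                // Skip everything written before the base snapshot marker.
                inRange = rec instanceof ClusterSnapshotRecord s && s.snpName().equals(baseSnp);
                continue;
            }

            // Reached the agreed cut: recovery on this node is complete.
            if (rec instanceof ConsistentCutFinishRecord f && f.cutVer().ver() == target.ver())
                return;

            // Real code would submit this to the striped executor keyed by (cacheGrpId, partId).
            if (rec instanceof DataRecord d)
                applyUpdate(d);
        }

        throw new IllegalStateException("WAL ended before reaching the cut " + target);
    }

    static void applyUpdate(DataRecord rec) { /* apply entry updates to the partition */ }
}
```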
...
- Command that creates an incremental snapshot based on the specified base (full) snapshot (see the usage sketch below).
Limitations:
- IS creation fails if the cache schema has changed since the base snapshot. Schemas are restored from the full snapshot, while an incremental snapshot (IS) restores only data changes.
- Compare the base snapshot's cache_data.dat with the current cache info; fail if it has changed.
- IS creation fails if a baseline node was rebalanced since the base snapshot.
- Check the rebalance fact for every cache group with `RebalanceFuture#isInitial()` on node start – it is null if the joining node doesn't need to be rebalanced.
- This fact should be written to the MetaStorage and checked before an incremental snapshot is created (by analogy with GridCacheDatabaseSharedManager#isCheckpointInapplicableForWalRebalance).
- IS creation fails if the baseline topology has changed since the base snapshot.
- The baseline topology is checked relative to the base snapshot.
- IS creation fails if the user tries to create it after restoring the cluster from a previously created incremental snapshot.
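To make the intended usage concrete, here is a minimal sketch against an `IgniteSnapshot`-style API. `createIncrementalSnapshot` is the command this section proposes, so the method name and exact signature are assumptions of this sketch, not a finalized API:

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteSnapshot;
import org.apache.ignite.Ignition;

public class CreateIncrementalSnapshotExample {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start("ignite-config.xml");

        IgniteSnapshot snp = ignite.snapshot();

        // Existing command: create the full (base) snapshot.
        snp.createSnapshot("base").get();

        // Proposed command (hypothetical signature): create an increment on top
        // of "base". Fails on the schema / rebalance / baseline-topology
        // conditions listed above.
        snp.createIncrementalSnapshot("base").get();
    }
}
```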
- Command that restores the specified incremental snapshot (see the usage sketch below).
Limitations:
- Restoring on a different topology is not allowed.
- IS guarantees consistency for Transactional caches only. Ignite should write a WARN to the log suggesting to run the idle_verify check for Atomic caches and restore them with ReadRepair if needed.
- Does not protect cache groups from concurrent operations (both read and write); it only WARNs in the log that the restored cache groups MUST be idle until the operation finishes.
- Snapshot SystemView contains info about incremental snapshots.
- Log messages with the metrics info.
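Correspondingly, a restore call might look as follows. The increment-index parameter of `restoreSnapshot` is an assumption of this sketch, mirroring the existing full-snapshot restore API:

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;

public class RestoreIncrementalSnapshotExample {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start("ignite-config.xml");

        // Proposed command (hypothetical signature): restore the base snapshot
        // plus its first increment (null = all snapshot cache groups). The
        // cluster must have the same topology as when the snapshot was taken,
        // and the restored cache groups must stay idle until the restore finishes.
        ignite.snapshot().restoreSnapshot("base", null, 1).get();
    }
}
```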
...
- Restore of an incremental snapshot should be able to overcome WAL inconsistency caused by rebalance.
- Improve the transaction recovery mechanism: recovery messages are now packed with the ConsistentCutVersion, if it was set (see the sketch below).
- Strictly forbid concurrent operations while restoring.
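A rough illustration of the recovery-message change; `TxRecoveryRequest` and its fields are hypothetical names for this sketch, not the actual Ignite message class:

```java
/**
 * Hypothetical shape of a tx recovery message extended with the cut version,
 * so that nodes recovering a prepared transaction can agree on which side of
 * the consistent cut it belongs to.
 */
record TxRecoveryRequest(
    long nearXidVersion,          // id of the transaction being recovered
    ConsistentCutVersion cutVer   // null if no consistent cut was set
) {}

/** Stand-in for the IEP's ConsistentCutVersion. */
record ConsistentCutVersion(long ver) {}
```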
Phase 3
- Restore of an incremental snapshot should handle inconsistency of Atomic caches.
...
{"serverDuration": 168, "requestCorrelationId": "d152d6b1cfdd7303"}