IDIEP-89
Author:
Sponsor:
Created: 16.05.2021
Status: ACTIVE

Motivation

A user might want to restore a cluster after a crash or data loss to a specific point in the past (before the accident) with low RPO and RTO metrics (https://en.wikipedia.org/wiki/Disaster_recovery). It's also assumed that the cluster is restored in a consistent state: data on primary and backup nodes are equal, and previously committed and rolled-back transactions are fully and correctly completed on all nodes.

This can be achieved with PITR (point-in-time recovery), frequent (incremental) snapshots, or a combination of both approaches.

Current Ignite approaches

Currently there are several ways to achieve this, but all of them have disadvantages:

  1. Using a snapshot (as a base point) + an external queue (Kafka, for example) with inserted data or operations:
    - The current snapshot process is pretty heavy: it relies on PME (a stop-the-world event), puts load on IO (checkpointing, copying partition files) and RAM (using copy-on-write for updating pages), and is time-consuming => high cost for creating snapshots frequently.
    - Additional cost of maintaining an external persistent storage (queue) with all changes.
    - The user is required to implement part of the failover (and restore) logic and align the rest of the solution with it.

  2. Using hot stand-by implemented with CDC or a reverse proxy:
    - High cost to guarantee consistency of the production and stand-by clusters (it also requires guaranteeing the same order of all operations on both clusters).
    - Additional cost of maintaining a second cluster and keeping the clusters in sync.

    - In case the main cluster fails, the system runs on a single instance. The failed cluster must be brought back quickly to be confident that the system is failure-tolerant. But creating a snapshot on the second instance (to use it to start up the first one) under load is risky and isn't recommended.

  3. There is a default crash-recovery process for Ignite nodes: checkpoint + WAL + (historical) rebalance. It could be made to work from a snapshot (instead of a checkpoint), but there are issues:
    - The default restore process doesn't apply transactions from the WAL - there is not enough info on whether a transaction was committed or rolled back on other nodes. Instead, a node receives updates from other nodes during the rebalance process.
    - WAL files are automatically cleaned (unless an unlimited size is explicitly configured).

Proposed solution: Lightweight incremental snapshots

Ignite should provide:

  • a lightweight alternative to full snapshot creation (to avoid resource consumption)
  • a lightweight way to guarantee data consistency between nodes for such a snapshot (to avoid PME)

The proposed solution is incremental snapshots, based on the Consistent Cut algorithm and collected WAL segments:

  • It provides lightweight creation (the user can create them frequently enough to meet the target RPO):
    • Consistent Cut is a runtime, non-blocking, background process. It doesn't require PME, doesn't put pressure on disk or RAM, and is fast.
    • The snapshot consists of WAL segments only. It's fast to create as it just collects archived segments; it doesn't put pressure on disk or RAM and doesn't affect the user's runtime operations (no COW, no locks).
  • It guarantees transaction consistency:
    • Consistent Cut writes special marks into WAL files at runtime. The algorithm guarantees that the collection of committed transactions (and data entries within those transactions) prior to such a mark is equal on every node within the cluster.
    • Restoring is a mostly node-local process. It's safe to restore every Ignite node autonomously (without any messages between nodes) up to such a mark.
    • It is faster than replaying all user updates from scratch, but slower than starting from a full snapshot.

Cons and costs of the lightweight alternative:

  • RPO is good, but it still has a lower bound (up to several seconds, depending on cluster size, input load, transaction duration):
    • Consistent Cut is discrete and guarantees consistency for a batch of transactions, not for every transaction;
    • time is required to wait while WAL segments are compacted and copied into the snapshot directory.
  • RTO depends linearly on snapshot size.
  • Consistent Cut doesn't guarantee data consistency for Atomic caches. An additional step using ReadRepair is required.

Workflow

Prerequisites

  1. A pre-existing full snapshot.
  2. All WAL segments since the previous snapshot (full or incremental) are present.
  3. WAL segments are configured for compaction (`DataStorageConfiguration#setWalCompactionEnabled = true`); see the configuration sketch below.
    1. The snapshot doesn't need physical records for restoring. Compacting segments excludes this excess info and thus optimizes 1. space usage, 2. RTO.
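
For illustration, a minimal Java configuration sketch that satisfies prerequisite 3. Only setWalCompactionEnabled is the prerequisite itself; the persistence settings here are illustrative:

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class WalCompactionConfig {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration()
            .setDataStorageConfiguration(new DataStorageConfiguration()
                // Required by prerequisite 3: archived segments are compacted,
                // so incremental snapshots can collect them.
                .setWalCompactionEnabled(true)
                .setDefaultDataRegionConfiguration(new DataRegionConfiguration()
                    .setPersistenceEnabled(true)));

        Ignite ignite = Ignition.start(cfg);
    }
}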

Consistent Cut

The algorithm description can be found on this page: Consistent Cut

The main ideas related to incremental snapshots:

  1. A cut can be consistent or inconsistent. It's prohibited to create a snapshot on an inconsistent cut.
  2. Restoring requires reading the WAL ahead for the last incremental snapshot (see the sketch after this list):
    1. There are 2 records in the WAL for every consistent cut: IncrementalSnapshotStartRecord and IncrementalSnapshotFinishRecord.
    2. IncrementalSnapshotFinishRecord contains info about which transactions logged before IncrementalSnapshotStartRecord have to be excluded from the incremental snapshot.
    3. Thus it's important to read the WAL ahead, reach IncrementalSnapshotFinishRecord, and only after that apply entries since the previous incremental snapshot.
  3. In some circumstances it's impossible to create incremental snapshots anymore, and a full snapshot should be created (see the limitations in Phase 1 below).
  4. Only one incremental snapshot can be created at a time; concurrent processes are not allowed.
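
To illustrate item 2, a self-contained sketch of the read-ahead logic. All types here (WalRecord, FinishRecord, DataRecord) are simplified stand-ins for illustration, not Ignite's internal WAL record classes:

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

public class ReadAheadSketch {
    interface WalRecord { }

    // Marks the end of a consistent cut and lists transactions logged
    // before the corresponding start record that must be excluded.
    static class FinishRecord implements WalRecord {
        Set<Long> excludedTxs;
    }

    // A data update belonging to some transaction.
    static class DataRecord implements WalRecord {
        long txId;
    }

    // Buffers entries while reading ahead; only after the FinishRecord is
    // reached is it known which transactions to drop, so only then can the
    // remaining entries be applied safely.
    static List<DataRecord> entriesToApply(Iterator<WalRecord> wal) {
        List<DataRecord> buffered = new ArrayList<>();

        while (wal.hasNext()) {
            WalRecord rec = wal.next();

            if (rec instanceof DataRecord)
                buffered.add((DataRecord)rec);
            else if (rec instanceof FinishRecord) {
                buffered.removeIf(e -> ((FinishRecord)rec).excludedTxs.contains(e.txId));

                return buffered;
            }
        }

        throw new IllegalStateException("IncrementalSnapshotFinishRecord not reached");
    }
}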

Create an incremental snapshot

// Create incremental snapshot.
// SNP - name of a pre-existing full snapshot.
// [--retries N] - number of attempts to create the incremental snapshot in case of an inconsistent cut. Default is 3.
$ control.sh --snapshot create SNP --incremental [ --retries N ]

Under the hood this command:

  1. Makes checks:
    1. The base snapshot (at least its metafile) exists; metafiles exist for all previous incremental snapshots.
    2. Validates that there are no missing WAL segments since the previous snapshot. SnapshotMetadata should contain info about the last WAL segment that contains snapshot data:
      1. If the snapshot is full: a ClusterSnapshotRecord is written to the WAL, and the segment number of this record is stored within the existing SnapshotMetadata structure.
      2. If the snapshot is incremental: the stored segment number is the segment that contains IncrementalSnapshotFinishRecord.
    3. Checks that the baseline topology is the same (relative to the base snapshot).
    4. Checks that the WAL is consistent (WAL was not disabled since the previous snapshot) - this info is stored in the MetaStorage.
  2. Starts a new Consistent Cut.
  3. On finishing the Consistent Cut:
    1. if the cut is consistent: logs IncrementalSnapshotFinishRecord with rolloverType=CURRENT_SEGMENT to enforce archiving the segment after logging the record.
    2. if the cut is inconsistent: skips logging IncrementalSnapshotFinishRecord and retries from step 1.
    3. fails if the retry attempts are exceeded.
  4. Awaits until the segment with IncrementalSnapshotFinishRecord has been archived and compacted.
  5. Collects WAL segments for the current incremental snapshot (from the previous snapshot to IncrementalSnapshotFinishRecord).
  6. Creates hardlinks to the compressed segments in the target directory.
  7. Writes meta files with a description of the new incremental snapshot:
    1. meta.smf:
      1. Pointer to IncrementalSnapshotFinishRecord.
    2. binary_meta, marshaller_data, if changed since the previous snapshot.
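
Besides the CLI, the operation is presumably exposed through the public Java snapshot API as well; a sketch, assuming an IgniteSnapshot#createIncrementalSnapshot method that mirrors the command (the method name and signature are an assumption, not confirmed by this IEP):

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;

public class CreateIncrementalSnapshot {
    public static void main(String[] args) {
        Ignite ignite = Ignition.ignite(); // connect to a started node

        // Assumed Java counterpart of the CLI command above: creates an
        // increment for the pre-existing full snapshot "SNP".
        ignite.snapshot().createIncrementalSnapshot("SNP").get();
    }
}
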
# Proposed directory structure
$ ls $IGNITE_HOME
db/
snapshots/
|-- SNP/
|---- db/
|---- increments/
|------ 0000000000000001/
|-------- node0.smf
|-------- db/
|---------- binary_meta/
|---------- marshaller/
|-------- wals/
|---------- 0000000000000000.wal.zip

Restore process

// Restore cluster on a specific incremental snapshot
$ control.sh --snapshot restore SNP --increment 1

With the control.sh --snapshot restore command:

  1. The user specifies the full snapshot name and the increment index.
  2. Ignite parses the snapshot name and extracts the base and incremental snapshots.
  3. After the full snapshot restore processes (prepare, preload, cacheStart) have finished, it starts another DistributedProcess - `walRecoveryProc`:
    1. Every node applies WAL segments since the base snapshot until the requested IncrementalSnapshotFinishRecord is reached.
    2. Ignite should forbid concurrent operations (both read and write) on the restored cache groups during WAL recovery.
    3. The process of applying data for the snapshot cache groups (from the base snapshot) is similar to the GridCacheDatabaseSharedManager logical restore:
      1. disable WAL for the specified cache group
      2. find the `ClusterSnapshotRecord` related to the base snapshot
      3. start applying WAL updates with the striped executor (cacheGrpId, partId), applying the filter for versions in ConsistentCutFinishRecord
      4. enable WAL for the restored cache groups
      5. force a checkpoint and check the restore state (checkpoint status, etc.).
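
As with creation, restore is presumably also available via the public Java API; a sketch, assuming a restoreSnapshot overload that takes an increment index (the overload is an assumption, not confirmed by this IEP):

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;

public class RestoreIncrementalSnapshot {
    public static void main(String[] args) {
        Ignite ignite = Ignition.ignite();

        // Assumed Java counterpart of the CLI command above: restores the
        // full snapshot "SNP" plus its first increment; null = all cache groups.
        ignite.snapshot().restoreSnapshot("SNP", null, 1).get();
    }
}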

Checking snapshot

// Check specific incremental snapshot
$ control.sh --snapshot check SNP --increment 1

The control.sh --snapshot check command performs the following steps on every baseline node:

  1. Check that the snapshot files are consistent:
    1. the snapshot structure is valid and the metadata matches the actual snapshot files;
    2. all WAL segments are present (from ClusterSnapshotRecord to the requested IncrementalSnapshotFinishRecord).
  2. Check incremental snapshot data integrity:
    1. It parses WAL segments from the first incremental snapshot to the specified one (the --increment param).
    2. For every partition it calculates hashes for entries and for entry versions.
      1. On the reduce phase it compares partition hashes between primary and backup copies.
    3. For every pair of nodes that participated as primary nodes it calculates a hash of committed transactions (see the sketch below). For example:
      1. There are two transactions:
        1. TX1, with 2 nodes participating as primary nodes: A and B
        2. TX2, with 2 nodes: A and C
      2. Node A prepares 2 collections: TxHashAB = [hash(TX1)], TxHashAC = [hash(TX2)]
      3. Node B prepares 1 collection: TxHashBA = [hash(TX1)]
      4. Node C prepares 1 collection: TxHashCA = [hash(TX2)]
      5. On the reduce phase of the check it compares the collections from all nodes and expects that:
        1. TxHashAB equals TxHashBA
        2. TxHashAC equals TxHashCA
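
A self-contained model of this pairwise check (simplified illustration, not Ignite code): each node builds, per peer primary node, the collection of hashes of transactions in which both nodes participated as primaries; the reduce phase asserts the mirrored collections are equal.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

public class TxHashCheckSketch {
    // nodeId -> (peer nodeId -> hashes of transactions shared with that peer)
    static Map<String, Map<String, List<Integer>>> perNode = new HashMap<>();

    // Map phase: every committed transaction contributes its hash to the
    // collection kept for each pair of its primary nodes.
    static void onCommittedTx(String txId, List<String> primaries) {
        for (String a : primaries)
            for (String b : primaries)
                if (!a.equals(b))
                    perNode.computeIfAbsent(a, k -> new HashMap<>())
                        .computeIfAbsent(b, k -> new ArrayList<>())
                        .add(txId.hashCode());
    }

    // Reduce phase: TxHash(A,B) must equal TxHash(B,A) for every pair.
    static boolean reduce() {
        return perNode.entrySet().stream().allMatch(a ->
            a.getValue().entrySet().stream().allMatch(b ->
                Objects.equals(b.getValue(),
                    perNode.get(b.getKey()).get(a.getKey()))));
    }

    public static void main(String[] args) {
        onCommittedTx("TX1", List.of("A", "B"));
        onCommittedTx("TX2", List.of("A", "C"));
        System.out.println(reduce()); // true: TxHashAB==TxHashBA, TxHashAC==TxHashCA
    }
}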

Note that checking an incremental snapshot doesn't check the data of the related full snapshot. Thus a full check of a snapshot consists of two steps:

  1. Check full snapshot
  2. Check incremental snapshot

Atomic caches

For Atomic caches it's required to restore data consistency (between primary and backup nodes) differently, with the ReadRepair feature. Consistent Cut relies on transaction protocol messages (Prepare, Finish); the Atomic cache protocol doesn't have enough messages to sync different nodes.

The restore process should suggest that the user perform an additional step if ATOMIC caches are restored (see the sketch below):

  1. Check partition states with the `idle_verify` command;
  2. Start read-repair for inconsistent keys in lazy mode: on user get() operations related to broken cache keys.
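
A sketch of the suggested repair step, assuming Ignite's Read Repair API (withReadRepair and ReadRepairStrategy; availability depends on the Ignite version). The cache name and key are hypothetical:

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.ReadRepairStrategy;

public class RepairAtomicCache {
    public static void main(String[] args) {
        Ignite ignite = Ignition.ignite();

        // "atomic-cache" and key 42 are hypothetical; take the actual broken
        // keys from the idle_verify report.
        IgniteCache<Integer, String> cache = ignite.cache("atomic-cache");

        // Reading through the Read Repair proxy fixes inconsistent copies;
        // LWW (last write wins) is one of the available strategies.
        cache.withReadRepair(ReadRepairStrategy.LWW).get(42);
    }
}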

Metrics

  1. Incremental snapshot creation:
    1. Creation time (== minimum possible RPO).
    2. Number of retries (due to cut inconsistency) while creating the snapshot.
    3. Consistent Cut time (time between writing ConsistentCutStartRecord and ConsistentCutFinishRecord).
    4. Size of the snapshot.
  2. Incremental snapshot restore:
    1. Full snapshot restore time, incl. restoring the base snapshot (== RTO).
    2. Time of the WAL restore phase (part of the snapshot restore).
    3. Number of parsed and restored WAL records / transactions / cache entries.

Snapshot System View

It's proposed to add new columns to the snapshot system view: the type of the snapshot (full, incremental) and the name of the base snapshot (for incremental).

class SnapshotView {
	String name;
	String consistentId;
	String baselineNodes;
	String cacheGrps;
	
	// New.
	String type;
	String baseSnp;
}

Phases

Phase 1 (MVP)

The goal of this phase is to deliver to the user:

  1. A command that allows creating an incremental snapshot based on the specified base (full) snapshot.

    Limitations:

    1. IS creation fails if the cache schema changed since the base snapshot. Schemas are restored from the full snapshot, while an incremental snapshot (IS) restores only data changes.
      1. Compare the base snapshot cache_data.data with the current cache info; fail if it has changed.
    2. IS creation fails if a baseline node was rebalanced since the base snapshot.
      1. Check the rebalance fact for every cache group with `RebalanceFuture#isInitial()` on node start - it is null if the joining node doesn't need to be rebalanced.
      2. This fact should be written to the MetaStorage and checked before creating an incremental snapshot (by analogy with GridCacheDatabaseSharedManager#isCheckpointInapplicableForWalRebalance).
    3. IS creation fails if the baseline topology changed since the base snapshot.
      1. The baseline topology is checked relative to the base snapshot.
    4. IS creation fails if the user tries to create it after restoring the cluster on a previously created incremental snapshot.


  2. A command that allows restoring the specified incremental snapshot.

    Limitations:

    1. Restoring on a different topology is not allowed.
    2. IS guarantees consistency for Transactional caches only. Ignite should write a WARN to the log suggesting to run the idle_verify check for Atomic caches and restore them with ReadRepair if needed.
    3. Does not protect cache groups from concurrent operations (both read and write); it just WARNs in the log that the restored cache groups MUST be idle until the operation finishes.


  3. Snapshot SystemView contains info about incremental snapshots.
  4. Log messages with the Metrics info.

Phase 2

  1. Restoring an incremental snapshot should be able to overcome WAL inconsistency caused by rebalance.
  2. Improve the transaction recovery mechanism: recovery messages are now packed with the ConsistentCutVersion, if it was set.
  3. Strictly forbid concurrent operations while restoring.

Phase 3

  1. Restoring an incremental snapshot should handle inconsistency of Atomic caches.

Phase 4

  1. Make it possible to restore an incremental snapshot on a different topology.

Links

Known solutions for creating backups ("snapshot" in terms of Ignite) in other databases:

  1. Consistent backup + continuous WAL archiving + MVCC (hybrid clocks)
    1. Cockroach (incremental backups + PITR).
      1. https://www.cockroachlabs.com/docs/stable/architecture/storage-layer.html#mvcc
      2. https://www.cockroachlabs.com/docs/v22.1/take-backups-with-revision-history-and-restore-from-a-point-in-time
    2. Yugabyte (full backup + PITR). Backups are created with a command, or by built-in schedule.
      1. https://docs.yugabyte.com/preview/architecture/transactions/transactions-overview/#mvcc
      2. https://docs.yugabyte.com/preview/manage/backup-restore/point-in-time-recovery/
    3. Tarantool (incremental backups). The user can achieve continuous backup (PITR) by copying WAL files themselves.
      1. https://www.tarantool.io/en/doc/latest/book/box/atomic/txn_mode_mvcc/
      2. https://www.tarantool.io/en/doc/latest/book/admin/backups/
    4. Mongo (PITR):
      1. https://www.youtube.com/watch?v=quFheFrLLGQ
      2. https://github.com/wal-g/wal-g/blob/master/docs/MongoDB.md
    5. HBase (incremental backups):
      1. https://hadoop-hbase.blogspot.com/2012/03/acid-in-hbase.html
      2. https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.3/bk_data-access/content/ch_hbase_bar.html
      3. http://hadoop-hbase.blogspot.com/2012/04/timestamp-consistent-backups-in-hbase.html
  2. Inconsistent backups + continuous WAL archiving
    1. Cassandra (incremental inconsistent backups + PITR). Cassandra restores key (but not transaction) consistency after a restore using the ReadRepair feature.
      1. https://docs.datastax.com/en/cassandra-oss/2.2/cassandra/operations/opsAboutSnapshots.html?hl=incremental%2Cbackup%2Cconsistency
      2. https://docs.datastax.com/en/opscenter/6.5/opsc/online_help/services/opscCommitlogBackups.html

DevList discussion 

https://lists.apache.org/thread/n045p88c714wovy7cyolq1bbfpttc9yz


6 Comments

  1. > SNP_2022-01-01T00:00:00 (incremental)
    We must make it clear that this snapshot refers to SNP. Let's write this in the description: `(incremental, base=SNP)`.
    1. Remove the `control.sh --snapshot list` command from this IEP. I propose to add the info to the snapshot system view instead.

  2. > User command that creates incremental snapshot (collection of compacted WAL files) based on Consistent Cut.
    I think we should print which full snapshot is used as the base volume for an IS. It's not clear from the proposed output at the moment.
  3. > --create
    It's not clear what this flag means. Please clarify.
    1. We also must print some warnings if ATOMIC caches are restored with an IS.

    2. Remove this flag altogether. I think there is no need for automatic full snapshot creation after restoring on an incremental snapshot:

      1. After recovering, the user will have: the current persistent store, the newly created snapshot, and the snapshot and WAL archives used for the restore - it may require too much additional space.
      2. We already describe cases that make creation of an incremental snapshot unavailable (rebalance, schema change, baseline topology change). Restoring on an incremental snapshot is just one more case that disables creating incremental snapshots.