
IDIEP-109

Author:
Sponsor:
Created: 17.07.2023
Status: DRAFT


Motivation

IEP-43 introduced the persistent cache snapshots feature. This feature is highly adopted by Ignite users.

In-memory cache snapshots will simplify the following use cases:

  • In-memory cluster restarts.
  • Version upgrade.
  • Disaster recovery.
  • DC/Hardware replacement.
  • Data motion.

Description

In-memory snapshots will reuse the existing snapshot code where possible, so the key design decisions stay the same. It is assumed that the reader knows and understands IEP-43, so this description focuses on the differences between persistent and in-memory snapshots.

API

A new optional flag --mode will be added to the snapshot create command.
Possible values are PERSISTENT, INMEMORY and BOTH.
The default value is PERSISTENT to keep the existing behaviour of the command.
When concurrent creation of in-memory and persistent snapshots is implemented, the default value can be changed to BOTH.

Example:

> ./control.sh --snapshot --create SNP --mode INMEMORY

Creation

1. PME guarantees visibility of all dirty pages:

Only a PME is required for in-memory snapshots. The page write listener can be set during the PME because no concurrent transactions are allowed at that point (a sketch follows the list below).

See:

  • PartitionsExchangeAware#onDoneBeforeTopologyUnlock
  • IgniteSnapshotManager#onDoneBeforeTopologyUnlock
  • SnapshotFutureTask#start
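A minimal sketch of this hook, assuming the internal PartitionsExchangeAware interface referenced above; the handler class and the startSnapshotTask method are hypothetical names for illustration, not the actual implementation:

import org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture;
import org.apache.ignite.internal.processors.cache.distributed.dht.preloader.PartitionsExchangeAware;

// Hypothetical handler: starts the in-memory snapshot task at the PME stage where
// the topology is still locked and no concurrent transactions can run.
class InMemorySnapshotStartHandler implements PartitionsExchangeAware {
    @Override public void onDoneBeforeTopologyUnlock(GridDhtPartitionsExchangeFuture fut) {
        // Install the page write listener and start the snapshot task here
        // (compare with SnapshotFutureTask#start for persistent snapshots).
        startSnapshotTask(fut);
    }

    private void startSnapshotTask(GridDhtPartitionsExchangeFuture fut) {
        // Snapshot task start logic would go here.
    }
}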

2. Storage unit:

In-memory caches store pages in a configured DataRegion. Pages of a specific cache group are allocated in some segment of the data region.

So, unlike persistent caches, it is more convenient and less error-prone to create a snapshot of the whole DataRegion with all the caches in it.

During snapshot creation, a node must track all page changes; this can be implemented by a listener on page write locks in PageMemoryNoStoreImpl (see the sketch below).
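A minimal sketch of such tracking, assuming a hypothetical write-lock listener hook in PageMemoryNoStoreImpl; the PageWriteLockListener interface and the SnapshotPageTracker class are illustrative names only:

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical listener interface: invoked before a write lock is acquired on a page.
interface PageWriteLockListener {
    void onBeforeWriteLock(int grpId, long pageId, long pageAddr);
}

// Copy-on-write tracker: copies each page into the snapshot once, before its first modification.
class SnapshotPageTracker implements PageWriteLockListener {
    /** Page IDs already copied into the snapshot. */
    private final Set<Long> written = ConcurrentHashMap.newKeySet();

    @Override public void onBeforeWriteLock(int grpId, long pageId, long pageAddr) {
        if (written.add(pageId))
            copyPageToSnapshot(grpId, pageId, pageAddr);
    }

    private void copyPageToSnapshot(int grpId, long pageId, long pageAddr) {
        // Read the page bytes from pageAddr and append them to the snapshot file.
    }
}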

3. Persistent pages contain a CRC to ensure data integrity when stored on disk:

A CRC for each page must be calculated and written to the snapshot metadata during snapshotting.

The CRC must be checked during restore.
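A minimal sketch of per-page CRC calculation using the JDK CRC32 class; the real implementation would more likely reuse Ignite's existing page CRC utilities:

import java.nio.ByteBuffer;
import java.util.zip.CRC32;

final class PageCrc {
    /** Calculates a CRC over the whole page buffer (positioned at the page start); the result goes into the snapshot metadata. */
    static int calcPageCrc(ByteBuffer pageBuf, int pageSize) {
        CRC32 crc = new CRC32();

        byte[] bytes = new byte[pageSize];
        pageBuf.duplicate().get(bytes);
        crc.update(bytes, 0, pageSize);

        return (int)crc.getValue();
    }
}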

4. Metadata:

The following metadata must be properly prepared and saved during the snapshot:

  • StoredCacheData.
  • binary_meta.
  • marshaller.
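A sketch of saving metadata directories into the snapshot folder with plain JDK file APIs; the helper class and the paths are hypothetical, and the real implementation would reuse the existing snapshot metadata handling:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

final class SnapshotMetadataCopier {
    /** Recursively copies a metadata directory (e.g. binary_meta or marshaller) into the snapshot directory. */
    static void copyMetaDir(Path src, Path snpDir) throws IOException {
        try (Stream<Path> files = Files.walk(src)) {
            for (Path p : (Iterable<Path>)files::iterator) {
                Path dst = snpDir.resolve(src.relativize(p));

                if (Files.isDirectory(p))
                    Files.createDirectories(dst);
                else
                    Files.copy(p, dst);
            }
        }
    }
}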

Restore

Prerequisites:

  • The restored data region is empty: there are no caches stored in it.
  • The number of nodes in the cluster is the same as at the time of snapshot creation (this restriction can be eliminated in Phase 2 of the IEP implementation).
  • All nodes in the cluster have the snapshot locally.

Steps:

  1. Block the data region exclusively on each node: any attempt to use it (e.g. cache creation) must be blocked.
  2. Restore all saved data into the data region.
  3. Restore all saved metadata.
  4. Wait until all nodes complete steps 2 and 3.
  5. Start the caches that belong to the restored data region.
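A hypothetical per-node restore flow matching the steps above; every method is an illustrative stub, not an existing Ignite API:

class InMemorySnapshotRestore {
    void restore(String snpName, String dataRegionName) {
        blockDataRegion(dataRegionName);        // Step 1: reject any usage of the region (e.g. cache creation).
        restorePages(snpName, dataRegionName);  // Step 2: load the saved pages into the data region.
        restoreMetadata(snpName);               // Step 3: StoredCacheData, binary_meta, marshaller.
        awaitAllNodes();                        // Step 4: distributed barrier for steps 2 and 3.
        startCaches(dataRegionName);            // Step 5: start the caches of the restored region.
    }

    private void blockDataRegion(String regionName) { /* reject cache starts in the region */ }
    private void restorePages(String snpName, String regionName) { /* copy pages, verify CRC */ }
    private void restoreMetadata(String snpName) { /* restore cache and binary metadata */ }
    private void awaitAllNodes() { /* cluster-wide synchronization */ }
    private void startCaches(String regionName) { /* start caches of the restored region */ }
}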

Risks and Assumptions

  • DataRegionConfiguration#persistenceEnabled=false for in-memory caches by definition.
  • The same value must be set for DataRegionConfiguration when a cache group is restored from an in-memory snapshot (see the configuration sketch after this list).
  • After this feature is implemented, PageIO will be required to stay backward compatible.
  • The way to restore a snapshot on a different topology must be further investigated.
  • Empty pages of the DataRegion will be written to the snapshot.
  • Snapshot compaction must be further investigated.
  • No concurrent snapshot operations (persistent and in-memory) are allowed. This restriction can be eliminated in later phases to provide the ability to create a full cluster snapshot with one command.
  • In the case of a mixed cluster (both persistent and in-memory data regions exist), the metastorage is persistent and must be included into the in-memory snapshot.
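A minimal configuration sketch of an in-memory data region, assuming the public Ignite configuration API; the region and cache names are illustrative:

import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class InMemoryRegionConfig {
    public static IgniteConfiguration config() {
        DataRegionConfiguration inMemRegion = new DataRegionConfiguration()
            .setName("in-memory-region")
            // persistenceEnabled=false by definition for in-memory caches; the same value
            // must be set when a cache group is restored from an in-memory snapshot.
            .setPersistenceEnabled(false);

        DataStorageConfiguration storageCfg = new DataStorageConfiguration()
            .setDataRegionConfigurations(inMemRegion);

        return new IgniteConfiguration()
            .setDataStorageConfiguration(storageCfg)
            .setCacheConfiguration(new CacheConfiguration<>("in-memory-cache")
                .setDataRegionName("in-memory-region"));
    }
}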

Discussion Links

// Links to discussions on the devlist, if applicable.

Reference Links

Tickets

