The following notation is used: words written in italics and without spacing denote a class name (without the package name) or a method name, for example, GridCacheMapEntry
Table of Contents:
The following file types are used for persisting data: cache pages (the page store), checkpoint markers, and WAL segments.
Ignite with persistence enabled uses the following folder structure:
| 2.3+ | Older versions (2.1 & 2.2) |
|---|---|
| The pst-subfolder name is the same for all storage folders. The name is selected at start and may be based on the node's consistentId. | The subfolder name is generated at start. By default new-style naming is used, for example node00-e819f611-3fb9-4dbe-a3aa-1f6de4af5d02 |
Option 1. If Ignite is started with a clean persistence storage root folder, this naming is used.
Option 2. This option is used when there is an existing pst-subfolder with the exact same name and a compatible consistent ID (local host IPs and port list). If such a folder exists, Ignite starts using it and the consistent ID is not changed.
Option 3. This option is applied when there is a preconfigured value from IgniteConfiguration.
If there is an old-style folder but its name does not match a compatible consistent ID, the following warning is generated:
There is other non-empty storage folder under storage base directory [work\db\127_0_0_1_49999, 299718 bytes, modified 10/04/2017 04:33 PM ]
Two file locks are used during folder selection.
The first lock checks that no running node is already using the same directory.
This lock is placed at work/db/{pst-subfolder}/lock (work/db can still be customized via the storage folder property).
The second lock is placed in the storage root folder: work/db/lock. This lock is held for a short time while a new pst-subfolder is being created. It protects against concurrent folder initialisation by nodes that are starting simultaneously.
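The per-folder lock can be illustrated with a minimal sketch using java.nio file locks. The class and method names here are hypothetical, not Ignite's actual implementation; a real node would hold the lock until shutdown instead of releasing it immediately.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class NodeFolderLock {
    // Try to take the per-node lock file inside a pst-subfolder.
    // Returns true if the folder is free (no other process holds the lock).
    static boolean tryLockFolder(Path subfolder) {
        try (FileChannel ch = FileChannel.open(subfolder.resolve("lock"),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            FileLock lock = ch.tryLock();
            if (lock == null)
                return false;   // another process owns this folder
            lock.release();     // a real node keeps holding it until stop
            return true;
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo helper: lock a freshly created temporary pst-subfolder.
    static boolean tryLockTempFolder() {
        try {
            return tryLockFolder(Files.createTempDirectory("pst-subfolder"));
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The root-folder lock at work/db/lock follows the same pattern, just held only for the brief window while the new subfolder is created.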
Exact generation algorithm and code references:
Ignite Durable Memory is the basis for all data structures. No cache state is kept on the Java heap anymore.
To save cache state to disk, we can dump all of its pages to disk. The first prototypes used exactly this simple approach: stop all updates and save all pages.
The page store is the storage for all pages related to a particular cache (the cache's partitions and SQL indexes).
Using the page identifier it is possible to map a page ID to a file and to a position within that file:
pageId = ... || partition ID || page index (idx)
offset = idx * pageSize  // pageId is easily converted to a file + an offset within that file
Each partition of each cache has a corresponding file in the page store directory (a particular node may own only a subset of partitions).
Each cache has a corresponding folder in the page store (cache-(cache-name)), and each owned (or backup) partition of this cache has its related file.
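The pageId-to-offset mapping above can be sketched with simple bit arithmetic. The field widths below are a simplified assumption for illustration (low 32 bits = page index, next 16 bits = partition ID); the actual Ignite encoding also carries flag bits.

```java
public class PageIdMapping {
    // Hypothetical simplified layout: low 32 bits hold the page index,
    // the next 16 bits hold the partition ID.
    static int partitionId(long pageId) {
        return (int) ((pageId >>> 32) & 0xFFFF);
    }

    static int pageIndex(long pageId) {
        return (int) (pageId & 0xFFFFFFFFL);
    }

    // The partition ID selects the partition file; within that file
    // the page sits at offset = idx * pageSize.
    static long fileOffset(long pageId, int pageSize) {
        return (long) pageIndex(pageId) * pageSize;
    }
}
```

For example, a pageId encoding partition 5 and page index 3 maps to the partition-5 file at offset 3 * pageSize.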
Cache page storage contains the following files:
Checkpointing is the process of storing dirty pages from RAM to disk so that a consistent memory state is persisted. At the end of the process, the page state on disk is the state as of the time the process began.
There are two approaches to implementing checkpointing:
The approach implemented in Ignite is Sharp Checkpointing; Fuzzy Checkpointing (F.C.) is to be done in future releases.
To achieve consistency, a checkpoint read-write lock is used (see GridCacheDatabaseSharedManager#checkpointLock).
While the checkpoint write lock is held, we do the following:
Then the checkpoint write lock is released, and updates and transactions can run again.
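The locking pattern above can be sketched as follows (hypothetical class, not Ignite's actual code): every update runs under the read lock, so updates proceed in parallel, while the checkpointer briefly takes the write lock to capture a consistent view of the dirty page set.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class CheckpointLockSketch {
    private final ReentrantReadWriteLock cpLock = new ReentrantReadWriteLock();
    private int dirtyPages;

    // Every cache update runs under the read lock, so many updates can
    // proceed in parallel with each other.
    void update() {
        cpLock.readLock().lock();
        try {
            dirtyPages++; // mark the page dirty, log to WAL, etc.
        }
        finally {
            cpLock.readLock().unlock();
        }
    }

    // The checkpointer briefly takes the write lock to capture a consistent
    // snapshot of the dirty page set; updates are blocked only for this moment.
    int beginCheckpoint() {
        cpLock.writeLock().lock();
        try {
            int snapshot = dirtyPages;
            dirtyPages = 0; // these pages now belong to the current checkpoint
            return snapshot;
        }
        finally {
            cpLock.writeLock().unlock();
        }
    }

    // Demo helper: three updates, then a checkpoint captures all three.
    static int demo() {
        CheckpointLockSketch s = new CheckpointLockSketch();
        s.update(); s.update(); s.update();
        return s.beginCheckpoint();
    }
}
```

The actual page writing happens after the write lock is released, which is why the copy-on-write mechanism described below is needed.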
Dirty pages form a set: when a page changes from non-dirty to dirty, it is added to this set.
The collection of pages (GridCacheDatabaseSharedManager.Checkpoint#cpPages) lets us collect and then write the pages that were changed since the last checkpoint.
In parallel with the process of writing pages to disk, some thread may want to update data in a page that is being written (or is scheduled to be written).
For this case a checkpoint pool is used for pages updated in parallel with the write. This pool has a size limit.
A copy-on-write technique is used: if a modification is required in a page that is under checkpoint, a temporary copy of this page is created in the checkpoint pool.
To perform a write to a dirty page scheduled to be checkpointed, the following is done:
If a page was not involved in the checkpoint initially and is updated concurrently with the checkpointing process:
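The copy-on-write step can be sketched as follows (hypothetical names; real pages live in durable memory segments, not Java arrays): before the first concurrent modification, the page content is preserved in the checkpoint pool, and the checkpointer persists that preserved copy.

```java
import java.util.Arrays;

public class CowPageSketch {
    byte[] page = new byte[16]; // live page, scheduled for the current checkpoint
    byte[] checkpointCopy;      // temporary copy placed in the checkpoint pool

    // Writing to a page under checkpoint: first preserve the original
    // content in the checkpoint pool, then modify the live page.
    void writeUnderCheckpoint(int off, byte val) {
        if (checkpointCopy == null)
            checkpointCopy = Arrays.copyOf(page, page.length);
        page[off] = val;
    }

    // The checkpointer persists the preserved copy (pre-update state),
    // so the checkpoint stays consistent with its begin time.
    byte[] pageToPersist() {
        return checkpointCopy != null ? checkpointCopy : page;
    }

    // Demo helper: after a concurrent write, the persisted bytes still show
    // the old value while the live page shows the new one.
    static boolean demo() {
        CowPageSketch s = new CowPageSketch();
        s.writeUnderCheckpoint(0, (byte) 7);
        return s.pageToPersist()[0] == 0 && s.page[0] == 7;
    }
}
```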
Triggers
We can’t control the moment when a node crashes. Suppose we have saved the tree leaves but not the tree root (during page allocation pages may be reordered because allocation is multithreaded). In this case all updates will be lost.
At the same time, we can’t flush every memory page update to disk immediately - that is too slow.
The technique that solves this is named write-ahead logging: before doing the actual update, we append the planned change information to a cyclic file named the WAL (the operation is called WAL append / WAL log).
After a crash we can read and replay the WAL on top of the already saved page set, restoring the state to the last committed state of the crashed process. Recovery is based on the page store plus the WAL.
In practice we can’t replay the WAL from the beginning of time - Volume(HDD) < Volume(full WAL) - so we need a procedure to discard the oldest changes in the WAL; this is done during checkpointing.
A consistent state comes only from the pair of WAL and page store.
An operation is acknowledged after the operation and the page update(s) have been logged. A checkpoint will be started later by its triggers.
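The log-before-update ordering can be sketched with a plain file as the log (hypothetical class; Ignite's WAL has its own segment format and buffering): the record is appended and forced to disk before the in-memory page is touched, so a crash after the force loses no acknowledged operation.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class WalAppendSketch {
    private final FileChannel wal;

    WalAppendSketch(Path walFile) throws IOException {
        wal = FileChannel.open(walFile, StandardOpenOption.CREATE,
            StandardOpenOption.WRITE, StandardOpenOption.APPEND);
    }

    // Append the planned change to the log and make it durable BEFORE the
    // in-memory page is modified; only then may the operation be acknowledged.
    void logAndApply(String record, Runnable applyToPage) throws IOException {
        wal.write(ByteBuffer.wrap((record + "\n").getBytes(StandardCharsets.UTF_8)));
        wal.force(false);   // durability point: the record survives a crash
        applyToPage.run();  // the actual in-memory page update
    }

    // Demo helper: log one hypothetical update and read the log back.
    static String demo() {
        try {
            Path f = Files.createTempFile("wal", ".seg");
            new WalAppendSketch(f).logAndApply("put k1=v1", () -> { });
            return Files.readString(f);
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```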
Crash recovery can be:
Crash recovery involves the following records written to the WAL; they fall into 2 main types: logical and physical.
Structure of data record:
A data record includes a list of entries (entry operations). Each operation has a cache ID, an operation type, a key and a value. The operation type can be
Update and create operations always contain the full value. At the same time, several updates of the same key within a transaction are merged into one latest update.
Page snapshots and related deltas are combined during WAL replay.
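The entry structure and the merging of same-key updates within a transaction can be sketched as follows (hypothetical types, not Ignite's actual record classes):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DataRecordSketch {
    enum OpType { CREATE, UPDATE, DELETE } // illustrative subset of operation types

    // One entry operation: cache ID, operation type, key and full value.
    static class Entry {
        final int cacheId; final OpType op; final String key; final String val;
        Entry(int cacheId, OpType op, String key, String val) {
            this.cacheId = cacheId; this.op = op; this.key = key; this.val = val;
        }
    }

    // Several updates of the same key within one transaction are merged
    // into a single latest update before the data record is logged.
    static List<Entry> mergeTx(List<Entry> ops) {
        Map<String, Entry> latest = new LinkedHashMap<>();
        for (Entry e : ops)
            latest.put(e.cacheId + "#" + e.key, e); // last write wins
        return new ArrayList<>(latest.values());
    }

    // Demo helper: create + update of the same key collapse to one entry.
    static int demoMergedSize() {
        return mergeTx(Arrays.asList(
            new Entry(1, OpType.CREATE, "k", "v1"),
            new Entry(1, OpType.UPDATE, "k", "v2"))).size();
    }
}
```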
For a particular cache entry update, we log records in the following order:
A planned future optimisation is to refer to the data modified from the PageDeltaRecord to the logical record, which will allow not storing byte updates twice. There is a file WAL pointer - a pointer to a record from the beginning of time - and this reference may be used.
WAL file segments and rotation structure
See also WAL history size section below
Let’s assume the node start process is running with existing files.
Ignite manages 2 types of checkpoint (CP) markers on disk (standalone files that include a timestamp and a WAL pointer):
If we observe only a CP begin marker and no CP end marker, the checkpoint did not finish and the page store is not consistent.
For a crash that happened while no checkpoint process was running, restore is trivial: only logical records are applied.
Physical records (page snapshots and delta records) are ignored because the page store is consistent.
Let’s suppose the crash occurred in the middle of a checkpoint. In that case the restore process will discover markers for CP1 begin, CP1 end, and CP2 begin (but no CP2 end).
Restore is split into 2 stages:
1st stage: starting from the previous completed checkpoint record (CP1) up to the record of CP2 begin (incomplete), we apply all physical records and ignore logical ones.
2nd stage: starting from the marker of the incomplete CP2, we apply only logical records until the end of the WAL is reached.
When replay is finished, the CP2 end marker is added.
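The two stages can be sketched over a simplified record stream (hypothetical types; real WAL records and pointers are more involved): physical records are applied between CP1 begin and CP2 begin, logical records from CP2 begin to the end of the WAL.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RecoverySketch {
    enum Kind { CP_BEGIN, CP_END, PHYSICAL, LOGICAL }

    static class Rec {
        final Kind kind; final String payload;
        Rec(Kind kind, String payload) { this.kind = kind; this.payload = payload; }
    }

    // Stage 1: from the last completed checkpoint (CP1 begin) up to the begin
    // marker of the incomplete CP2, apply only physical records.
    // Stage 2: from CP2 begin to the end of the WAL, apply only logical records.
    static List<String> replay(List<Rec> wal, int cp1BeginIdx, int cp2BeginIdx) {
        List<String> applied = new ArrayList<>();
        for (int i = cp1BeginIdx; i < cp2BeginIdx; i++)
            if (wal.get(i).kind == Kind.PHYSICAL)
                applied.add("phys:" + wal.get(i).payload);
        for (int i = cp2BeginIdx; i < wal.size(); i++)
            if (wal.get(i).kind == Kind.LOGICAL)
                applied.add("log:" + wal.get(i).payload);
        return applied;
    }

    // Demo helper: a crash happened after CP2 begin but before CP2 end.
    static List<String> demo() {
        List<Rec> wal = Arrays.asList(
            new Rec(Kind.CP_BEGIN, "cp1"),
            new Rec(Kind.PHYSICAL, "pageA"),
            new Rec(Kind.LOGICAL, "putX"),
            new Rec(Kind.CP_END, "cp1"),
            new Rec(Kind.CP_BEGIN, "cp2"),   // no matching CP_END: incomplete
            new Rec(Kind.PHYSICAL, "pageB"),
            new Rec(Kind.LOGICAL, "putY"));
        return replay(wal, 0, 4);
    }
}
```

Note that in the demo the logical record putX before CP2 begin is skipped (its effect is already in the CP1 pages), while pageB after CP2 begin is skipped as physical.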
If a transaction begin record has no corresponding end record, the transaction's changes are not applied.
Because checkpoints must be consistent, we can’t start the next checkpoint until the previous one has completed.
The following situation is possible:
In that case we block new updates and wait for the running checkpoint to finish.
To avoid this scenario:
The WAL and the page store may be placed on different devices to avoid their mutual influence.
The case where the same records are updated many times may generate load on the WAL but no significant load on the page store.
To provide recovery guarantees, each write (log()) to the WAL should:
fsync is an expensive operation. There is an optimisation for the case when updates arrive faster than disk writes: a fsyncDelayNanos delay (1 ns-1 ms, 1 ns by default) is used to park threads so that more than one fsync request accumulates.
Future optimisation: a standalone thread will be responsible for writing data to disk; worker threads will do all the preparation and hand the buffer over for writing.
See also WAL history size section below.
There are several levels of guarantees (WALMode):
| WALMode | Operation on commit | Guarantees |
|---|---|---|
| DEFAULT | fsync() on each commit | Survives any crash (OS and process crash) |
| LOG_ONLY | write() on commit; synchronisation is the responsibility of the OS | Survives a process kill, but not an OS failure |
| BACKGROUND | nothing on commit (records accumulate in memory); write() on timeout | kill -9 may cause loss of several latest updates |
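As a configuration sketch, the mode can be set via PersistentStoreConfiguration in the 2.1/2.2-era API this document describes (in 2.3+ the persistence configuration classes were reworked, so check the API of your version; the LOG_ONLY choice below is just an example trade-off):

```java
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.configuration.PersistentStoreConfiguration;
import org.apache.ignite.configuration.WALMode;

public class WalModeConfig {
    public static IgniteConfiguration config() {
        // Example choice: LOG_ONLY trades fsync-per-commit durability for
        // speed; data survives a process kill but not an OS failure.
        PersistentStoreConfiguration psCfg = new PersistentStoreConfiguration()
            .setWalMode(WALMode.LOG_ONLY);

        return new IgniteConfiguration()
            .setPersistentStoreConfiguration(psCfg);
    }
}
```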
But several nodes contain the same data, so it is possible to restore data from other nodes.
Partition update counter: this mechanism was already used in continuous queries.
Each update increments the counter, which is replicated to backups. If the counters on primary and backup are equal, replication is finished.
Partition update counters are saved together with update records in the WAL.
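The counter comparison and the resulting rebalance decision can be sketched as follows (hypothetical helpers; the real decision logic is more involved): a backup is in sync when its counter has caught up, and a joining node gets a delta (historical) rebalance only if the WAL still covers its counter gap.

```java
public class RebalanceCheckSketch {
    // Replication of a partition is complete when the backup's update
    // counter has caught up with the primary's.
    static boolean inSync(long primaryCntr, long backupCntr) {
        return primaryCntr == backupCntr;
    }

    // Hypothetical decision for a joining node: historical (delta) rebalance
    // if the WAL still covers the counter gap, otherwise a full rebalance.
    static String rebalanceMode(long joiningCntr, long clusterCntr, long oldestCntrInWal) {
        if (joiningCntr >= clusterCntr)
            return "NONE";  // nothing newer to transfer
        return joiningCntr >= oldestCntrInWal ? "DELTA" : "FULL";
    }
}
```

With the example from the text (joining node at counter 50, cluster at 150), the mode depends on whether the retained WAL history still reaches back to counter 50.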
Consider a partition on a joining node that was in OWNING state with update counter = 50, while existing nodes have update counter = 150.
Node join causes partition map exchange; the update counter is sent along with other partition data. (The joining node gets a new ID, so from the point of view of discovery it is a new node.)
The coordinator observes the outdated partition state and forces the partition into MOVING state. Forcing MOVING is required to set up uploading of newer data.
Rebalancing of fresh data to the joined node may now run in 2 modes:
Possible future optimisation: for a full update we may send the page store file over the network.
The order in which nodes join is not relevant; it is possible that the oldest node has an older partition state while a joining node has a higher partition counter. In this case rebalancing will be triggered by the coordinator and performed from the newly joined node to the existing one (note this behaviour may change under IEP-4 Baseline topology for caches).
In the corner case we need to keep the WAL for only 1 checkpoint in the past for successful recovery (PersistentStoreConfiguration#walHistSize).
We can’t delete WAL segments based only on a history size in bytes or in segments: the WAL can be replayed only starting from a checkpoint marker.
WAL history size is therefore measured in a number of checkpoints.
Assuming that checkpoints are triggered mostly by timeout, we can estimate the possible downtime after which a node can still be rebalanced using delta (logical) WAL records.
By default the WAL history size is 20, to increase the probability that rebalancing can be done using logical deltas from the WAL.
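Under these assumptions the downtime bound is simple arithmetic; the 3-minute checkpoint frequency below is a hypothetical example value, not a quote of any particular default.

```java
public class WalHistoryEstimate {
    // If checkpoints fire mostly on timeout, the WAL history size (measured
    // in checkpoints) roughly bounds how long a node may stay down and still
    // be rebalanced with logical WAL deltas instead of a full transfer.
    static long maxDowntimeMillis(int walHistSize, long checkpointFreqMillis) {
        return (long) walHistSize * checkpointFreqMillis;
    }
}
```

For example, 20 checkpoints of history at a 3-minute checkpoint frequency cover roughly one hour of node downtime.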