Crash recovery can be local (a node restores its own state from its WAL and page store) or distributed (data is restored using the copies kept on other nodes).
Ignite Durable Memory is the basis for all data structures; no cache state is kept on the Java heap anymore.
To save cache state to disk we could dump all of its pages to disk. The first prototypes used exactly this simple approach: stop all updates and save every page.
But we cannot control the moment a node crashes, and we cannot write all memory pages to disk on every change - that would be far too slow. Updates therefore have to be persisted incrementally.
Suppose we have saved the tree leaves but not the tree root (pages may be written out of order because allocation is multithreaded). In that case all updates would be lost.
The technique that solves this is called write-ahead logging: before performing an actual update, we append a record describing the planned change to a cyclic file called the WAL log (the operation is called WAL append / WAL log).
After a crash we can read and replay the WAL on top of the already saved page set, restoring the node to the last committed state of the crashed process. Recovery is thus based on the page set plus the WAL. See also the WAL structure section below.
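The append-then-apply protocol and replay can be illustrated with a minimal model. This is a sketch only - pages are modeled as a map from page id to a value, and the WAL as an in-memory list; all names here are illustrative and not Ignite's actual classes:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Minimal write-ahead-logging model: log the change first, apply it second. */
public class WalSketch {
    /** One WAL record: "set page {pageId} to {value}" (illustrative only). */
    record WalRecord(long pageId, int value) {}

    final List<WalRecord> wal = new ArrayList<>();    // append-only log
    final Map<Long, Integer> pages = new HashMap<>(); // in-memory page state

    /** Update protocol: append to the WAL before touching the page. */
    void update(long pageId, int value) {
        wal.add(new WalRecord(pageId, value)); // 1. WAL append (must reach disk first)
        pages.put(pageId, value);              // 2. actual page update
    }

    /** Recovery: start from the last saved page set and replay the WAL tail. */
    static Map<Long, Integer> recover(Map<Long, Integer> savedPages, List<WalRecord> wal) {
        Map<Long, Integer> restored = new HashMap<>(savedPages);
        for (WalRecord r : wal)
            restored.put(r.pageId(), r.value()); // records are replayed in append order
        return restored;
    }
}
```

Because every change is logged before it is applied, replaying the surviving WAL over the last saved page set reproduces the last committed state even if the in-memory pages are lost.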
In practice we cannot replay the WAL from the beginning of time - Volume(HDD) < Volume(full WAL) - so we need a procedure that throws out the oldest changes from the WAL.
This procedure is called checkpointing. A checkpoint can be of two types: Sharp Checkpoint (implemented) and Fuzzy Checkpoint (to do).
To achieve consistency a checkpoint read-write lock is used (see GridCacheDatabaseSharedManager#checkpointLock).
While the CP write lock is held, we do the following:
Then the CP write lock is released, and updates and transactions can run again.
Dirty pages form a set: when a page goes from clean to dirty, it is added to this set.
This collection of pages (GridCacheDatabaseSharedManager.Checkpoint#cpPages) lets us gather, and then write out, exactly the pages that changed since the last checkpoint.
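The dirty-set bookkeeping can be sketched as follows. This is a simplified illustration, not the actual GridCacheDatabaseSharedManager code; the class and method names are hypothetical:

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Sketch of dirty-page tracking between checkpoints (illustrative names). */
public class DirtyPageTracker {
    private Set<Long> dirtyPages = new HashSet<>();

    /** Called when a page transitions from clean to dirty. */
    synchronized void markDirty(long pageId) {
        dirtyPages.add(pageId);
    }

    /** Called under the checkpoint write lock: grab the current dirty set
     *  (this checkpoint's "cpPages") and start a fresh one for new updates. */
    synchronized Set<Long> beginCheckpoint() {
        Set<Long> cpPages = dirtyPages;
        dirtyPages = new HashSet<>(); // updates after this point belong to the next CP
        return Collections.unmodifiableSet(cpPages);
    }
}
```

Swapping the set out under the write lock is what makes the checkpoint "sharp": the captured set is exactly the pages dirtied since the previous checkpoint, and later updates accumulate for the next one.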
In parallel with the process of writing pages to disk, some thread may want to update data in a page that is currently being written.
For this case a checkpoint pool is used for pages updated in parallel with the write. This pool has a size limit.
The copy-on-write technique is used: if a page that is currently under CP is modified, we create a temporary copy of the page.
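The copy-on-write behaviour can be sketched like this. It is a minimal model, not Ignite's actual page implementation; the field and method names are hypothetical:

```java
import java.util.Arrays;

/** Copy-on-write sketch: while the checkpointer is writing a page,
 *  updates go to a temporary copy instead of the original buffer. */
public class CowPage {
    byte[] data;             // page contents
    boolean underCheckpoint; // true while the checkpointer is writing this page
    byte[] tempCopy;         // copy taken from the (limited) checkpoint pool, if any

    CowPage(byte[] data) { this.data = data; }

    /** Update one byte; redirect to a temporary copy while the CP writes the page. */
    void write(int offset, byte value) {
        if (underCheckpoint) {
            if (tempCopy == null)
                tempCopy = Arrays.copyOf(data, data.length); // allocate the copy once
            tempCopy[offset] = value; // checkpointer still sees the old, consistent data
        } else {
            data[offset] = value;
        }
    }

    /** Checkpoint finished writing this page: fold the copy back in. */
    void checkpointDone() {
        underCheckpoint = false;
        if (tempCopy != null) {
            data = tempCopy;
            tempCopy = null;
        }
    }
}
```

The checkpointer always writes the state the page had when the checkpoint began, while concurrent updates land in the copy and become the page's contents once the write completes.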
...
Triggers
WAL file segments and rotation structure
See also WAL history size section below
Crash recovery involves the following records written to the WAL; they come in 2 main types.
For a particular update we write records in the following order:
A possible future optimisation: make the data modified in a PageDeltaRecord refer to the corresponding logical record, so that byte updates are not stored twice. We already have the file WAL pointer - a pointer to a record counted from the beginning of time - and this reference could be used.
Let us assume the node start process is running with existing files.
Ignite manages 2 types of CP markers on disk (standalone files that include a timestamp and a WAL pointer):
If we observe only a CP begin marker and no CP end marker, that CP did not finish and the page store is not consistent.
Suppose we discover begin markers for CP1 and CP2, but an end marker only for CP1.
For a completed checkpoint we apply only physical records; for an incomplete one, only logical records (since the physical data may be corrupted).
Page snapshot records are required to avoid applying data from delta records twice.
When replay is finished, the CP2 end marker is added.
If a transaction begin record has no corresponding end record, the transaction's changes are not applied.
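The rule of skipping unfinished transactions during logical replay can be sketched as a two-pass filter over the WAL. This is an illustrative model with hypothetical record and type names, not Ignite's actual WAL record classes:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch: during logical replay, apply only the data records of transactions
 *  that have a matching commit record in the WAL. */
public class TxReplayFilter {
    /** One logical WAL record; type is BEGIN, DATA or COMMIT (illustrative). */
    record TxRecord(long txId, String type, String payload) {}

    static List<TxRecord> committedData(List<TxRecord> wal) {
        // Pass 1: find transactions that reached their commit record.
        Set<Long> committed = new HashSet<>();
        for (TxRecord r : wal)
            if (r.type().equals("COMMIT"))
                committed.add(r.txId());

        // Pass 2: keep data records of committed transactions only.
        List<TxRecord> toApply = new ArrayList<>();
        for (TxRecord r : wal)
            if (r.type().equals("DATA") && committed.contains(r.txId()))
                toApply.add(r); // changes of unfinished transactions are skipped
        return toApply;
    }
}
```

A transaction interrupted by the crash before its commit record was logged simply leaves no committed marker, so its changes never reach the page store during replay.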
The database uses the following file types:
A consistent state comes only from the WAL and the page store taken together.
Because checkpoints must be consistent, we cannot start the next CP until the previous one has completed.
The following situation is possible:
In that case we block new updates and wait for the running CP to finish.
To avoid such a scenario:
The WAL and the page store may be placed on different devices to avoid their mutual influence.
A workload where the same records are updated many times may generate heavy load on the WAL but no significant load on the page store.
To provide recovery guarantees, each write (log()) to the WAL should:
fsync is an expensive operation, so there is an optimisation for the case when updates arrive faster than the disk can write them: an fsyncDelayNanos delay (1 ns - 1 ms, 1 ns by default) is used to park threads so that more than one fsync request can be accumulated and served by a single fsync.
A possible future optimisation: a standalone thread becomes responsible for writing data to disk, while worker threads do all the preparation and hand the buffer over for writing.
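The group-commit idea behind the fsync delay can be sketched with WAL positions. This is a simplified single-threaded model (the real code parks threads for fsyncDelayNanos; here that step is only noted in a comment), with hypothetical names:

```java
/** Group-commit sketch: a caller asking for durability can piggyback on an
 *  fsync already issued on behalf of an earlier caller. */
public class GroupCommit {
    private long lastSyncedPos; // WAL position covered by the last fsync
    private long writtenPos;    // WAL position written (buffered) so far
    int fsyncCalls;             // how many real fsyncs were issued

    synchronized void append(int bytes) { writtenPos += bytes; }

    /** Make everything up to {@code pos} durable, reusing a prior fsync if possible. */
    synchronized void syncUpTo(long pos) {
        if (pos <= lastSyncedPos)
            return;               // an earlier caller's fsync already covered this position
        // (real code would park here for fsyncDelayNanos to collect more requests)
        lastSyncedPos = writtenPos; // one fsync covers everything written so far
        fsyncCalls++;
    }
}
```

Because each fsync covers the whole written prefix of the log, every request that arrives while (or just before) it runs is satisfied for free, which is exactly why briefly parking requesters pays off under high update rates.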
See also WAL history size section below.
There are several levels of durability guarantees (WALMode):

WALMode | Operation on commit | Warranty
---|---|---
DEFAULT | fsync() on each commit | Survives any crash (OS and process crash)
LOG_ONLY | write() on commit; synchronisation is the responsibility of the OS | Survives a process kill, but not an OS failure
BACKGROUND | do nothing on commit; write() on timeout | kill -9 may cause loss of several latest updates
However, several nodes contain the same data, and it is possible to restore data from the other nodes.
This is driven by the partition update counter, a mechanism already used in continuous queries.
Each update increments the counter, which is replicated to the backups; equal counters on the primary and a backup mean replication to that backup has finished.
The partition update counter is saved together with the update records in the WAL.
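The counter mechanism can be sketched in a few lines. This is a minimal model for a single primary/backup pair; the class and method names are illustrative, not Ignite's:

```java
/** Sketch of the partition update counter: the primary bumps it on every
 *  update and ships it with the update; equal counters mean the backup
 *  has caught up with the primary. */
public class UpdateCounter {
    long primaryCntr;
    long backupCntr;

    /** Primary applies an update and increments its counter. */
    long updateOnPrimary() { return ++primaryCntr; }

    /** Backup applies the replicated update together with its counter. */
    void applyOnBackup(long cntr) { backupCntr = cntr; }

    boolean replicationFinished() { return primaryCntr == backupCntr; }
}
```

The same comparison generalizes to node restarts: a node whose counter lags the cluster's is known to be missing exactly the updates above its counter value.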
Consider a partition on a joining node that is in the OWNING state with update counter = 50, while the existing nodes have update counter = 150.
The node join causes a partition map exchange, and the update counter is sent along with the other partition data. (The joining node gets a new ID, so from the point of view of discovery it is a new node.)
The coordinator observes the outdated partition state and forces the partition into the MOVING state. Forcing it to MOVING is required to set up uploading of the newer data.
Rebalancing of fresh data to the joined node can now run in 2 modes:
A possible future optimisation: for a full update we may send the page store file over the network.
In the corner case we need to keep WAL history for only 1 checkpoint in the past for successful recovery (PersistentStoreConfiguration#walHistSize).
We cannot delete WAL segments based only on a history size in bytes or segments, because the WAL can be replayed only starting from a checkpoint marker.
WAL history size is therefore measured in a number of checkpoints.
Assuming checkpoints are triggered mostly by timeout, we can estimate the possible downtime after which a node can still be rebalanced using logical delta records from the WAL.
By default the WAL history size is 20, to increase the probability that rebalancing can be done using logical deltas from the WAL.
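Putting the counter and the checkpoint-bounded WAL history together, the choice of rebalance mode can be sketched as follows. This is a simplified illustration under the assumption that each retained checkpoint records the partition update counter at its begin marker; the names are hypothetical, not Ignite's actual API:

```java
import java.util.NavigableMap;
import java.util.TreeMap;

/** Sketch: pick the rebalance mode for a joining node. A WAL-delta (historical)
 *  rebalance is possible only if some retained checkpoint began at or below the
 *  joining node's update counter; otherwise the whole partition must be sent. */
public class RebalanceModeChooser {
    /** Retained checkpoints: CP id -> partition update counter at CP begin. */
    final NavigableMap<Long, Long> retainedCheckpoints = new TreeMap<>();

    String chooseMode(long joiningNodeCntr) {
        // the oldest counter still reachable through the retained WAL history
        Long oldestCovered = retainedCheckpoints.isEmpty()
            ? null
            : retainedCheckpoints.firstEntry().getValue();

        if (oldestCovered != null && oldestCovered <= joiningNodeCntr)
            return "HISTORICAL"; // replay logical WAL records from a matching CP
        return "FULL";           // history already discarded: send the whole partition
    }
}
```

This is why a larger WAL history size (more retained checkpoints) raises the probability that a node returning after downtime can be caught up with logical deltas instead of a full partition transfer.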
Notes:
The following notation is used: words written in italics and without spacing denote a class name without the package, or a method name, for example GridCacheMapEntry.