...

  • part-1.bin, part-2.bin (shown as P1, P2, PN in the picture) - cache partition pages.
  • index.bin - index partition data. A special partition with number 65535 is used for SQL indexes and is saved to index.bin.
  • cache_data.dat - special file with stored cache data (configuration). StoredCacheData includes more data than CacheConfiguration, e.g. query entities.
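
The naming rule above can be written down as a small helper. This is only an illustrative sketch: the class and method names are hypothetical and are not taken from the Ignite code base; only the file names and the partition number 65535 come from the list above.

```java
/** Hypothetical helper (not Ignite API): maps a partition number to its file name. */
final class PartitionFileNames {
    /** Partition number reserved for SQL indexes, as described above. */
    static final int INDEX_PARTITION = 65535;

    static String fileNameFor(int partId) {
        if (partId < 0 || partId > INDEX_PARTITION)
            throw new IllegalArgumentException("Partition out of range: " + partId);

        // The index partition goes to index.bin, every other partition to part-N.bin.
        return partId == INDEX_PARTITION ? "index.bin" : "part-" + partId + ".bin";
    }

    public static void main(String[] args) {
        System.out.println(fileNameFor(1));
        System.out.println(fileNameFor(INDEX_PARTITION));
    }
}
```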

 

Persistence

...

Checkpointing

Can be of two types

...

The collection of pages (GridCacheDatabaseSharedManager.Checkpoint#cpPages) lets us gather and then write the pages that were changed since the last checkpoint.

Checkpoint Pool

While pages are being written to disk, another thread may want to update data in a page that is being written.

...

  • The percentage of dirty pages is a trigger for checkpointing (e.g. 75%).
  • A timeout is also a trigger: do a checkpoint every N seconds.
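
The two triggers above can be combined into a single check. The sketch below is a minimal illustration under assumptions: the class, its fields and the way page counts are passed in are all hypothetical, and the real checkpointer may decide differently.

```java
/** Hypothetical sketch (not Ignite API) of the two checkpoint triggers. */
final class CheckpointTrigger {
    private final double dirtyThreshold; // fraction of dirty pages, e.g. 0.75
    private final long intervalMillis;   // the "every N seconds" timeout, in milliseconds

    CheckpointTrigger(double dirtyThreshold, long intervalMillis) {
        this.dirtyThreshold = dirtyThreshold;
        this.intervalMillis = intervalMillis;
    }

    /** Returns true if either the dirty-page share or the timeout calls for a checkpoint. */
    boolean shouldCheckpoint(long dirtyPages, long totalPages, long millisSinceLastCheckpoint) {
        boolean tooManyDirty = totalPages > 0
            && (double) dirtyPages / totalPages >= dirtyThreshold;
        boolean timedOut = millisSinceLastCheckpoint >= intervalMillis;

        return tooManyDirty || timedOut;
    }
}
```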

...

WAL

We can't control the moment when a node crashes. Suppose we have saved the tree leaves but have not saved the tree root (during page allocation they may be reordered because allocation is multithreaded). In this case all updates will be lost.

...

An operation is acknowledged after the operation itself and the page update(s) have been logged. The checkpoint will be started later by its triggers.

  

Crash Recovery

 

Local Crash Recovery

Crash Recovery can be 

  • Local (most databases are able to do this)
  • and distributed (whole cluster state is restored).

...

Crash recovery relies on the following records written to the WAL. They are of 2 main types: logical and physical.

Logical records

    1. Operation description - which operation we want to do. Contains the operation type (put: create or update; remove: delete) and (Key, Value, Version) - DataRecord
    2. Transactional record - this record is a marker of begin, prepare, commit, and rollback of transactions - (TxRecord)
    3. Checkpoint record - marker of the beginning of checkpointing (CheckpointRecord)

Structure of data record:

A data record includes a list of entries (entry operations). Each operation has a cache ID, operation type, key and value. The operation type can be:

  • CREATE - first put in cache, contains key and value.
  • UPDATE - put for an existing key, contains key and value.
  • DELETE - remove key, contains key only, value is absent.

Update and create always contain the full value. At the same time, several updates of the same key within a transaction are merged into the single latest update.
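
The entry structure and the merging of repeated updates can be illustrated as follows. This is a simplified sketch, not the actual DataRecord/DataEntry classes: all names are hypothetical, keys and values are plain strings, and "merging" is modelled by simply keeping the last entry logged for each (cache ID, key) pair.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch (not Ignite API) of data record entries and their merging. */
final class DataRecordSketch {
    enum Op { CREATE, UPDATE, DELETE }

    static final class Entry {
        final int cacheId;
        final Op op;
        final String key;
        final String value; // null only for DELETE

        Entry(int cacheId, Op op, String key, String value) {
            // CREATE and UPDATE carry the full value, DELETE carries the key only.
            if ((op == Op.DELETE) != (value == null))
                throw new IllegalArgumentException("Value must be absent for DELETE only");

            this.cacheId = cacheId;
            this.op = op;
            this.key = key;
            this.value = value;
        }
    }

    /** Keeps only the latest entry for each (cacheId, key) pair of one transaction. */
    static List<Entry> mergeLatest(List<Entry> txEntries) {
        Map<List<Object>, Entry> latest = new LinkedHashMap<>();

        for (Entry e : txEntries)
            latest.put(Arrays.<Object>asList(e.cacheId, e.key), e);

        return new ArrayList<>(latest.values());
    }
}
```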

Physical records

    1. Full page snapshot - record is issued for the first page update after a successful checkpoint. The record is logged when the page state changes from 'clean' to 'dirty' (PageSnapshot)
    2. Delta record - describes a change to a memory region within a page. Subclass of PageDeltaRecord. Contains the bytes changed in the page, e.g. bytes 5-10 were changed to [...,]. These records are relatively small for B+tree records
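
To make the difference between the two physical records concrete, the sketch below treats a page as a plain byte array: a snapshot is a full copy, while a delta carries only an offset and the changed bytes. The class and method names are hypothetical and the representation is an assumption chosen for illustration, not the PageSnapshot or PageDeltaRecord implementation.

```java
/** Hypothetical sketch (not Ignite API): full page snapshot versus byte-range delta. */
final class PageDeltaSketch {
    /** A full page snapshot copies the whole page. */
    static byte[] snapshot(byte[] page) {
        return page.clone();
    }

    /** A delta overwrites only the changed region, e.g. bytes 5-10 of the page. */
    static void applyDelta(byte[] page, int offset, byte[] changed) {
        if (offset < 0 || offset + changed.length > page.length)
            throw new IllegalArgumentException("Delta does not fit into the page");

        System.arraycopy(changed, 0, page, offset, changed.length);
    }
}
```

In this model the delta for "bytes 5-10 were changed" holds 6 bytes plus an offset, regardless of the page size, which is why such records stay small.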

...

For a particular cache entry update we log records in the following order:

  1. logical record with the planned change - DataRecord with several DataEntry items
  2. page record:
    1. option: the page changed by this update was initially clean, so the full page is logged - PageSnapshot,
    2. option: the page was already modified, so a delta record is issued - PageDeltaRecord
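
The ordering above can be traced with a toy model. The sketch is an assumption-laden illustration: the WAL is a list of record names, page state is a set of dirty page IDs, and a checkpoint is assumed to make every page clean again. None of these names or simplifications come from the Ignite code.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch (not Ignite API) of the order in which records are logged. */
final class WalOrderSketch {
    private final List<String> wal = new ArrayList<>();
    private final Set<Long> dirtyPages = new HashSet<>();

    /** Logs one cache entry update that touches the given page. */
    void update(long pageId) {
        wal.add("DataRecord"); // 1. logical record first

        // 2. page record: add() returns true only if the page was clean until now.
        if (dirtyPages.add(pageId))
            wal.add("PageSnapshot");
        else
            wal.add("PageDeltaRecord");
    }

    /** Simplification: after a checkpoint all pages are considered clean. */
    void checkpoint() {
        wal.add("CheckpointRecord");
        dirtyPages.clear();
    }

    List<String> log() {
        return new ArrayList<>(wal);
    }
}
```

Two updates of the same page followed by a checkpoint and a third update produce a snapshot, then a delta, then a snapshot again, because the first update after the checkpoint finds the page clean.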

...