
IDIEP-113

Author:
Sponsor:
Created:
Status: DRAFT


Motivation

Currently, to recover from failures during checkpoints we use physical records in the WAL. Physical records take up most of the WAL size, are required only for crash recovery, and are useful only for a short period of time (since the previous checkpoint).
The size of physical records written between checkpoints is larger than the size of all pages modified between checkpoints, since we have to store a page snapshot record for each modified page plus page delta records if a page is modified more than once between checkpoints.
In the stable workflow (without crashes and rebalances) we process each WAL file several times:

  1. We write records to WAL files
  2. We copy WAL files to the archive
  3. We compact WAL files (remove physical records + compress)

So, in total, we write all physical records twice and read them at least twice (if a historical rebalance is in progress, physical records can be read more than twice).

It looks like we can get rid of these redundant reads/writes and reduce the disk workload.

Description

Why do we need physical records in WAL?

A page write that is in progress during an operating system crash might be only partially completed, leading to an on-disk page that contains a mix of old and new data. The row-level change data normally stored in the WAL is not enough to completely restore such a page during post-crash recovery. To protect against half-written pages we need a persisted copy of the written page somewhere for recovery purposes.

After modifying a data page we don't write the changes to disk immediately; instead, a dedicated checkpoint process starts periodically and flushes all modified (dirty) pages to disk. This approach is commonly used by other DBMSs. Taking this into account, we have to persist at least a copy of each page participating in the checkpoint somewhere until the checkpoint completes. Currently, WAL physical records are used to solve this problem.
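A minimal sketch of this write path, with all names hypothetical and heavily simplified compared to the real Ignite checkpointer: modified pages are only marked dirty in memory, and a periodic checkpoint task flushes them to the page store.

    import java.nio.ByteBuffer;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    /** Simplified illustration: modified pages are only marked dirty in memory
     *  and are flushed to the page store by a periodic checkpoint task. */
    public class DirtyPageTrackingSketch {
        /** Hypothetical in-memory map of dirty pages: pageId -> page contents. */
        private final Map<Long, ByteBuffer> dirtyPages = new ConcurrentHashMap<>();

        /** Called on every page modification: the change is logged to the WAL
         *  (not shown) and the page is only marked dirty, not written to disk. */
        void onPageModified(long pageId, ByteBuffer page) {
            dirtyPages.put(pageId, page);
        }

        /** Periodic checkpoint: flushes all dirty pages collected so far. */
        void startCheckpointer(ScheduledExecutorService scheduler, long intervalMs) {
            scheduler.scheduleAtFixedRate(() -> {
                for (Long pageId : dirtyPages.keySet()) {
                    ByteBuffer page = dirtyPages.remove(pageId);
                    if (page != null)
                        writeToPageStore(pageId, page); // relatively long disk I/O
                }
            }, intervalMs, intervalMs, TimeUnit.MILLISECONDS);
        }

        private void writeToPageStore(long pageId, ByteBuffer page) {
            // The real implementation writes the page at its offset in the partition file.
        }
    }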

Other vendors solutions

How other vendors deal with the same problem:

  • PostgreSQL uses full page images in the WAL. The PostgreSQL server writes the entire content of each disk page to the WAL during the first modification of that page after a checkpoint. Storing the full page image guarantees that the page can be correctly restored, but at the price of increasing the amount of data that must be written to the WAL. Future updates to that page can be done with the usual and much cheaper log records, right up until the next checkpoint. On crash recovery, WAL replay starts from a checkpoint: full page images restore pages directly into the buffer pool and other log records apply changes to the restored pages. [1]
  • MySQL InnoDB storage engine uses doublewrite buffer. The doublewrite buffer is a storage area where InnoDB writes pages flushed from the buffer pool before writing the pages to their proper positions in the InnoDB data files. If there is an operating system failure, storage subsystem failure, or unexpected mysqld process exit in the middle of a page write, InnoDB can find a good copy of the page from the doublewrite buffer during crash recovery. Data is written to the doublewrite buffer in a large sequential chunk, with a single fsync() call to the operating system. [2]
  • Oracle, as a commercial RDBMS, does not attempt to protect against block write failures on every checkpoint. Commercial RDBMSs were developed for high-end storage that makes guarantees about power-loss atomicity (battery-backed controller cards, uninterruptible power supply, etc.). But Oracle DB still has some block recovery techniques: if a corrupted block is encountered, Oracle tries to restore a copy of the block from a replica database instance or from an RMAN backup, and applies WAL records to make the block up to date. [3]

Main proposal 

Our current approach is similar to the PostgreSQL approach, except that we strictly separate page-level (physical) WAL records from logical WAL records, while PostgreSQL uses mixed physical/logical WAL records for the second and subsequent updates of a page after a checkpoint.

To provide the same crash recovery guarantees we can change the approach and, for example, write modified pages twice during a checkpoint: first, sequentially, to a dedicated checkpoint recovery file, and second to the page storage. In this case we can restore any page from the recovery file (instead of the WAL, as we do now) if we crash during a write to the page storage. This approach is similar to the doublewrite buffer used by the MySQL InnoDB engine.
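A minimal sketch of the proposed write order for a single page, assuming a hypothetical recovery-file abstraction (class and method names are illustrative, not the actual Ignite API): the copy is appended sequentially to the recovery file and forced to disk before the page is written to its real offset in the page store.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    /** Illustrative double-write of a checkpoint page: first sequentially to a
     *  recovery file, then to its proper position in the page store. */
    public class DoubleWriteSketch {
        private final FileChannel recoveryFile;  // appended sequentially
        private final FileChannel pageStore;     // written at the page's offset

        DoubleWriteSketch(Path recoveryPath, Path pageStorePath) throws IOException {
            recoveryFile = FileChannel.open(recoveryPath,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
            pageStore = FileChannel.open(pageStorePath,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        }

        /** Writes a single checkpoint page; page size and offsets are simplified. */
        void writeCheckpointPage(long pageId, ByteBuffer page, int pageSize) throws IOException {
            // 1. A copy of the page goes to the recovery file (sequential append).
            recoveryFile.write(page.duplicate());
            recoveryFile.force(false); // the real flow needs only one fsync per file/batch

            // 2. The page itself goes to its offset in the page store.
            pageStore.write(page.duplicate(), pageId * (long)pageSize);
            // If the node crashes here, the good copy in the recovery file
            // is used to restore the page on the next start.
        }
    }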

Changes in checkpoint process

On checkpoint, the pages to be written are collected under the checkpoint write lock (a relatively short period of time). After the write lock is released, a checkpoint start marker is stored to disk, the pages are written to the page store (this takes a relatively long period of time), and finally a checkpoint end marker is stored to disk. On recovery, if we find a checkpoint start marker without the corresponding checkpoint end marker, we know that the database crashed during a checkpoint and the data in the page store can be corrupted.

What needs to be changed: after the checkpoint write lock is released, but before the checkpoint start marker is stored to disk, we should write checkpoint recovery data for each checkpoint page. This data can be written by multiple threads to different files, using the same thread pool as we use for writing pages to the page store. The checkpoint start marker on disk then guarantees that all recovery data is already stored and we can proceed to writing to the page store.
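A hedged sketch of the resulting checkpoint ordering (all method names are placeholders, not the real checkpointer internals): recovery data for every checkpoint page is written and fsync'ed before the start marker, so the presence of the start marker implies that the recovery data is complete.

    import java.util.Collection;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    /** Illustration of the proposed checkpoint ordering. All private methods are
     *  placeholders for the corresponding steps of the real checkpoint. */
    public class CheckpointFlowSketch {
        private final ExecutorService checkpointPool = Executors.newFixedThreadPool(4);

        void doCheckpoint(Collection<Long> dirtyPageIds) throws Exception {
            // 1. Collect pages under the checkpoint write lock (short).
            List<Long> cpPages = List.copyOf(dirtyPageIds);
            // ... write lock released here ...

            // 2. NEW STEP: write recovery data for every checkpoint page, possibly
            //    in parallel to several recovery files, and fsync them.
            writeAllInParallel(cpPages, this::writePageToRecoveryFile);

            // 3. Checkpoint start marker: guarantees recovery data is on disk.
            writeCheckpointStartMarker();

            // 4. Write pages to the page store (long).
            writeAllInParallel(cpPages, this::writePageToPageStore);

            // 5. Checkpoint end marker: page store is consistent, recovery files
            //    are no longer needed and can be dropped.
            writeCheckpointEndMarker();
        }

        private void writeAllInParallel(List<Long> pages, PageWriter writer) throws Exception {
            List<Callable<Void>> tasks = pages.stream()
                .<Callable<Void>>map(id -> () -> { writer.write(id); return null; })
                .toList();
            for (Future<Void> f : checkpointPool.invokeAll(tasks))
                f.get(); // propagate failures
        }

        interface PageWriter { void write(long pageId) throws Exception; }

        private void writePageToRecoveryFile(long pageId) { /* sequential append + fsync */ }
        private void writePageToPageStore(long pageId)   { /* write at the page's offset */ }
        private void writeCheckpointStartMarker()        { /* persist start marker */ }
        private void writeCheckpointEndMarker()          { /* persist end marker */ }
    }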

Changes in crash recovery process

We have two phases of recovery after a crash: binary memory restore and logical records apply. Binary memory restore is required only if we crashed during the last checkpoint, and only pages affected by the last checkpoint need to be restored. Currently, to perform this phase we replay the WAL from the previous checkpoint to the last checkpoint and apply page records (page snapshot records or page delta records) to the page memory. After the binary page memory restore we have consistent page memory and can apply logical records starting from the last checkpoint to make the database up to date.

What needs to be changed: on the binary memory restore phase, before trying to restore from WAL files, we should try to restore from checkpoint recovery files, if these files are found. We still need to replay the WAL starting from the previous checkpoint, because some records other than page snapshots and page deltas need to be applied.
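A sketch of the amended recovery decision on node start (marker and file names such as "cp-start" or "cp-recovery-" are purely illustrative): if recovery files for the interrupted checkpoint exist, pages are restored from them; otherwise the node falls back to replaying physical WAL records, and in both cases logical records are replayed afterwards.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.stream.Stream;

    /** Illustration of the proposed recovery decision on node start. */
    public class RecoverySketch {

        void recoverAfterCrash(Path cpDir) throws IOException {
            if (lastCheckpointIncomplete(cpDir)) {
                if (recoveryFilesExist(cpDir)) {
                    // New path: restore pages directly from checkpoint recovery files.
                    restorePagesFromRecoveryFiles(cpDir);
                } else {
                    // Old path (compatibility): replay physical records from the WAL
                    // written since the previous checkpoint.
                    restorePagesFromWalPhysicalRecords();
                }
            }
            // In both cases, logical WAL records after the last checkpoint are
            // replayed to bring the database up to date.
            applyLogicalWalRecords();
        }

        private boolean lastCheckpointIncomplete(Path cpDir) {
            // Start marker present without the matching end marker (hypothetical file names).
            return Files.exists(cpDir.resolve("cp-start")) && !Files.exists(cpDir.resolve("cp-end"));
        }

        private boolean recoveryFilesExist(Path cpDir) throws IOException {
            try (Stream<Path> files = Files.list(cpDir)) {
                return files.anyMatch(p -> p.getFileName().toString().startsWith("cp-recovery-"));
            }
        }

        private void restorePagesFromRecoveryFiles(Path cpDir) { /* read pages, write to page store */ }
        private void restorePagesFromWalPhysicalRecords()      { /* current behavior */ }
        private void applyLogicalWalRecords()                  { /* logical replay */ }
    }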

Compression and encryption

  • We can use page compression and page compaction to reduce the size of checkpoint recovery files. Since we write checkpoint recovery files in the background, out of the path of cache operations, page compression will not affect the latency of these operations (a minimal illustration is sketched after this list). Page compaction can be used without external dependencies, so it can be enabled by default.
  • We should encrypt page records saved to checkpoint recovery files if encryption is enabled for the cache the page belongs to.
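For illustration only, a minimal sketch of compressing a page copy before it is appended to a recovery file, using the JDK Deflater; the actual proposal would reuse Ignite's existing page compression/compaction and encryption machinery rather than this code.

    import java.util.zip.Deflater;

    /** Illustration: compress a page copy before appending it to a recovery file. */
    public class PageCompressionSketch {

        /** Returns the compressed page, or the original bytes if compression did not help. */
        static byte[] compressPage(byte[] page) {
            Deflater deflater = new Deflater(Deflater.BEST_SPEED);
            try {
                deflater.setInput(page);
                deflater.finish();

                byte[] out = new byte[page.length];
                int len = deflater.deflate(out);

                // If the page is incompressible, keep the original to avoid growth.
                if (len == 0 || !deflater.finished() || len >= page.length)
                    return page;

                byte[] res = new byte[len];
                System.arraycopy(out, 0, res, 0, len);
                return res;
            }
            finally {
                deflater.end();
            }
        }
    }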

Compatibility

To provide compatibility we can keep implementations of both recovery mechanisms in the code (the current one and the new one) and allow the user to configure which one is used. In the next release physical WAL records can remain the default; in the following release the default can be switched to checkpoint recovery files. On recovery, Ignite can decide what to do by analyzing the files present for the current checkpoint.
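A hypothetical configuration switch could expose this choice to the user; the setter below does not exist and is shown only to illustrate where such a flag could live.

    import org.apache.ignite.configuration.DataStorageConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;

    // Purely hypothetical: the commented-out property does not exist yet; it only
    // illustrates how the recovery mechanism could be made configurable.
    public class CompatibilityConfigSketch {
        public static IgniteConfiguration configure() {
            DataStorageConfiguration storageCfg = new DataStorageConfiguration();

            // Hypothetical property: 'true' -> checkpoint recovery files,
            // 'false' -> physical WAL records (current behavior).
            // storageCfg.setWriteRecoveryDataOnCheckpoint(true);

            return new IgniteConfiguration().setDataStorageConfiguration(storageCfg);
        }
    }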

Pros and cons

Pros:

  • Smaller size of stored data (we don't store page delta records, only the final state of the page)
  • Reduced disk workload (we write all modified pages once instead of 2 writes and 2 reads of a larger amount of data)
  • Potentially reduced latency (instead of writing physical records synchronously during data modification, we write only logical records to the WAL, and physical pages are written by checkpointer threads in the background)

Cons:

  • Increased checkpoint duration (we have to write a doubled amount of data during the checkpoint)

Alternative proposal

Alternatively, we can write physical records (page snapshots and page delta records) the same way as we do now, but use different files for the physical and logical WALs. In this case there will be no redundant reads/writes of physical records (the physical WAL will not be archived, but will be deleted after the checkpoint). This approach reduces the disk workload and doesn't increase checkpoint duration, but extra data still has to be written as page delta records for each page modification, and physical records can't be written in the background.
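A rough sketch of record routing under this alternative (class and method names are simplified placeholders): physical records go to a separate, checkpoint-scoped WAL file that is simply deleted once the covering checkpoint completes, while logical records keep going to the archived WAL.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    /** Illustration of splitting the WAL into a logical (archived) file and a
     *  physical (checkpoint-scoped, never archived) file. */
    public class SplitWalSketch {
        private final FileChannel logicalWal;   // archived and compacted as today
        private final FileChannel physicalWal;  // deleted after the covering checkpoint
        private final Path physicalPath;

        SplitWalSketch(Path logicalPath, Path physicalPath) throws IOException {
            this.logicalWal = FileChannel.open(logicalPath,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
            this.physicalWal = FileChannel.open(physicalPath,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
            this.physicalPath = physicalPath;
        }

        void logLogicalRecord(ByteBuffer rec) throws IOException {
            logicalWal.write(rec);   // row-level change record, archived as usual
        }

        void logPhysicalRecord(ByteBuffer rec) throws IOException {
            physicalWal.write(rec);  // page snapshot / page delta, needed only for crash recovery
        }

        /** Called when the covering checkpoint finishes: physical records are no longer needed. */
        void onCheckpointFinished() throws IOException {
            physicalWal.close();
            Files.deleteIfExists(physicalPath);
        }
    }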

Benchmarks

TBD

Risks and Assumptions

A longer checkpoint can lead to write throttling or even OOM if an insufficient checkpoint buffer size is configured.

Discussion Links

// Links to discussions on the devlist, if applicable.

Reference Links

[1] https://wiki.postgresql.org/wiki/Full_page_writes

[2] https://dev.mysql.com/doc/refman/8.0/en/innodb-doublewrite-buffer.html

[3] https://docs.oracle.com/en/database/oracle/oracle-database/19/bradv/rman-block-media-recovery.html

Tickets
