ID | IEP-113 |
Author | |
Sponsor | |
Created | |
Status | DRAFT |
Currently, to recover from failures during checkpoints we use physical records in the WAL. Physical records take up most of the WAL size, are required only for crash recovery, and are useful only for a short period of time (since the previous checkpoint).
The size of physical records written between checkpoints is larger than the total size of all modified pages, since we store a page snapshot record for each modified page plus page delta records if a page is modified more than once between checkpoints. For example, if a page is modified five times between checkpoints, the WAL receives one full page snapshot record and four page delta records, while the checkpoint itself writes that page to the page store only once.
We process the WAL file several times in a stable workflow (without crashes and rebalances):
So, in total, we write all physical records twice and read them at least twice (if a historical rebalance is in progress, physical records can be read more than twice).
It looks like we can get rid of these redundant reads/writes and reduce the disk workload.
A page write that is in progress during an operating system crash might be only partially completed, leading to an on-disk page that contains a mix of old and new data. The row-level change data normally stored in the WAL is not enough to completely restore such a page during post-crash recovery. To protect against half-written pages we need a persisted copy of the written page somewhere for recovery purposes.
After a data page is modified we don't write the changes to disk immediately; instead, a dedicated checkpoint process starts periodically and flushes all modified (dirty) pages to disk. This approach is commonly used by other DBMSs. Taking this into account, we have to persist at least a copy of each page participating in the checkpoint somewhere until the checkpoint completes. Currently, WAL physical records are used to solve this problem.
How other vendors deal with the same problem:
Our current approach is similar to the PostgreSQL approach, except that we strictly separate page-level WAL records from logical WAL records, while PostgreSQL uses mixed physical/logical WAL records for the second and subsequent updates of a page after a checkpoint.
To provide the same crash recovery guarantees we can change the approach and, for example, write modified pages twice during a checkpoint: first sequentially to a checkpoint recovery file, and then to the page storage. In this case we can restore any page from the recovery file (instead of from the WAL, as we do now) if we crash during a write to the page storage. This approach is similar to the doublewrite buffer used by the MySQL InnoDB engine.
On checkpoint, pages to be written are collected under the checkpoint write lock (a relatively short period of time). After the write lock is released, a checkpoint start marker is stored to disk, pages are written to the page store (this takes a relatively long period of time), and finally a checkpoint end marker is stored to disk. On recovery, if we find a checkpoint start marker without a corresponding checkpoint end marker, we know that the node crashed during a checkpoint and data in the page store can be corrupted.
What needs to be changed: after the checkpoint write lock is released, but before the checkpoint start marker is stored to disk, we should write checkpoint recovery data for each checkpoint page. This data can be written to different files by multiple threads, using the same thread pool that we use for writing pages to the page store. The checkpoint start marker on disk guarantees that all recovery data is already stored and we can proceed with writing to the page store. A minimal sketch of this sequence is shown below.
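The following is a minimal sketch of the proposed checkpoint sequence, assuming per-thread recovery files and simple start/end marker files. All file names, types and helpers here are illustrative assumptions, not the actual Ignite internals.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;

/** Illustrative sketch of the proposed checkpoint sequence. */
public class CheckpointFlowSketch {
    static void checkpoint(List<ByteBuffer> dirtyPages, Path cpDir, FileChannel pageStore,
                           ExecutorService cpPool, int threads) throws Exception {
        // Pages were already collected under the checkpoint write lock (short phase).

        // NEW: write recovery data for every checkpoint page before the start marker,
        // each thread appending its share of pages sequentially to its own recovery file.
        int chunk = (dirtyPages.size() + threads - 1) / threads;
        CompletableFuture<?>[] futs = new CompletableFuture<?>[threads];
        for (int t = 0; t < threads; t++) {
            int from = Math.min(dirtyPages.size(), t * chunk);
            int to = Math.min(dirtyPages.size(), from + chunk);
            Path recoveryFile = cpDir.resolve("cp-recovery-" + t + ".bin");
            List<ByteBuffer> part = dirtyPages.subList(from, to);
            futs[t] = CompletableFuture.runAsync(() -> writeRecoveryFile(recoveryFile, part), cpPool);
        }
        CompletableFuture.allOf(futs).join();

        // The start marker on disk guarantees all recovery data is already durable.
        Files.createFile(cpDir.resolve("cp-START"));

        // Long phase: write pages to the page store; a crash here may leave torn pages,
        // which will be restored from the recovery files instead of WAL physical records.
        for (ByteBuffer page : dirtyPages)
            pageStore.write(page.duplicate());
        pageStore.force(true);

        // End marker: the checkpoint is finished, recovery files are no longer needed.
        Files.createFile(cpDir.resolve("cp-END"));
    }

    static void writeRecoveryFile(Path file, List<ByteBuffer> pages) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            for (ByteBuffer page : pages)
                ch.write(page.duplicate());

            ch.force(true); // Recovery data must hit disk before the start marker is written.
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```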
We have two phases of recovery after a crash: binary memory restore and logical record apply. Binary memory restore is required only if we crashed during the last checkpoint, and only pages affected by the last checkpoint need to be restored. So, currently, to perform this phase we replay the WAL from the previous checkpoint to the last checkpoint and apply page records (page snapshot records or page delta records) to page memory. After binary page memory restore we have consistent page memory and can apply logical records starting from the last checkpoint to bring the database up to date.
What needs to be changed: during the binary memory restore phase, before trying to restore from WAL files, we should try to restore from checkpoint recovery files if these files are found, as sketched below. We still need to replay the WAL starting from the previous checkpoint, because some records other than page snapshots and page deltas need to be applied.
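A rough sketch of the adjusted binary memory restore decision on node start; the marker/recovery file names and restore helpers are assumptions carried over from the sketch above, not the actual Ignite code.

```java
import java.nio.file.Files;
import java.nio.file.Path;

/** Illustrative sketch of the binary memory restore decision. */
public class BinaryRestoreSketch {
    static void restoreBinaryMemory(Path cpDir) throws Exception {
        boolean startMarker = Files.exists(cpDir.resolve("cp-START"));
        boolean endMarker = Files.exists(cpDir.resolve("cp-END"));

        // Binary memory restore is needed only if the node crashed during the last checkpoint.
        if (startMarker && !endMarker) {
            if (Files.exists(cpDir.resolve("cp-recovery-0.bin"))) {
                // New approach: restore pages of the interrupted checkpoint
                // from the checkpoint recovery files.
                restoreFromCheckpointRecoveryFiles(cpDir);
            }
            else {
                // Current approach: apply page snapshot / page delta records
                // from the WAL written since the previous checkpoint.
                applyPhysicalWalRecords();
            }
        }

        // In both cases the WAL is still replayed from the previous checkpoint for the
        // remaining record types, and then logical records are applied starting from
        // the last checkpoint to bring the database up to date.
    }

    static void restoreFromCheckpointRecoveryFiles(Path cpDir) { /* omitted */ }

    static void applyPhysicalWalRecords() { /* omitted */ }
}
```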
To provide compatibility we can keep implementations of both recovery mechanisms in the code (the current one and the new one) and allow the user to configure which one is used. In the next release we can use physical WAL records by default; in the following release we can switch the default to checkpoint recovery files. On recovery, Ignite can decide what to do by analyzing the files of the current checkpoint.
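As a rough illustration of such a switch (the property name below is purely hypothetical; this IEP does not define the actual configuration API):

```java
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

/** Illustration of a possible configuration switch between the two recovery mechanisms. */
public class RecoveryConfigSketch {
    static IgniteConfiguration cfg() {
        DataStorageConfiguration dsCfg = new DataStorageConfiguration();

        // true  -> write checkpoint recovery files (new approach)
        // false -> write physical records to WAL (current approach)
        // dsCfg.setWriteRecoveryDataOnCheckpoint(true); // hypothetical setter name

        return new IgniteConfiguration().setDataStorageConfiguration(dsCfg);
    }
}
```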
Pros:
Cons:
Alternatively, we can write physical records (page snapshots and page delta records) the same way as we do now, but use different files for the physical and logical WALs. In this case there will be no redundant reads/writes of physical records (the physical WAL will not be archived, but will be deleted after the checkpoint). This approach reduces the disk workload and doesn't increase checkpoint duration, but extra data is still written as page delta records for each page modification, and physical records can't be written in the background.
The goal of this test is to measure maximum throughput in cases without bottlenecks such as disk overload or throttling due to checkpoint buffer overflow.
Test parameters:
Parameter | Value |
---|---|
Server nodes count | 6 |
Client nodes count | 6 |
Range (unique keys) | 1_000_000 |
Data region size | 1 GB
Checkpoint period | 30s |
Backups count | 1 |
Warmup | 60s |
Duration after warmup | 600s |
Benchmark results:
Parameter | Atomic puts WAL | Atomic puts CP recovery | Implicit Tx puts WAL | Implicit Tx puts CP recovery | Implicit Tx puts FSYNC WAL mode WAL | Implicit Tx puts FSYNC WAL mode CP recovery |
---|---|---|---|---|---|---|
Throughput (rps) | 799 878.67 | 831 060.07 | 360 502.78 | 380 029.78 | 70 865.22 | 73 340.22 |
Latency (ms) | 0.3524 | 0.3378 | 0.8058 | 0.7644 | 4.1272 | 3.9844 |
WAL size (avg per node, bytes) | 41 112 562 156 | 22 269 185 433 | 32 076 973 840 | 24 089 536 023 | 6 839 126 885 | 4 676 597 433 |
Checkpoint parameters (avg per node): | ||||||
Total CP pages written | 153 971 | 152 661 | 150 967 | 148 984 | 160 175 | 161 042 |
Total CP recovery data size (bytes) | | 583 399 088 | | 568 974 574 | | 603 859 277
Total CP recovery data write duration (ms) | | 2 375 | | 2 383 | | 1 931
Total CP pages write duration (ms) | 1 787 | 1 784 | 1 829 | 1 858 | 1 416 | 1 368 |
Total CP fsync duration (ms) | 864 | 811 | 909 | 825 | 870 | 820 |
Total CP duration (ms) | 5 771 | 6 463 | 5 088 | 6 606 | 2 670 | 4 420 |
The goal of this test is to measure disk usage and checkpoint parameters under the same fixed workload (constant RPS).
Test parameters:
Parameter | Value |
---|---|
Server nodes count | 6 |
Client nodes count | 6 |
Range (unique keys) | 1_000_000 |
RPS | 60_000 |
Data region size | 1 GB
Checkpoint period | 30s |
Backups count | 1 |
Warmup | 60s |
Duration after warmup | 600s |
Benchmark results:
Parameter | Atomic puts WAL | Atomic puts CP recovery | Atomic puts CP recovery compression DISABLED | Atomic puts CP recovery compression SNAPPY | Implicit Tx puts WAL | Implicit Tx puts CP recovery | Implicit Tx puts FSYNC WAL mode WAL | Implicit Tx puts FSYNC WAL mode CP recovery |
---|---|---|---|---|---|---|---|---|
Throughput (rps) | 60 000 | 60 000 | 60 000 | 60 000 | 60 000 | 60 000 | 60 000 | 60 000 |
Latency (ms) | 0.1890 | 0.1877 | 0.1883 | 0.1871 | 0.4461 | 0.4414 | 2.2584 | 1.8918 |
WAL size (avg per node, bytes) | 3 695 334 900 | 1 610 914 189 | 1 610 921 067 | 1 610 914 009 | 5 921 194 631 | 3 828 517 908 | 5 913 043 435 | 3 828 528 343 |
Checkpoint parameters (avg per node): | ||||||||
Total CP pages written | 158 714 | 165 068 | 165 540 | 165 833 | 168 161 | 165 078 | 163 835 | 163 489 |
Total CP recovery data size (bytes) | | 618 070 915 | 681 737 645 | 225 645 987 | | 615 981 504 | | 602 789 507
Total CP recovery data write duration (ms) | | 1 782 | 1 930 | 927 | | 1 873 | | 1 943
Total CP pages write duration (ms) | 1 262 | 1 214 | 1 208 | 1 224 | 1 368 | 1 293 | 1 358 | 1 400 |
Total CP fsync duration (ms) | 790 | 802 | 811 | 795 | 849 | 830 | 861 | 889 |
Total CP duration (ms) | 2 454 | 4 079 | 4 253 | 3 225 | 2 682 | 4 322 | 2 589 | 4 541 |
In this test, 1_500_000_000 unique keys are inserted into the cache over 2 hours. This test implies high disk and checkpoint buffer usage.
Test parameters:
Parameter | Value |
---|---|
Server nodes count | 6 |
Client nodes count | 6 |
Range (unique keys) | 1_500_000_000 |
Data region size | 40 GB
Checkpoint buffer size | 10 GB
Checkpoint period | 30s |
Backups count | 0 |
Duration | 7200s |
Benchmark results:
Parameter | Atomic puts WAL | Atomic puts CP recovery compression SNAPPY | Atomic puts CP recovery |
---|---|---|---|
Throughput (rps) | 349 414 | 731 955 | 667 503 |
Latency (ms) | 0.8212 | 0.3796 | 0.4095 |
WAL size (avg per node, bytes) | 1 223 677 262 243 | 107 160 078 079 | 97 724 078 037 |
Checkpoint parameters (avg per node): | |||
Total CP pages written | 154 280 598 | 402 636 438 | 363 780 153 |
Total CP buffer pages used | 10 598 937 | 240 644 447 | 231 946 861 |
Total CP recovery data size (bytes) | | 719 657 813 646 | 1 413 758 409 104
Total CP recovery data write duration (ms) | | 2 669 067 | 3 392 610
Total CP pages write duration (ms) | 1 328 328 | 4 004 378 | 3 407 234 |
Total CP fsync duration (ms) | 843 706 | 40 309 | 33 854 |
Total CP duration (ms) | 2 219 940 | 6 801 086 | 6 906 536 |
Here are charts related to the benchmark (left: recovery data written as WAL physical records; center: recovery data written on checkpoint with SNAPPY compression; right: recovery data written on checkpoint with the default SKIP_GARBAGE compression):
Latency and throughput chart (for batches of 100 puts):
Checkpoint buffer usage chart:
Checkpoint pages number chart:
Throttling:
Disk usage:
As we can see in these tests, storing recovery data on checkpoint gives a throughput boost of about 3-5% in normal cases and up to 2x in extreme cases with heavy disk load. But checkpoint duration also roughly doubles. Since checkpoint buffer pages can't be released during the first half of the checkpoint, this leads to higher checkpoint buffer usage compared to storing recovery data in WAL physical records. High checkpoint buffer usage triggers throttling to protect against checkpoint buffer overflow. The old throttling algorithm (exponential backoff) is not suitable for the new approach to storing recovery data: during the first phase of the checkpoint it can completely stop all page-modifying threads until the recovery data is written. So a new throttling algorithm (based on the fill rate of the checkpoint buffer) has been implemented for the new approach and used in the benchmarks; a simplified sketch of the idea is shown below.
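A simplified sketch of fill-rate-based throttling. The threshold and park times are made-up values for illustration, not the actual Ignite algorithm.

```java
import java.util.concurrent.locks.LockSupport;

/** Illustrative sketch of throttling driven by checkpoint buffer fill rate. */
public class FillRateThrottleSketch {
    /** Parks a page-modifying thread for a time proportional to the buffer fill rate. */
    static void throttleIfNeeded(long usedCpBufferPages, long totalCpBufferPages) {
        double fillRate = (double) usedCpBufferPages / totalCpBufferPages;

        // Below the threshold, page-modifying threads are never stopped completely,
        // unlike exponential backoff, which could halt them during the recovery-data phase.
        if (fillRate < 0.66)
            return;

        // Park time grows smoothly as the buffer approaches overflow (up to 10 ms here).
        long parkNanos = (long) ((fillRate - 0.66) / 0.34 * 10_000_000L);

        LockSupport.parkNanos(parkNanos);
    }
}
```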
The WAL size is dramatically reduced (more than 10 times) with the new approach when a lot of unique pages are modified between checkpoints. When the same pages are modified frequently it still decreases, but by about 2 times.
Enabling compression can reduce the recovery data size up to 3 times and cuts the recovery data write duration roughly in half.
A longer checkpoint can lead to write throttling or even OOM if an insufficient checkpoint buffer size is configured.
https://lists.apache.org/thread/dvzswwf4tm7mt2mwk3xsyd63dybydm53
[1] https://wiki.postgresql.org/wiki/Full_page_writes
[2] https://dev.mysql.com/doc/refman/8.0/en/innodb-doublewrite-buffer.html
[3] https://docs.oracle.com/en/database/oracle/oracle-database/19/bradv/rman-block-media-recovery.html