
IDIEP-113

Author:
Sponsor:
Created:

Status: DRAFT
Table of Contents

Motivation

...

Alternatively, we can write physical records (page snapshots and page delta records) the same way as we do now, but use different files for the physical and logical WALs. In this case there will be no redundant reads/writes of physical records (the physical WAL will not be archived and will be deleted after checkpoint). This approach reduces disk workload and does not increase checkpoint duration, but extra data still has to be written as page delta records for each page modification, and physical records cannot be written in the background.
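A minimal sketch of this alternative, for illustration only (the SplitWalWriter class and method names below are hypothetical and not the actual Ignite WAL manager API): logical records follow the usual archiving path, while physical records go to a separate file that is simply truncated once the checkpoint finishes.

// Illustrative sketch: route WAL records into two separate files.
// Physical records (page snapshots / page deltas) are needed only until the
// next checkpoint, so their file is truncated instead of being archived.
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SplitWalWriter implements AutoCloseable {
    private final FileChannel logicalWal;   // archived as usual
    private final FileChannel physicalWal;  // deleted/truncated after checkpoint

    public SplitWalWriter(Path logicalFile, Path physicalFile) throws IOException {
        logicalWal = FileChannel.open(logicalFile,
            StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        physicalWal = FileChannel.open(physicalFile,
            StandardOpenOption.CREATE, StandardOpenOption.WRITE);
    }

    /** Appends a serialized record to the physical or logical stream. */
    public void append(ByteBuffer record, boolean physical) throws IOException {
        (physical ? physicalWal : logicalWal).write(record);
    }

    /** Called after a checkpoint: physical records are no longer needed for recovery. */
    public void onCheckpointFinished() throws IOException {
        physicalWal.truncate(0);
    }

    @Override public void close() throws IOException {
        logicalWal.close();
        physicalWal.close();
    }
}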

Benchmarks

Full speed test

The goal of this test is to measure maximum throughput in the absence of bottlenecks such as disk overload or throttling due to checkpoint buffer overflow.


Test parameters:

Parameter | Value
Server nodes count | 6
Client nodes count | 6
Range (unique keys) | 1_000_000
Data region size | 1 GB
Checkpoint period | 30 s
Backups count | 1
Warmup | 60 s
Duration after warmup | 600 s


Benchmark results:

Parameter | Atomic puts, WAL | Atomic puts, CP recovery | Implicit Tx puts, WAL | Implicit Tx puts, CP recovery | Implicit Tx puts, FSYNC WAL mode, WAL | Implicit Tx puts, FSYNC WAL mode, CP recovery
Throughput (rps) | 799 878.67 | 831 060.07 | 360 502.78 | 380 029.78 | 70 865.22 | 73 340.22
Latency (ms) | 0.3524 | 0.3378 | 0.8058 | 0.7644 | 4.1272 | 3.9844
WAL size (avg per node, bytes) | 41 112 562 156 | 22 269 185 433 | 32 076 973 840 | 24 089 536 023 | 6 839 126 885 | 4 676 597 433
Checkpoint parameters (avg per node):
Total CP pages written | 153 971 | 152 661 | 150 967 | 148 984 | 160 175 | 161 042
Total CP recovery data size (bytes) | - | 583 399 088 | - | 568 974 574 | - | 603 859 277
Total CP recovery data write duration (ms) | - | 2 375 | - | 2 383 | - | 1 931
Total CP pages write duration (ms) | 1 787 | 1 784 | 1 829 | 1 858 | 1 416 | 1 368
Total CP fsync duration (ms) | 864 | 811 | 909 | 825 | 870 | 820
Total CP duration (ms) | 5 771 | 6 463 | 5 088 | 6 606 | 2 670 | 4 420

Fixed rate test

The goal of this test is to measure disk usage and checkpoint parameters under the same fixed-rate workload for all variants.


Test parameters:

Parameter | Value
Server nodes count | 6
Client nodes count | 6
Range (unique keys) | 1_000_000
RPS | 60_000
Data region size | 1 GB
Checkpoint period | 30 s
Backups count | 1
Warmup | 60 s
Duration after warmup | 600 s


Benchmark results:

Parameter | Atomic puts, WAL | Atomic puts, CP recovery | Atomic puts, CP recovery, compression DISABLED | Atomic puts, CP recovery, compression SNAPPY | Implicit Tx puts, WAL | Implicit Tx puts, CP recovery | Implicit Tx puts, FSYNC WAL mode, WAL | Implicit Tx puts, FSYNC WAL mode, CP recovery
Throughput (rps) | 60 000 | 60 000 | 60 000 | 60 000 | 60 000 | 60 000 | 60 000 | 60 000
Latency (ms) | 0.1890 | 0.1877 | 0.1883 | 0.1871 | 0.4461 | 0.4414 | 2.2584 | 1.8918
WAL size (avg per node, bytes) | 3 695 334 900 | 1 610 914 189 | 1 610 921 067 | 1 610 914 009 | 5 921 194 631 | 3 828 517 908 | 5 913 043 435 | 3 828 528 343
Checkpoint parameters (avg per node):
Total CP pages written | 158 714 | 165 068 | 165 540 | 165 833 | 168 161 | 165 078 | 163 835 | 163 489
Total CP recovery data size (bytes) | - | 618 070 915 | 681 737 645 | 225 645 987 | - | 615 981 504 | - | 602 789 507
Total CP recovery data write duration (ms) | - | 1 782 | 1 930 | 927 | - | 1 873 | - | 1 943
Total CP pages write duration (ms) | 1 262 | 1 214 | 1 208 | 1 224 | 1 368 | 1 293 | 1 358 | 1 400
Total CP fsync duration (ms) | 790 | 802 | 811 | 795 | 849 | 830 | 861 | 889
Total CP duration (ms) | 2 454 | 4 079 | 4 253 | 3 225 | 2 682 | 4 322 | 2 589 | 4 541


Extreme load test

In this test 1_500_000_000 unique keys are inserted into the cache over 2 hours. This test implies high disk and checkpoint buffer usage.


Test parameters:

Parameter | Value
Server nodes count | 6
Client nodes count | 6
Range (unique keys) | 15_000_000_000
Data region size | 40 GB
Checkpoint buffer size | 10 GB
Checkpoint period | 30 s
Backups count | 0
Duration | 7200 s


Benchmark results:

Parameter | Atomic puts, WAL | Atomic puts, CP recovery, compression SNAPPY | Atomic puts, CP recovery
Throughput (rps) | 349 414 | 731 955 | 667 503
Latency (ms) | 0.8212 | 0.3796 | 0.4095
WAL size (avg per node, bytes) | 1 223 677 262 243 | 107 160 078 079 | 97 724 078 037
Checkpoint parameters (avg per node):
Total CP pages written | 154 280 598 | 402 636 438 | 363 780 153
Total CP buffer pages used | 10 598 937 | 240 644 447 | 231 946 861
Total CP recovery data size (bytes) | - | 719 657 813 646 | 1 413 758 409 104
Total CP recovery data write duration (ms) | - | 2 669 067 | 3 392 610
Total CP pages write duration (ms) | 1 328 328 | 4 004 378 | 3 407 234
Total CP fsync duration (ms) | 843 706 | 40 309 | 33 854
Total CP duration (ms) | 2 219 940 | 6 801 086 | 6 906 536


Here are charts related to this benchmark (on the left: recovery data written as WAL physical records; in the center: recovery data written on checkpoint with SNAPPY compression; on the right: recovery data written on checkpoint with the default SKIP_GARBAGE compression):

Latency and throughput chart (for a batch of 100 puts): [image]

Checkpoint buffer usage chart: [image]

Checkpoint pages number chart: [image]

Throttling: [image]

Disk usage: [image]

Summary

As these tests show, storing recovery data on checkpoint gives a throughput boost of about 3-5% in normal cases and up to 2x in extreme cases with heavy disk load. However, checkpoint time roughly doubles. Since checkpoint buffer pages cannot be released during the first half of the checkpoint (while recovery data is being written), checkpoint buffer usage is higher than with the approach of storing recovery data in WAL physical records. High checkpoint buffer usage triggers throttling to protect against checkpoint buffer overflow. The old throttling algorithm (exponential backoff) is not suitable for the new approach: during the first phase of the checkpoint it can completely stop all page-modifying threads until the recovery data is written. Therefore, a new throttling algorithm (based on the fill rate of the checkpoint buffer) has been implemented for the new approach and used in the benchmarks.
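A minimal sketch of fill-rate based throttling, to illustrate the idea only (the class name and thresholds below are illustrative and not the actual Ignite implementation): the park time grows with the checkpoint buffer fill rate instead of backing off exponentially, so page-modifying threads are slowed down gradually rather than stopped.

import java.util.concurrent.locks.LockSupport;

/** Illustrative sketch: throttle page modifications based on checkpoint buffer fill rate. */
public class FillRateThrottle {
    /** No throttling below this fill rate (illustrative threshold). */
    private static final double START_THRESHOLD = 0.66;

    /** Maximum park time per page modification, in nanoseconds (illustrative). */
    private static final long MAX_PARK_NANOS = 4_000_000L; // 4 ms

    /** Called before a thread modifies a page while a checkpoint is in progress. */
    public void onMarkDirty(long usedCheckpointBufferPages, long maxCheckpointBufferPages) {
        double fillRate = (double) usedCheckpointBufferPages / maxCheckpointBufferPages;

        if (fillRate <= START_THRESHOLD)
            return; // buffer usage is low enough, no throttling

        // Scale park time from 0 to MAX_PARK_NANOS as fill rate grows from the
        // threshold to 1.0, so load is shed gradually instead of stopping threads.
        double severity = (fillRate - START_THRESHOLD) / (1.0 - START_THRESHOLD);

        LockSupport.parkNanos((long) (severity * MAX_PARK_NANOS));
    }
}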

WAL size is dramatically reduced (more than 10 times) with the new approach when a lot of unique pages are modified between checkpoints. When the same pages are modified frequently, WAL size still decreases, but only by about 2 times.

Enabling compression can reduce recovery data size by up to 3 times and reduce recovery data write duration by about half. TBD
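A minimal configuration sketch of how enabling the new mode together with recovery-data compression might look; the setWriteRecoveryDataOnCheckpoint and setCheckpointRecoveryDataCompression property names are assumptions for illustration (only DataStorageConfiguration and the DiskPageCompression enum are existing Ignite types):

import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.DiskPageCompression;
import org.apache.ignite.configuration.IgniteConfiguration;

public class RecoveryDataConfigExample {
    public static IgniteConfiguration configure() {
        DataStorageConfiguration storageCfg = new DataStorageConfiguration();

        // Assumed property names, for illustration only: enable writing recovery
        // data on checkpoint instead of WAL physical records, and compress it.
        storageCfg.setWriteRecoveryDataOnCheckpoint(true);                            // assumed name
        storageCfg.setCheckpointRecoveryDataCompression(DiskPageCompression.SNAPPY);  // assumed name

        return new IgniteConfiguration().setDataStorageConfiguration(storageCfg);
    }
}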

Risks and Assumptions

Longer checkpoint time can lead to write throttling or even OOM if an insufficient checkpoint buffer size is configured.
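One possible mitigation is to size the checkpoint buffer explicitly instead of relying on the default. A minimal sketch using the existing DataRegionConfiguration.setCheckpointPageBufferSize property (the 40 GB region and 10 GB buffer mirror the extreme load test parameters and are an example, not a recommendation):

import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class CheckpointBufferSizing {
    public static IgniteConfiguration configure() {
        DataRegionConfiguration regionCfg = new DataRegionConfiguration()
            .setName("default")
            .setPersistenceEnabled(true)
            .setMaxSize(40L * 1024 * 1024 * 1024)                    // 40 GB data region
            // A larger checkpoint buffer leaves headroom while recovery data is
            // written during the first phase of the checkpoint (example value).
            .setCheckpointPageBufferSize(10L * 1024 * 1024 * 1024);  // 10 GB

        DataStorageConfiguration storageCfg = new DataStorageConfiguration()
            .setDefaultDataRegionConfiguration(regionCfg);

        return new IgniteConfiguration().setDataStorageConfiguration(storageCfg);
    }
}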

...