
IDIEP-113

Author:
Sponsor:
Created:

Status: DRAFT
Table of Contents

Motivation

...

Alternatively, we can write physical records (page snapshots and page delta records) the same way as we do now, but use different files for the physical and logical WALs. In this case there will be no redundant reads/writes of physical records (the physical WAL will not be archived and will be deleted after checkpoint). This approach reduces disk workload and does not increase checkpoint duration, but extra data still has to be written as page delta records for each page modification, and physical records cannot be written in the background.
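A minimal sketch of this alternative, for illustration only (the SplitWalWriter class and method names below are hypothetical and not the actual Ignite WAL manager API): logical records follow the usual archiving path, while physical records go to a separate file that is simply truncated once the checkpoint finishes.

// Illustrative sketch: route WAL records into two separate files.
// Physical records (page snapshots / page deltas) are needed only until the
// next checkpoint, so their file is truncated instead of being archived.
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SplitWalWriter implements AutoCloseable {
    private final FileChannel logicalWal;   // archived as usual
    private final FileChannel physicalWal;  // deleted/truncated after checkpoint

    public SplitWalWriter(Path logicalFile, Path physicalFile) throws IOException {
        logicalWal = FileChannel.open(logicalFile,
            StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        physicalWal = FileChannel.open(physicalFile,
            StandardOpenOption.CREATE, StandardOpenOption.WRITE);
    }

    /** Appends a serialized record to the physical or logical stream. */
    public void append(ByteBuffer record, boolean physical) throws IOException {
        (physical ? physicalWal : logicalWal).write(record);
    }

    /** Called after a checkpoint: physical records are no longer needed for recovery. */
    public void onCheckpointFinished() throws IOException {
        physicalWal.truncate(0);
    }

    @Override public void close() throws IOException {
        logicalWal.close();
        physicalWal.close();
    }
}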

Benchmarks

Full speed test

The goal of this test is to measure maximum throughput in the absence of bottlenecks such as disk overload or throttling due to checkpoint buffer overflow.


Test parameters:

Parameter | Value
Server nodes count | 6
Client nodes count | 6
Range (unique keys) | 1_000_000
Data region size | 1 GB
Checkpoint period | 30 s
Backups count | 1
Warmup | 60 s
Duration after warmup | 600 s


Benchmark results:

Parameter | Atomic puts, WAL | Atomic puts, CP recovery | Implicit Tx puts, WAL | Implicit Tx puts, CP recovery | Implicit Tx puts, FSYNC WAL mode, WAL | Implicit Tx puts, FSYNC WAL mode, CP recovery
Throughput (rps) | 799 878.67 | 831 060.07 | 360 502.78 | 380 029.78 | 70 865.22 | 73 340.22
Latency (ms) | 0.3524 | 0.3378 | 0.8058 | 0.7644 | 4.1272 | 3.9844
WAL size (avg per node, bytes) | 41 112 562 156 | 22 269 185 433 | 32 076 973 840 | 24 089 536 023 | 6 839 126 885 | 4 676 597 433
Checkpoint parameters (avg per node):
Total CP pages written | 153 971 | 152 661 | 150 967 | 148 984 | 160 175 | 161 042
Total CP recovery data size (bytes) | - | 583 399 088 | - | 568 974 574 | - | 603 859 277
Total CP recovery data write duration (ms) | - | 2 375 | - | 2 383 | - | 1 931
Total CP pages write duration (ms) | 1 787 | 1 784 | 1 829 | 1 858 | 1 416 | 1 368
Total CP fsync duration (ms) | 864 | 811 | 909 | 825 | 870 | 820
Total CP duration (ms) | 5 771 | 6 463 | 5 088 | 6 606 | 2 670 | 4 420

Fixed rate test

The goal of this test is to measure disk usage and checkpoint parameters under the same fixed-rate workload for all variants.


Test parameters:

Parameter | Value
Server nodes count | 6
Client nodes count | 6
Range (unique keys) | 1_000_000
RPS | 60_000
Data region size | 1 GB
Checkpoint period | 30 s
Backups count | 1
Warmup | 60 s
Duration after warmup | 600 s


Benchmark results:

Parameter | Atomic puts, WAL | Atomic puts, CP recovery | Atomic puts, CP recovery, compression DISABLED | Atomic puts, CP recovery, compression SNAPPY | Implicit Tx puts, WAL | Implicit Tx puts, CP recovery | Implicit Tx puts, FSYNC WAL mode, WAL | Implicit Tx puts, FSYNC WAL mode, CP recovery
Throughput (rps) | 60 000 | 60 000 | 60 000 | 60 000 | 60 000 | 60 000 | 60 000 | 60 000
Latency (ms) | 0.1890 | 0.1877 | 0.1883 | 0.1871 | 0.4461 | 0.4414 | 2.2584 | 1.8918
WAL size (avg per node, bytes) | 3 695 334 900 | 1 610 914 189 | 1 610 921 067 | 1 610 914 009 | 5 921 194 631 | 3 828 517 908 | 5 913 043 435 | 3 828 528 343
Checkpoint parameters (avg per node):
Total CP pages written | 158 714 | 165 068 | 165 540 | 165 833 | 168 161 | 165 078 | 163 835 | 163 489
Total CP recovery data size (bytes) | - | 618 070 915 | 681 737 645 | 225 645 987 | - | 615 981 504 | - | 602 789 507
Total CP recovery data write duration (ms) | - | 1 782 | 1 930 | 927 | - | 1 873 | - | 1 943
Total CP pages write duration (ms) | 1 262 | 1 214 | 1 208 | 1 224 | 1 368 | 1 293 | 1 358 | 1 400
Total CP fsync duration (ms) | 790 | 802 | 811 | 795 | 849 | 830 | 861 | 889
Total CP duration (ms) | 2 454 | 4 079 | 4 253 | 3 225 | 2 682 | 4 322 | 2 589 | 4 541


Extreme load test

In this test 1_500_000_000 unique keys are inserted into the cache over 2 hours. This test implies high disk and checkpoint buffer usage.


Test parameters:

Parameter | Value
Server nodes count | 6
Client nodes count | 6
Range (unique keys) | 15_000_000_000
Data region size | 40 GB
Checkpoint buffer size | 10 GB
Checkpoint period | 30 s
Backups count | 0
Duration | 7200 s


Benchmark results:

Parameter | Atomic puts, WAL | Atomic puts, CP recovery, compression SNAPPY | Atomic puts, CP recovery
Throughput (rps) | 349 414 | 731 955 | 667 503
Latency (ms) | 0.8212 | 0.3796 | 0.4095
WAL size (avg per node, bytes) | 1 223 677 262 243 | 107 160 078 079 | 97 724 078 037
Checkpoint parameters (avg per node):
Total CP pages written | 154 280 598 | 402 636 438 | 363 780 153
Total CP buffer pages used | 10 598 937 | 240 644 447 | 231 946 861
Total CP recovery data size (bytes) | - | 719 657 813 646 | 1 413 758 409 104
Total CP recovery data write duration (ms) | - | 2 669 067 | 3 392 610
Total CP pages write duration (ms) | 1 328 328 | 4 004 378 | 3 407 234
Total CP fsync duration (ms) | 843 706 | 40 309 | 33 854
Total CP duration (ms) | 2 219 940 | 6 801 086 | 6 906 536


Here are charts related to this benchmark (on the left: recovery data written as WAL physical records; in the center: recovery data written on checkpoint with SNAPPY compression; on the right: recovery data written on checkpoint with the default SKIP_GARBAGE compression):

Latency and throughput chart (for a batch of 100 puts): [image]

Checkpoint buffer usage chart: [image]

Checkpoint pages number chart: [image]

Throttling: [image]

Disk usage: [image]

Summary

As these tests show, storing recovery data on checkpoint gives a throughput boost of about 3-5% in normal cases and up to 2x in extreme cases with heavy disk load. However, checkpoint time roughly doubles. Since checkpoint buffer pages cannot be released during the first half of the checkpoint (while recovery data is being written), checkpoint buffer usage is higher than with the approach of storing recovery data in WAL physical records. High checkpoint buffer usage triggers throttling to protect against checkpoint buffer overflow. The old throttling algorithm (exponential backoff) is not suitable for the new approach: during the first phase of the checkpoint it can completely stop all page-modifying threads until the recovery data is written. Therefore, a new throttling algorithm (based on the fill rate of the checkpoint buffer) has been implemented for the new approach and used in the benchmarks.
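A minimal sketch of fill-rate based throttling, to illustrate the idea only (the class name and thresholds below are illustrative and not the actual Ignite implementation): the park time grows with the checkpoint buffer fill rate instead of backing off exponentially, so page-modifying threads are slowed down gradually rather than stopped.

import java.util.concurrent.locks.LockSupport;

/** Illustrative sketch: throttle page modifications based on checkpoint buffer fill rate. */
public class FillRateThrottle {
    /** No throttling below this fill rate (illustrative threshold). */
    private static final double START_THRESHOLD = 0.66;

    /** Maximum park time per page modification, in nanoseconds (illustrative). */
    private static final long MAX_PARK_NANOS = 4_000_000L; // 4 ms

    /** Called before a thread modifies a page while a checkpoint is in progress. */
    public void onMarkDirty(long usedCheckpointBufferPages, long maxCheckpointBufferPages) {
        double fillRate = (double) usedCheckpointBufferPages / maxCheckpointBufferPages;

        if (fillRate <= START_THRESHOLD)
            return; // buffer usage is low enough, no throttling

        // Scale park time from 0 to MAX_PARK_NANOS as fill rate grows from the
        // threshold to 1.0, so load is shed gradually instead of stopping threads.
        double severity = (fillRate - START_THRESHOLD) / (1.0 - START_THRESHOLD);

        LockSupport.parkNanos((long) (severity * MAX_PARK_NANOS));
    }
}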

WAL size is dramatically reduced (more than 10 times) with the new approach when a lot of unique pages are modified between checkpoints. When the same pages are modified frequently, WAL size still decreases, but only by about 2 times.

Enabling compression can reduce recovery data size by up to 3 times and reduce recovery data write duration by about half. TBD
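A minimal configuration sketch of how enabling the new mode together with recovery-data compression might look; the setWriteRecoveryDataOnCheckpoint and setCheckpointRecoveryDataCompression property names are assumptions for illustration (only DataStorageConfiguration and the DiskPageCompression enum are existing Ignite types):

import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.DiskPageCompression;
import org.apache.ignite.configuration.IgniteConfiguration;

public class RecoveryDataConfigExample {
    public static IgniteConfiguration configure() {
        DataStorageConfiguration storageCfg = new DataStorageConfiguration();

        // Assumed property names, for illustration only: enable writing recovery
        // data on checkpoint instead of WAL physical records, and compress it.
        storageCfg.setWriteRecoveryDataOnCheckpoint(true);                            // assumed name
        storageCfg.setCheckpointRecoveryDataCompression(DiskPageCompression.SNAPPY);  // assumed name

        return new IgniteConfiguration().setDataStorageConfiguration(storageCfg);
    }
}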

Risks and Assumptions

Longer checkpoint time can lead to write throttling or even OOM if an insufficient checkpoint buffer size is configured.
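One possible mitigation is to size the checkpoint buffer explicitly instead of relying on the default. A minimal sketch using the existing DataRegionConfiguration.setCheckpointPageBufferSize property (the 40 GB region and 10 GB buffer mirror the extreme load test parameters and are an example, not a recommendation):

import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class CheckpointBufferSizing {
    public static IgniteConfiguration configure() {
        DataRegionConfiguration regionCfg = new DataRegionConfiguration()
            .setName("default")
            .setPersistenceEnabled(true)
            .setMaxSize(40L * 1024 * 1024 * 1024)                    // 40 GB data region
            // A larger checkpoint buffer leaves headroom while recovery data is
            // written during the first phase of the checkpoint (example value).
            .setCheckpointPageBufferSize(10L * 1024 * 1024 * 1024);  // 10 GB

        DataStorageConfiguration storageCfg = new DataStorageConfiguration()
            .setDefaultDataRegionConfiguration(regionCfg);

        return new IgniteConfiguration().setDataStorageConfiguration(storageCfg);
    }
}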

...