This page describes the merge policies and merge schedulers used in AsterixDB. A merge policy decides which disk components should be merged, i.e., it creates merge operations. A merge scheduler executes the merge operations created by the merge policy. In general, the merge policy controls the overall I/O cost of reads and writes, while the merge scheduler mainly impacts write latencies.

Merge Policy

AsterixDB supports a number of merge policies based on full merges [1]. In full merges, disk components (per storage partition) are ordered based on recency and are always merged in full. Whenever a new disk component is created, either by a flush or a merge, the merge policy will be called to see whether more merges can be created.

Concurrent Merge Policy (default)

The concurrent merge policy is the default merge policy for the current master. It has four important parameters:

  • minMergeComponentCount: the minimum number of disk components per merge (default: 3)
  • maxMergeComponentCount: the maximum number of disk components per merge (default: 10)
  • maxComponentCount: the maximum number of disk components allowed before stopping flushes (default: 30)
  • sizeRatio: explained below (default: 1.2)

Suppose we have a list of disk components D1, D2, ..., DN ordered from oldest to newest. The basic idea of this merge policy is to merge as many disk components as possible at once (without exceeding maxMergeComponentCount), using the size ratio to make this trade-off. It triggers a merge as follows:

  • Start with the oldest disk component D1. Look at the longest younger disk component sequence D2, D3, ..., DK (so that the merge would contain at most maxMergeComponentCount components). If the length of D2, D3, ..., DK >= minMergeComponentCount and the total size of D2, D3, ..., DK * sizeRatio >= the size of D1, trigger a merge of D1, D2, ..., DK.
  • If a merge cannot be triggered for D1, repeat the above step for D2 (and so on).

As the name suggests, this merge policy performs concurrent merges, i.e., multiple merges may be created and running within the same time period. One important constraint is that we can never merge a disk component that is already being merged. Thus, to support concurrent merges, the above merge decision process is refined to start from the oldest component that is younger than all disk components currently being merged. For example, suppose we have D1, D2, D3, D4, D5 and D2+D3 are being merged. The next call of this merge policy will start looking from D4 (and later D5).
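The decision process might be sketched as follows in Java. This is a minimal illustration rather than the actual AsterixDB implementation: the class and method names are made up, sizes are in arbitrary units, and, following the wording above, the minimum count is checked against the younger sequence D2, ..., DK (the real policy may count the whole merge).

    import java.util.List;

    // Minimal sketch of the concurrent merge policy decision (illustrative only).
    final class ConcurrentMergePolicySketch {
        private final int minMergeComponentCount;
        private final int maxMergeComponentCount;
        private final double sizeRatio;

        ConcurrentMergePolicySketch(int minMergeComponentCount, int maxMergeComponentCount, double sizeRatio) {
            this.minMergeComponentCount = minMergeComponentCount;
            this.maxMergeComponentCount = maxMergeComponentCount;
            this.sizeRatio = sizeRatio;
        }

        // sizes: disk component sizes ordered from oldest to newest.
        // merging: whether the component at the same index is already being merged.
        // Returns the inclusive index range {from, to} of the components to merge, or null.
        int[] decideMerge(List<Long> sizes, List<Boolean> merging) {
            // Concurrency refinement: only look at components that are younger than
            // every component currently being merged.
            int start = 0;
            for (int i = 0; i < sizes.size(); i++) {
                if (merging.get(i)) {
                    start = i + 1;
                }
            }
            for (int i = start; i < sizes.size(); i++) {
                // Longest younger sequence, capped so the whole merge has at most
                // maxMergeComponentCount components.
                int end = Math.min(sizes.size() - 1, i + maxMergeComponentCount - 1);
                long youngerTotal = 0;
                for (int j = i + 1; j <= end; j++) {
                    youngerTotal += sizes.get(j);
                }
                int youngerCount = end - i;
                // Size-ratio condition from the description above.
                if (youngerCount >= minMergeComponentCount && youngerTotal * sizeRatio >= sizes.get(i)) {
                    return new int[] {i, end}; // merge D1, D2, ..., DK (here: components i..end)
                }
            }
            return null; // no merge can be triggered right now
        }
    }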

Consider the example depicted in the figure below. Suppose at most 4 disk components can be merged at once. The first call of this merge policy will create a merge of components 10GB, 10GB, 5GB, and 5GB. The oldest 100GB component is excluded because the combination of younger components is not large enough to trigger a merge. The second call of this merge policy would start from the component labeled 1GB, and trigger a merge of components 128MB, 96MB, and 64MB.

[Figure: example disk components and the merges selected by the concurrent merge policy]
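As a rough sanity check, the sketch above reproduces the two merges from this example when fed the component sizes from the figure. The parameter values below are hypothetical (not the defaults); they were chosen so that at most 4 components are merged at once, as assumed in the example.

    import java.util.Arrays;
    import java.util.Collections;
    import java.util.List;

    class ConcurrentMergeExample {
        public static void main(String[] args) {
            // Component sizes from the example, oldest to newest, in MB.
            List<Long> sizes = List.of(102_400L, 10_240L, 10_240L, 5_120L, 5_120L, 1_024L, 128L, 96L, 64L);
            // Hypothetical parameters: min = 2, max = 4, size ratio = 1.2.
            ConcurrentMergePolicySketch policy = new ConcurrentMergePolicySketch(2, 4, 1.2);

            // First call: nothing is being merged yet.
            List<Boolean> idle = Collections.nCopies(sizes.size(), false);
            System.out.println(Arrays.toString(policy.decideMerge(sizes, idle)));
            // Prints [1, 4]: merge the 10GB, 10GB, 5GB, and 5GB components.

            // Second call: the components of the first merge are still being merged.
            List<Boolean> merging =
                    Arrays.asList(false, true, true, true, true, false, false, false, false);
            System.out.println(Arrays.toString(policy.decideMerge(sizes, merging)));
            // Prints [6, 8]: merge the 128MB, 96MB, and 64MB components.
        }
    }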


The size ratio controls some interesting performance trade-offs. By setting the size ratio > 1, it gives a leveling-style merge policy [1]: there will be a list of disk components with exponentially-increasing sizes. (To see why, note that when no merge can be triggered, every disk component must be larger than sizeRatio times the total size of all younger components, so with a size ratio > 1 the component sizes grow at least geometrically from newest to oldest.) By setting the size ratio < 1, it gives a tiering-style merge policy [2]: there will be more disk components, but write performance is improved.


The expected performance trends of write throughput are illustrated below.


[Figures: expected write throughput trends]


In general, increasing the size ratio will increase the merge frequency and reduce the number of disk components. Thus, this will decrease the write throughput but improve query performance. To see this, consider two extremes. When the size ratio is set to 0, no merge will be performed at all. When the size ratio is set to an extremely large value, this merge policy will try to maintain just one giant disk component: whenever the number of disk components reaches "minMergeComponentCount", these disk components will be merged together into one. Under an update-heavy workload, the size ratio also controls the space utilization: more disk components result in lower space utilization because of more obsolete records (which are not eliminated until merges occur).

More explanation of these trade-offs can be found in [1]. More explanation of this merge policy can be found in my performance stability paper [2] (Section 5.3). It should be noted that this merge policy is highly dynamic and non-deterministic. Don't be surprised if different partitions have different storage layouts when using this merge policy.

Prefix Merge Policy

The prefix merge policy is similar to the concurrent merge policy but has some important differences:

  • The prefix merge policy has an additional parameter "maxMergableComponentSize" to control the maximum disk component size. All disk components larger than this parameter will be excluded from future merging.
  • Only one merge is scheduled at a time (as opposed to concurrent merges).
  • The flow control mechanism (i.e., deciding when to stop flushes) is different: when there is an ongoing merge but an extra merge could be scheduled, flushes will be stopped. In contrast, the concurrent merge policy blocks flushes when the total number of disk components exceeds maxComponentCount. (A small sketch of the two conditions follows this list.)
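
Here is a minimal sketch of the two flow-control conditions, with hypothetical method names; it is not the actual AsterixDB code.

    // Illustrative flow-control predicates (hypothetical names, not AsterixDB code).
    final class FlowControlSketch {
        // Prefix policy: stop flushes while one merge is running but the policy
        // already wants to schedule another.
        static boolean prefixShouldStopFlushes(boolean mergeOngoing, boolean anotherMergeWanted) {
            return mergeOngoing && anotherMergeWanted;
        }

        // Concurrent policy: stop flushes once too many disk components accumulate.
        static boolean concurrentShouldStopFlushes(int numDiskComponents, int maxComponentCount) {
            return numDiskComponents > maxComponentCount;
        }
    }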

In general, the prefix merge policy (with its merge cap) should not be used for an update-heavy workload because we have to keep merging to clean up obsolete records. However, this policy can be helpful for temporal and append-mostly workloads (with range filters).

Constant Merge Policy

The constant merge policy has been recently improved to incorporate a very interesting theoretical study [3]. This policy is designed for append-mostly workloads and only requires one parameter, "num-components", to control the total number of disk components. I won't describe the full details here (because I don't completely understand the analysis), but it has been shown that this merge policy has optimal write cost when the number of disk components is fixed and the workload is append-only.

Merge Scheduler

Greedy Scheduler (default)

The greedy scheduler always allocates the full I/O bandwidth to the merge operation with the smallest number of pages within an LSM-tree. It has been shown in [2] that the greedy scheduler is useful for reducing the number of disk components over time, thus minimizing write stalls and improving query performance.
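
A toy model of this scheduling decision is sketched below. The names (OngoingMerge, pickMergeToRun) are hypothetical; this is not the actual AsterixDB scheduler.

    import java.util.Comparator;
    import java.util.List;
    import java.util.Optional;

    // A toy model of the greedy scheduling decision (illustrative only).
    final class GreedySchedulerSketch {
        // A hypothetical view of one ongoing merge of an LSM-tree.
        record OngoingMerge(String id, long remainingPages) {}

        // Give the full I/O bandwidth to the ongoing merge with the fewest pages;
        // the other merges of this LSM-tree wait until it finishes.
        static Optional<OngoingMerge> pickMergeToRun(List<OngoingMerge> ongoingMerges) {
            return ongoingMerges.stream()
                    .min(Comparator.comparingLong(OngoingMerge::remainingPages));
        }
    }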

Note1: the greedy scheduler shouldn't be used when measuring the maximum write throughput of the system. Otherwise, it would report a higher but unsustainable write throughput by starving large merges. Instead, the async scheduler should be used during performance testing.

Note2: the greedy scheduler is only effective when used together with the concurrent merge policy. Other merge policies will only create one merge operation at a time, which makes the greedy scheduler useless.

Async Scheduler

The async scheduler allocates the I/O bandwidth evenly to all ongoing merges of an LSM-tree. This ensures fairness among all merges. Even though this is not a good scheduler at runtime, the async scheduler should be used when measuring the maximum write throughput of the system because it avoids starvation.
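
For contrast with the greedy sketch above, even sharing simply splits the I/O budget across all ongoing merges (a hypothetical helper, not AsterixDB code):

    // Illustrative even split of the I/O budget across ongoing merges.
    final class AsyncSchedulerSketch {
        static double bandwidthPerMerge(double totalIoBandwidth, int numOngoingMerges) {
            return numOngoingMerges == 0 ? 0.0 : totalIoBandwidth / numOngoingMerges;
        }
    }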

[1] Chen Luo, Michael J. Carey. LSM-based Storage Techniques: A Survey, https://arxiv.org/abs/1812.07527.

...