
This page describes the merge policies and merge schedulers used in AsterixDB. A merge policy decides which disk components should be merged, i.e., it creates merge operations. A merge scheduler executes the merge operations created by the merge policy. In general, the merge policy controls the overall I/O cost of reads and writes, and the merge scheduler mainly impacts write latencies.


Merge Policy

AsterixDB supports a number of merge policies based on full merges [1]. In full merges, disk components (per storage partition) are ordered based on recency and are always merged in full. Whenever a new disk component is created, either by a flush or a merge, the merge policy will be called to see whether more merges can be created.

Concurrent Merge Policy (default)

The concurrent merge policy is the default merge policy for the current master. It has four important parameters:

  • minMergeComponentCount: the minimum number of disk components per merge (default: 3)
  • maxMergeComponentCount: the maximum number of disk components per merge (default: 10)
  • maxComponentCount: the maximum number of disk components allowed before flushes are stopped (default: 30)
  • sizeRatio: explained below (default: 1.2)

Suppose we have a list of disk components D1, D2, ..., DN ordered from oldest to newest. The basic idea of this merge policy is to merge as many disk components as possible at once (without exceeding maxMergeComponentCount), using the size ratio to decide when a merge is worthwhile. It triggers a merge as follows:

  • Start with the oldest disk component D1. Look at the longest younger disk component sequence D2, D3, ..., DK (capped so that the whole merge D1, D2, ..., DK contains at most maxMergeComponentCount components). If the number of components D1, D2, ..., DK is at least minMergeComponentCount and the total size of D2, D3, ..., DK multiplied by sizeRatio is at least the size of D1, trigger a merge of D1, D2, ..., DK.
  • If the merge cannot be triggered for D1, repeat the above step starting from D2.

As the name suggests, this merge policy performs concurrent merges, i.e., multiple merge operations may be ongoing at the same time. One important constraint is that a disk component that is already being merged can never participate in another merge. Thus, to support concurrent merges, the above decision process is refined to start from the oldest disk component that is younger than all components currently being merged (i.e., only the components created after the most recent ongoing merge are considered).
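To make the decision procedure concrete, below is a minimal, self-contained Java sketch of it. The class and method names (ConcurrentMergeDecisionSketch, decideMerge) are hypothetical, and the sketch works on plain size arrays; the actual policy in AsterixDB operates on disk component metadata and also implements the maxComponentCount flow control, which is omitted here.

    /**
     * A minimal sketch of the decision procedure described above. The class and
     * method names are hypothetical; the real policy operates on AsterixDB disk
     * component metadata and also handles the maxComponentCount flow control,
     * which is omitted here.
     */
    public class ConcurrentMergeDecisionSketch {
        private final int minMergeComponentCount;
        private final int maxMergeComponentCount;
        private final double sizeRatio;

        public ConcurrentMergeDecisionSketch(int minMergeComponentCount, int maxMergeComponentCount,
                double sizeRatio) {
            this.minMergeComponentCount = minMergeComponentCount;
            this.maxMergeComponentCount = maxMergeComponentCount;
            this.sizeRatio = sizeRatio;
        }

        /**
         * @param sizes   disk component sizes in bytes, ordered from oldest to newest
         * @param merging merging[i] is true if component i is already being merged
         * @return the inclusive index range [start, end] of the components to merge,
         *         or null if no merge should be triggered
         */
        public int[] decideMerge(long[] sizes, boolean[] merging) {
            // Only consider components younger than the most recent ongoing merge,
            // since a component that is already being merged cannot participate again.
            int first = 0;
            for (int i = 0; i < sizes.length; i++) {
                if (merging[i]) {
                    first = i + 1;
                }
            }
            // Try each remaining component as the oldest component D1 of a candidate merge.
            for (int i = first; i < sizes.length; i++) {
                // Longest younger sequence, capped so the whole merge has at most
                // maxMergeComponentCount components.
                int last = Math.min(sizes.length - 1, i + maxMergeComponentCount - 1);
                long youngerTotal = 0;
                for (int j = i + 1; j <= last; j++) {
                    youngerTotal += sizes[j];
                }
                int mergeLength = last - i + 1;
                if (mergeLength >= minMergeComponentCount && youngerTotal * sizeRatio >= sizes[i]) {
                    return new int[] { i, last };
                }
                // Otherwise, repeat the check starting from the next component.
            }
            return null;
        }
    }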

Consider the example depicted in the figure below. Suppose at most 4 disk components can be merged at once. The first call of this merge policy creates a merge of the 10GB, 10GB, 5GB, and 5GB components. The oldest 100GB component is excluded because its younger components are not large enough to trigger a merge. The second call of this merge policy starts from the component labeled 1GB and triggers a merge of the 128MB, 96MB, and 64MB components.
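For instance, the following main method (which could be dropped into the sketch class above) reproduces both decisions, using the component sizes mentioned in the example; the exact byte values are only illustrative.

    public static void main(String[] args) {
        // Component sizes from the example above, ordered from oldest to newest.
        long MB = 1L << 20, GB = 1L << 30;
        long[] sizes = { 100 * GB, 10 * GB, 10 * GB, 5 * GB, 5 * GB, 1 * GB, 128 * MB, 96 * MB, 64 * MB };
        boolean[] merging = new boolean[sizes.length];

        // minMergeComponentCount = 3, maxMergeComponentCount = 4, sizeRatio = 1.2
        ConcurrentMergeDecisionSketch policy = new ConcurrentMergeDecisionSketch(3, 4, 1.2);

        int[] firstMerge = policy.decideMerge(sizes, merging);   // [1, 4]: the 10GB, 10GB, 5GB, 5GB components
        for (int i = firstMerge[0]; i <= firstMerge[1]; i++) {
            merging[i] = true;                                    // mark them as being merged
        }
        int[] secondMerge = policy.decideMerge(sizes, merging);  // [6, 8]: the 128MB, 96MB, 64MB components

        System.out.println(java.util.Arrays.toString(firstMerge));
        System.out.println(java.util.Arrays.toString(secondMerge));
    }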


The size ratio controls some interesting performance trade-offs. Setting the size ratio > 1 gives a leveling-style merge policy [1]: the disk components form a list with exponentially increasing sizes. Setting the size ratio < 1 gives a tiering-style merge policy [1]. The expected performance trends of write throughput are illustrated below.



In general, increasing the size ratio increases the merge frequency and reduces the number of disk components. Thus, it decreases the write throughput but improves query performance. To see this, consider two extremes. When the size ratio is set to 0, no merge is performed at all. When the size ratio is set to an extremely large value, there is just one giant disk component, which is merged with the newly flushed component whenever a flush completes. Under an update-heavy workload, the size ratio also controls space utilization: more disk components result in lower space utilization because more obsolete records are retained.
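To make the trade-off concrete, the trigger condition above implies that a component of size S only joins a merge once its younger neighbors accumulate at least S / sizeRatio in total. For the 100GB component in the example, that is roughly 83GB at a size ratio of 1.2 and 200GB at a size ratio of 0.5; as the size ratio approaches 0 the threshold grows without bound (no merges at all), and as it becomes very large the threshold approaches 0 (a merge after every flush).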

More explanation of these trade-offs can be found in [1]. More explanation of this merge policy can be found in [2] (Section 5.3). It should be noted that this merge policy is highly dynamic and non-deterministic. Don't be surprised if different partitions have different storage layouts when using this merge policy.

Prefix Merge Policy

The prefix merge policy is similar to the concurrent merge policy but has some important differences:

  • The prefix merge policy has an additional parameter, "maxMergableComponentSize", which controls the maximum disk component size. All disk components larger than this value are excluded from future merging.
  • Only one merge is scheduled at a time (as opposed to concurrent merges).
  • The flow control mechanism (i.e., deciding when to stop flushes) is different: if a merge is already ongoing and the policy decides that another merge is needed, flushes are stopped until the ongoing merge completes. In contrast, the concurrent merge policy blocks flushes when the total number of disk components exceeds maxComponentCount. A small sketch of these rules follows this list.
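
The following Java fragment is a minimal sketch of the first and third rules above (size-based exclusion and flow control). The class and method names are hypothetical; the real prefix merge policy implementation in AsterixDB contains additional logic, such as how the merged prefix itself is chosen, that is omitted here.

    import java.util.ArrayList;
    import java.util.List;

    /** A minimal sketch of the prefix policy rules listed above (hypothetical names). */
    public class PrefixMergeSketch {
        private final long maxMergableComponentSize;

        public PrefixMergeSketch(long maxMergableComponentSize) {
            this.maxMergableComponentSize = maxMergableComponentSize;
        }

        /** Components larger than the size cap are excluded from merge consideration. */
        public List<Long> mergableComponents(List<Long> componentSizes) {
            List<Long> candidates = new ArrayList<>();
            for (long size : componentSizes) {
                if (size <= maxMergableComponentSize) {
                    candidates.add(size);
                }
            }
            return candidates;
        }

        /** Flow control: block flushes if a merge is running and another merge is already needed. */
        public boolean shouldStopFlushes(boolean mergeOngoing, boolean anotherMergeNeeded) {
            return mergeOngoing && anotherMergeNeeded;
        }
    }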

In general, the prefix merge policy (with its maxMergableComponentSize cap) should not be used for update-heavy workloads, since such workloads require continuous merging to clean up obsolete records. However, this policy can be helpful for temporal, append-mostly workloads (with range filters).

Constant Merge Policy

The constant merge policy has recently been improved to incorporate a very interesting theoretical study [3]. This policy is designed for append-mostly workloads and only requires one parameter, "num-components", to control the total number of disk components. It has been shown that this merge policy has optimal write cost when the number of disk components is fixed and the workload is append-only.

Merge Scheduler

Greedy Scheduler (default)

The greedy scheduler always allocates the full I/O bandwidth to the merge operation of an LSM-tree that has the smallest number of pages. It has been shown in [2] that the greedy scheduler is useful for reducing the number of disk components over time, thus minimizing write stalls and improving query performance.
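As a rough illustration, the scheduling rule can be sketched as follows. The types and names are hypothetical and only capture the "give all bandwidth to the smallest merge" idea; the real scheduler manages the actual I/O operations and bandwidth accounting.

    import java.util.List;

    /** A minimal sketch of the greedy scheduling rule (hypothetical types and names). */
    public class GreedySchedulerSketch {

        /** An ongoing merge operation and the number of pages it has to process. */
        public static class MergeOp {
            final String name;
            final long numPages;

            public MergeOp(String name, long numPages) {
                this.name = name;
                this.numPages = numPages;
            }
        }

        /** Pick the ongoing merge of an LSM-tree with the smallest number of pages. */
        public static MergeOp pickNext(List<MergeOp> ongoingMerges) {
            MergeOp smallest = null;
            for (MergeOp op : ongoingMerges) {
                if (smallest == null || op.numPages < smallest.numPages) {
                    smallest = op;
                }
            }
            return smallest;
        }
    }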

Note 1: the greedy scheduler shouldn't be used when measuring the maximum write throughput of the system. Otherwise, it would report a higher but unsustainable write throughput by starving large merges. Instead, the async scheduler should be used during performance testing.

Note 2: the greedy scheduler is only effective when used together with the concurrent merge policy. Other merge policies create only one merge operation at a time, which makes the greedy scheduler useless.

Async Scheduler

The async scheduler allocates the I/O bandwidth evenly to all ongoing merges of an LSM-tree, which ensures fairness among merges. Even though this is not the best scheduler for normal operation, the async scheduler should be used when measuring the maximum write throughput of the system because it avoids starvation.

[1] Chen Luo, Michael J. Carey. LSM-based Storage Techniques: A Survey. https://arxiv.org/abs/1812.07527.

[2] Chen Luo, Michael J. Carey. On Performance Stability in LSM-based Storage Systems. https://arxiv.org/abs/1906.09667.

[3] Claire Mathieu, Carl Staelin, Neal E. Young, Arman Yousefi. Bigtable Merge Compaction. https://arxiv.org/abs/1407.3008.
