This page describes the merge policies and merge schedulers used in AsterixDB. A merge policy is used to decide which disk components should be merged, i.e., creating merge operations. A merge scheduler is used to execute the merge operations created by the merge policy. In general, the merge policy controls the overall I/O cost of reads and writes, and the merge scheduler mainly impacts write latencies.

Merge Policy

AsterixDB supports a number of merge policies based on full merges [1]. In full merges, disk components (per storage partition) are ordered based on recency and are always merged in full. Whenever a new disk component is created, either by a flush or a merge, the merge policy will be called to see whether more merges can be created.

...

  • Start with the oldest disk component D1 and look at the longest sequence of younger disk components D2, D3, ..., DK. If the length of D2, D3, ..., DK is at least minMergeComponentCount and the total size of D2, D3, ..., DK times sizeRatio is at least the size of D1, trigger a merge of D1, D2, ..., DK.
  • If the merge cannot be triggered for D1, then repeat the above step for D2.

As the name suggests, this merge policy performs concurrent merges, i.e., there may be multiple created merges within a time period. One important constraint here is that we can never merge a disk component that is already being merged. Thus, to support concurrent merges, the above merge decision process is refined by starting with the oldest component where no older disk component is being merged.

Consider the example depicted in the figure below. Suppose at most 4 disk components can be merged at once. The first call of this merge policy will create a merge of components 10GB, 10GB, 5GB, and 5GB. The oldest 100GB component is excluded because the combined size of its younger components is not large enough to trigger a merge. The second call of this merge policy would start from the component labeled 1GB, and trigger a merge of components 128MB, 96MB, and 64MB.
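
The refined decision process can be sketched as follows. This is a hypothetical simplification, not the actual AsterixDB implementation; the parameter values are illustrative and chosen to reproduce the example above (sizes are in MB, ordered oldest first).

```java
// Sketch of the concurrent merge policy's decision step (assumed
// simplification, not the real AsterixDB code).
public class ConcurrentMergeSketch {
    // Illustrative parameter values, not AsterixDB defaults.
    static final double SIZE_RATIO = 1.2;
    static final int MIN_MERGE_COMPONENT_COUNT = 2;  // min length of D2..DK
    static final int MAX_MERGE_COMPONENT_COUNT = 4;  // max components per merge

    /**
     * sizes: disk component sizes in MB, oldest first.
     * merging: whether each component is already being merged.
     * Returns the half-open index range [start, end) to merge, or null.
     */
    static int[] decideMerge(long[] sizes, boolean[] merging) {
        // Start with the oldest component younger than every component
        // that is currently being merged.
        int start = 0;
        for (int i = 0; i < sizes.length; i++) {
            if (merging[i]) start = i + 1;
        }
        for (int i = start; i < sizes.length; i++) {
            // Longest younger sequence, capped by the max merge size.
            int end = Math.min(sizes.length, i + MAX_MERGE_COMPONENT_COUNT);
            long youngerTotal = 0;
            for (int j = i + 1; j < end; j++) {
                youngerTotal += sizes[j];
            }
            int youngerCount = end - i - 1;
            if (youngerCount >= MIN_MERGE_COMPONENT_COUNT
                    && youngerTotal * SIZE_RATIO >= sizes[i]) {
                return new int[] {i, end};
            }
        }
        return null; // no merge can be triggered
    }
}
```

Running this on the example layout (100GB, 10GB, 10GB, 5GB, 5GB, 1GB, 128MB, 96MB, 64MB) selects the four 10GB/5GB components first; after marking them as merging, a second call selects the 128MB, 96MB, and 64MB components, matching the figure.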

...

In general, increasing the size ratio will increase the merge frequency and reduce the number of disk components. Thus, this will improve query performance but decrease write throughput. To see this, consider two extremes. When the size ratio is set to 0, no merges will be performed at all. When the size ratio is set to an extremely large value, this merge policy will try to maintain just one giant disk component: whenever the number of disk components reaches minMergeComponentCount, these disk components will be merged together into one. Under an update-heavy workload, the size ratio also controls the space utilization. More disk components will result in lower space utilization because of more obsolete records (which are not eliminated until merges occur).
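
The two extremes can be checked with a toy, single-partition simulation. This is an assumed simplification of the policy (equal-sized flushes, greedy full merges, no merge-size cap, no concurrency), not AsterixDB's actual behavior.

```java
import java.util.ArrayList;
import java.util.List;

// Toy simulation of the size-ratio extremes (assumed simplification).
public class SizeRatioExtremes {
    /**
     * Flush `flushes` components of `flushSize` each, applying the trigger
     * condition greedily after every flush; returns the final component count.
     */
    static int simulate(int flushes, long flushSize, double sizeRatio, int minMergeCount) {
        List<Long> components = new ArrayList<>(); // oldest first
        for (int f = 0; f < flushes; f++) {
            components.add(flushSize); // a flush creates a new component
            // Try to trigger a merge, starting from the oldest component.
            for (int i = 0; i < components.size(); i++) {
                long youngerTotal = 0;
                for (int j = i + 1; j < components.size(); j++) {
                    youngerTotal += components.get(j);
                }
                int youngerCount = components.size() - i - 1;
                if (youngerCount >= minMergeCount
                        && youngerTotal * sizeRatio >= components.get(i)) {
                    long merged = components.get(i) + youngerTotal;
                    components.subList(i, components.size()).clear();
                    components.add(merged); // full merge into one component
                    break;
                }
            }
        }
        return components.size();
    }
}
```

With sizeRatio = 0 the trigger condition never holds, so all flushed components pile up; with an extremely large sizeRatio, the components collapse back toward a single giant one as soon as minMergeComponentCount younger components accumulate.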

More explanation of these trade-offs can be found in [1]. More explanation of this merge policy can be found in [2] (Section 5.3). It should be noted that this merge policy is highly dynamic and non-deterministic. Don't be surprised if different partitions have different storage layouts when using this merge policy.

...

  • The prefix merge policy has an additional parameter "maxMergableComponentSize" to control the maximum disk component size. All disk components larger than this parameter will be excluded from future merges.
  • Only one merge is scheduled at a time (as opposed to concurrent merges).
  • The flow control mechanism (i.e., deciding when to stop flushes) is different: when there is an ongoing merge but an extra merge can be scheduled, then flushes will be stopped. In contrast, the concurrent merge policy blocks flushes when the total number of disk components is larger than maxComponentCount.
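
The contrast between the two flow-control rules can be sketched as a pair of predicates. The method and parameter names here are assumptions for illustration, not actual AsterixDB internals.

```java
// Illustrative flow-control predicates; names are hypothetical,
// not real AsterixDB APIs.
public class FlowControlSketch {
    // Prefix policy: stop flushes when a merge is already ongoing
    // and yet another merge could be scheduled.
    static boolean prefixStopsFlush(boolean mergeOngoing, boolean extraMergeSchedulable) {
        return mergeOngoing && extraMergeSchedulable;
    }

    // Concurrent policy: stop flushes only when the total number of
    // disk components exceeds maxComponentCount.
    static boolean concurrentStopsFlush(int componentCount, int maxComponentCount) {
        return componentCount > maxComponentCount;
    }
}
```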

In general, the prefix merge policy (with its merge cap) should not be used for an update-heavy workload because we have to keep merging to clean up obsolete records. However, this policy can be helpful for temporal and append-mostly workloads (with range filters).

...