Motivation

For some datasets and applications (like Cloudberry), it is desirable that all disk components of the primary index and of all secondary indexes of a dataset align on the same filter value boundaries. The benefit is that when a tuple is found in some disk component Di of a secondary index, we can directly search the corresponding component Di' of the primary index to fetch that tuple, without checking any other disk components.

Current Workflow of Flush/Merge

Currently, the workflow of the flush operation is as follows. After a transaction (insert/delete/upsert) commits, if any memory component of any index of the dataset needs to be flushed (i.e., is full), the primary index operation tracker submits a flush request for every index of the dataset to the ILSMIOOperationScheduler. That is, all indexes of a dataset are always flushed together, so the disk components newly generated by flushes are always aligned.
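
A minimal sketch of this fan-out, using hypothetical stand-ins for the real operation tracker and ILSMIOOperationScheduler interfaces (the type and method names below are illustrative, not AsterixDB's actual signatures):

```java
import java.util.List;

// Hypothetical, simplified stand-ins for the real AsterixDB interfaces.
interface LsmIndex {
    boolean isMemoryComponentFull();
    Runnable createFlushOperation();
}

interface IoOperationScheduler {
    void scheduleOperation(Runnable op);
}

// Sketch of the fan-out performed by the primary index operation tracker:
// once any memory component of the dataset is full, a flush is scheduled
// for every index, so the flushed disk components stay aligned.
class PrimaryIndexOperationTracker {
    private final List<LsmIndex> datasetIndexes; // primary + all secondaries
    private final IoOperationScheduler scheduler;

    PrimaryIndexOperationTracker(List<LsmIndex> indexes, IoOperationScheduler scheduler) {
        this.datasetIndexes = indexes;
        this.scheduler = scheduler;
    }

    void afterTransactionCommit() {
        if (datasetIndexes.stream().anyMatch(LsmIndex::isMemoryComponentFull)) {
            for (LsmIndex index : datasetIndexes) {
                scheduler.scheduleOperation(index.createFlushOperation());
            }
        }
    }
}
```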

For merge, whenever a new disk component is added to an index (due to a flush or merge), the corresponding merge policy is invoked. The merge policy examines the index's existing disk components and, if it decides that some of them need to be merged, submits a merge request to the ILSMIOOperationScheduler. By default, merge requests are submitted for each index independently. However, we currently also have a CorrelatedPrefixPolicy, which checks only the disk components of the primary index and, whenever the primary index needs to be merged, sends a corresponding merge request for every secondary index as well.
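
A sketch of the merge-policy hook described above, again with hypothetical simplified types; the count-based criterion below is a toy stand-in for the size-based prefix criteria:

```java
import java.util.List;

// Hypothetical hook: the policy is consulted whenever a flush or merge
// adds a new disk component to an index.
interface DiskComponent {
    long sizeInBytes();
}

interface MergePolicy {
    void diskComponentAdded(List<DiskComponent> componentsNewestFirst);
}

// Toy policy: merge once an index accumulates too many disk components.
// Real policies (e.g. PrefixMergePolicy) use size-based criteria, but the
// control flow -- inspect the list, maybe submit a merge -- is the same.
class MaxCountMergePolicy implements MergePolicy {
    private final int maxComponents;

    MaxCountMergePolicy(int maxComponents) {
        this.maxComponents = maxComponents;
    }

    @Override
    public void diskComponentAdded(List<DiskComponent> componentsNewestFirst) {
        if (componentsNewestFirst.size() > maxComponents) {
            scheduleMerge(componentsNewestFirst);
        }
    }

    private void scheduleMerge(List<DiskComponent> components) {
        // Placeholder: build a merge operation over these components and
        // hand it to the IO operation scheduler.
    }
}
```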

ILSMIOOperationScheduler has two implementations: a synchronous scheduler and an asynchronous scheduler. The synchronous scheduler simply executes the flush/merge operation immediately in the caller's thread, so all flush/merge operations of a dataset's indexes are executed serially. AsterixDB currently uses the asynchronous scheduler, which maintains a thread pool for executing flush/merge operations. The only guarantee the asynchronous scheduler provides is that the flush operations of a single index are executed serially; there is no restriction on the execution order of merge operations.
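
The difference between the two schedulers can be sketched as follows (a hypothetical simplified interface; the real schedulers also handle per-index flush ordering, error propagation, etc.):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

interface IoScheduler {
    void scheduleOperation(Runnable op);
}

// Synchronous: run the flush/merge inline, so all operations of a dataset
// execute one after another in the caller's thread.
class SynchronousScheduler implements IoScheduler {
    @Override
    public void scheduleOperation(Runnable op) {
        op.run();
    }
}

// Asynchronous: hand the operation to a thread pool. Nothing here orders a
// merge against a flush, or against operations of other indexes; only the
// serial execution of one index's flushes is guaranteed (elsewhere).
class AsynchronousScheduler implements IoScheduler {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);

    @Override
    public void scheduleOperation(Runnable op) {
        pool.submit(op);
    }
}
```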

Correlated Merge Policy

AsterixDB currently has a CorrelatedPrefixPolicy, which tries to coordinate the merge operations of all indexes of a dataset. It uses the same merge criteria as PrefixMergePolicy; the only difference is that it checks only the primary index, and when some disk components of the primary index need to be merged, it also merges the corresponding components (at the same positions in the disk component list) of all secondary indexes.
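
The position-based coordination can be sketched with a hypothetical helper: the merge range is chosen on the primary index's component list, and the same positions are then applied to every secondary index:

```java
import java.util.ArrayList;
import java.util.List;

class CorrelatedMergeSketch {
    // componentLists: index 0 is the primary, the rest are secondaries.
    // [from, to] is the (inclusive) merge range the prefix criteria chose
    // on the primary's list; it is reused verbatim for every secondary,
    // which is valid only while all lists align on the same boundaries.
    static <C> List<List<C>> pickAlignedMergeRanges(List<List<C>> componentLists,
                                                    int from, int to) {
        List<List<C>> mergeRanges = new ArrayList<>();
        for (List<C> components : componentLists) {
            mergeRanges.add(components.subList(from, to + 1));
        }
        return mergeRanges;
    }
}
```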

However, because the asynchronous scheduler provides no guarantee on the execution order of flush/merge operations, the current correlated policy does not work as expected. Consider the following example with a primary index P and secondary indexes S1 and S2, where each initially has three disk components:

  • P: D3, D2, D1
  • S1: D3, D2, D1
  • S2: D3, D2, D1

Now suppose the correlated policy wants to merge D3 and D2 (the D3 and D2 components of P, S1, and S2 will all be merged). While these merge operations are executing, all indexes are flushed. Since there is no guarantee which operation finishes first, one possible execution results in the following situation:

  • P: D4, D3+D2, D1
  • S1: D4, D3, D2, D1
  • S2: D3+D2, D1

Here each flush produces a new disk component D4, while merging D3 and D2 produces the component "D3+D2". P has finished both its merge and its flush, S1 has finished only its flush, and S2 has finished only its merge. Since P finished a flush, the correlated policy is invoked again at this point, but it can no longer coordinate a merge because the disk components of P, S1, and S2 are no longer aligned.

Proposed Solution

The reason the current correlated policy fails is that there is no guarantee on the order in which flush/merge operations finish. To handle this, we propose to extend the current ILSMIOOperationScheduler with one more method, scheduleOperationGroup. The semantics of an operation group are as follows (see the sketch after this list):

  • operation groups of a dataset are always executed serially,
  • operations inside a group would be executed concurrently (each operation is executed in one thread),
  • the user can optionally provide a callback function when submitting an operation group, and the callback function would be called after all operations inside this group are finished.
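
A sketch of what the new method could behave like; SimpleGroupScheduler, its thread-pool sizes, and the single-dataset simplification are assumptions for illustration, not the proposed AsterixDB signature:

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Minimal single-dataset sketch: a dedicated dispatcher thread takes one
// group at a time (serializing groups), fans its operations out to a pool
// (running them concurrently), waits for all of them, then fires the
// callback.
class SimpleGroupScheduler {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    private final ExecutorService dispatcher = Executors.newSingleThreadExecutor();

    void scheduleOperationGroup(List<Runnable> ops, Runnable callback) {
        dispatcher.submit(() -> {
            CountDownLatch done = new CountDownLatch(ops.size());
            for (Runnable op : ops) {
                pool.submit(() -> {
                    try {
                        op.run();
                    } finally {
                        done.countDown();
                    }
                });
            }
            try {
                // Blocking the dispatcher makes the next group wait, which
                // gives the "groups are executed serially" property.
                done.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            callback.run();
        });
    }
}
```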

Furthermore, datasets that need disk component alignment should make the following changes (a usage sketch follows the list):

  • always submit the flush/merge operations for all indexes together using scheduleOperationGroup,
  • the merge policy should be called after an operation group is finished (as a callback function),
  • all newly created disk components should be activated together after an operation group is finished (as a callback function).
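
Reusing the SimpleGroupScheduler sketch above, a hypothetical usage example of these three rules: the flushes of all indexes are submitted as one group, and the callback first activates the new disk components together and only then consults the merge policy, so the policy always observes aligned component lists:

```java
import java.util.List;

class AlignedFlushExample {
    // flushOps: one flush operation per index of the dataset.
    // activateNewComponents: makes all newly written components visible.
    // invokeMergePolicy: may submit the next group (an aligned merge).
    static void flushDataset(SimpleGroupScheduler scheduler,
                             List<Runnable> flushOps,
                             Runnable activateNewComponents,
                             Runnable invokeMergePolicy) {
        scheduler.scheduleOperationGroup(flushOps, () -> {
            activateNewComponents.run();
            invokeMergePolicy.run();
        });
    }
}
```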

Thus, the revised workflow of datasets with alignment could be as follows:

  • P.flush -> callback -> P.flush -> callback -> P.merge -> callback -> P.flush -> callback -> ...
  • S1.flush -> callback -> S1.flush -> callback -> S1.merge -> callback -> S1.flush -> callback -> ...
  • S2.flush -> callback -> S2.flush -> callback -> S2.merge -> callback -> S2.flush -> callback -> ...

Slides

https://drive.google.com/file/d/0B43D7FK4h0PicmRodWZXVWJneFE/view?usp=sharing


Deprecated