Motivation

For some datasets and applications (such as Cloudberry), it is desirable that all disk components of a dataset's primary index and of all its secondary indexes align on the same filter value boundaries. The benefit is that when a tuple is found in some disk component di of a secondary index, we can search the corresponding component di' of the primary index directly to fetch that tuple, without checking any other disk components.

Current Workflow of Flush/Merge

Currently, the flush operation works as follows. After a transaction commits (insert/delete/upsert), if any memory component of any index of a dataset needs to be flushed (i.e., is full), the primary index operation tracker submits a flush request for all indexes of the dataset to the LSMIOOperationScheduler. That is, all indexes of a dataset are flushed together, so the disk components newly generated by a flush are always aligned.
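The flush behavior described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not AsterixDB code: the Index and Dataset classes, the capacity field, and the flush counter are hypothetical names introduced for illustration. It shows that when one index fills up, every index of the dataset is flushed, so each index receives a disk component with the same flush id.

```python
from dataclasses import dataclass, field


@dataclass
class Index:
    """Toy LSM index: one memory component plus a list of flushed disk components."""
    name: str
    capacity: int  # hypothetical: number of entries at which the memory component is full
    memory: list = field(default_factory=list)
    disk_components: list = field(default_factory=list)

    def is_full(self) -> bool:
        return len(self.memory) >= self.capacity


class Dataset:
    """Toy dataset that flushes all of its indexes together."""

    def __init__(self, indexes):
        self.indexes = indexes
        self.flush_count = 0  # hypothetical id shared by components of one flush

    def commit(self, entries_per_index):
        """Apply one committed operation, then flush every index if any one is full."""
        for index in self.indexes:
            index.memory.extend(entries_per_index.get(index.name, []))
        if any(index.is_full() for index in self.indexes):
            self._flush_all()

    def _flush_all(self):
        self.flush_count += 1
        for index in self.indexes:
            # Each index gets a new disk component tagged with the same flush id.
            index.disk_components.append((self.flush_count, tuple(index.memory)))
            index.memory = []


primary = Index("primary", capacity=2)
secondary = Index("secondary", capacity=10)
dataset = Dataset([primary, secondary])

dataset.commit({"primary": ["k1"], "secondary": ["s1"]})
dataset.commit({"primary": ["k2"], "secondary": ["s2"]})  # primary is full: flush all
dataset.commit({"primary": ["k3"], "secondary": ["s3"]})
```

In this model the secondary index is flushed even though it is far from full, which is what keeps the flushed components aligned across indexes.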

For merge, whenever a new disk component is added to an index (by a flush or a merge), the corresponding merge policy is notified. The merge policy checks the existing disk components of that index and, if it decides that some of them need to be merged, submits a merge request to the LSMIOOperationScheduler. By default, merge requests are sent for each index independently. However, there is currently a CorrelatedPrefixPolicy, which checks only the disk components of the primary index and, when the primary index needs to be merged, sends a corresponding merge request for all secondary indexes together.
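The correlated approach can be sketched as follows. This is a simplified illustration, not the actual CorrelatedPrefixPolicy logic: the merge rule used here (merge everything once a component count is exceeded) is a hypothetical stand-in, and all function names are assumptions. Each disk component is modeled as a (first flush id, last flush id) pair, listed oldest first. The point is only that the decision is made from the primary index and the same range is then merged in every index.

```python
def pick_merge_range(components, max_components):
    """Hypothetical rule: once more than max_components exist, merge all of them.

    Returns (start, end) list positions to merge, or None if no merge is needed.
    """
    if len(components) > max_components:
        return (0, len(components))
    return None


def merge_range(components, start, end):
    """Replace components[start:end] with one component covering their flush-id span."""
    merged = (components[start][0], components[end - 1][1])
    return components[:start] + [merged] + components[end:]


def correlated_merge(indexes, primary_name, max_components):
    """Decide using the primary index only, then merge the same range in every index."""
    decision = pick_merge_range(indexes[primary_name], max_components)
    if decision is None:
        return indexes
    start, end = decision
    return {name: merge_range(comps, start, end) for name, comps in indexes.items()}


indexes = {
    "primary": [(1, 1), (2, 2), (3, 3)],
    "secondary": [(1, 1), (2, 2), (3, 3)],
}
after = correlated_merge(indexes, "primary", max_components=2)
```

Because every index merges the same positions, components that were aligned before the merge stay aligned after it. With independent per-index policies, each index could choose a different range and the alignment would be lost.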

Correlated Merge Policy

Proposed Solution

...
