Discussion thread
Vote thread
Issue	https://github.com/apache/incubator-paimon/issues/741
Release	-

Motivation

In Paimon's SortMergeReader, the multiple RecordReader will be used and the same key will be merged using MergeFunction. Essentially, this process is to perform multi-way merging on the value of RecordReader, which is implemented by default in Paimon using heapsort, and the number of comparisons for each adjustment of the heap is 2logN. LoserTree is also mainly used for multi-way merging. Compared with heap sorting, each comparison only needs to be compared with the parent node, while the heapsort needs to be compared with the two child nodes. The number of comparisons of LoserTree(logN) is less than that of heapsort, so LoserTree can be considered for optimization.

Public Interfaces

There is no modification to the existing public interface, only the sorting algorithm is changed from heapsort to LoserTree.

Proposed Changes

Change the multi-way merge algorithm in SortMergeReader from heapsort to LoserTree. Like the regular LoserTree implementation, the sorting process consists of two parts: initialization and tree adjustment. In a conventional LoserTree implementation, it is only necessary to continuously fetch data from the head node and adjust the tree from the bottom up. In Paimon, since the data returned by the same RecordReader will reuse the same object, this will cause the previously returned object to be changed when the RecordReader iterates the data. The returned object may also be saved in MergeFunction, which will lead to incorrect calculation results. Although this problem can be solved by using deepCopy, the overhead is too high.

Therefore, it is necessary to provide a variant implementation of LoserTree: after the external merging of the same userKey is completed, the data of these RecordReader is iterated. The algorithm of LoserTree is as follows.

Preconditions

Each key consists of two parts: userKey and sequenceNumber, that is, two comparators will be provided;
Each RecordReader is ordered, and a single RecordReader does not contain the same userKey.

Initialization

Initialize LoserTree from the bottom up in the same way as regular LoserTree.

Tree Adjustment

When adjusting the tree, due to the problem of object reuse, we cannot directly iterate the RecordReader to the next data, so we need to mark the data first, similar to setting the sequenceNumber to infinite, and then adjust from the bottom up. In this way, nodes with the same userKey can be accessed eventually. Every time the tree is adjusted, the overhead of userKey comparison is relatively high. In the previous process of building or adjusting LoserTree, this userKey has already been compared, so we introduced a state machine to reduce duplicated comparisons.

State Definition

Here we define a total of 6 states to represent the different states of the node.

WINNER_WITH_NEW_KEY: new local winner, and use a different userKey from the previous winner;
WINNER_WITH_SAME_KEY: use the same userKey as the previous global winner, but the sequenceNumber is larger;
WINNER_POPPED: indicates that the global winner has been popped, and it is also used to determine whether there is an unprocessed same userKey in the tree;
LOSER_WITH_NEW_KEY: does not have the same userKey as the local winner who defeated it;
LOSER_WITH_SAME_KEY: It has the same userKey as the local winner who defeated it;
LOSER_POPPED: It has the same userKey as the last popped global winner, and the node has also been popped;

State Transition

When comparing two nodes and performing state transitions, follow the following rules:

When each leaf node iterates to generate a new Key, the state is initialized to WINNER_WITH_NEW_KEY;
When the head node of the tree is popped, the state of the corresponding leaf node is transition to WINNER_POPPED, which can be regarded as the userKey remains unchanged, but the sequenceNumber is set to infinity;
Depending on the state of the local winner node, different state transitions will be performed when encountering a parent node:
1. The local winner is WINNER_WITH_NEW_KEY, and the state of the parent node is:
  1. LOSER_WITH_NEW_KEY：Two nodes need to perform compare to calculate a new winner; if the userKey of the two nodes is the same, the loser state will transition to LOSER_WITH_SAME_KEY;
  2. LOSER_WITH_SAME_KEY：It is an impossible case. WINNER_WITH_NEW_KEY means that we have started a new round, so that all nodes in the tree with the same userKey as the last global winner should be popped.
  3. LOSER_POPPED：No need to compare, the parent node wins and transition to WINNER_POPED, and the child node transition to LOSER_WITH_NEW_KEY;
2. The local winner is WINNER_WITH_SAME_KEY, and the state of the parent node is:
  1. LOSER_WITH_NEW_KEY：No need to compare, the child node wins, no need to transition state;
  2. LOSER_WITH_SAME_KEY：only need to compare the sequenceNumber of two nodes, no need to compare userKey, reduce comparison overhead; the winner transition to WINNER_WITH_SAME_KEY, and the loser transition to LOSER_WITH_SAME_KEY
  3. LOSER_POPPED：No need to compare, the child node wins, no need to transition state;
3. The local winner is WINNER_POPED, and the state of the parent node is:

Optimization

According to the above process, a LoserTree variant can be implemented, but each time a winner is popped from the head node, the node still needs to be readjusted from the bottom up, which will cause unnecessary adjustments. In extreme cases, when there are no duplicated userKeys in the entire source, we need to do two tree adjustments after each pop of a key: one is to set the sequenceNumber to infinite, and the other is to iterate the RecordReader once. In this way, LoserTree may perform worse than heapsort.

By adding the firstSameKeyIndex field in the leaf node, we can record the position of the same userKey we won for the first time, so that we can quickly distinguish whether there is the same userKey in the tree. If so, we can directly transition the state of these two nodes, and start to adjust upwards from this position, thereby reducing the number of adjustments.

Algorithm Proof

In actual use, each round of iteration will merge all the same userKeys before iterating the RecordReader. Therefore, we only need to prove that all the data of the same userKey in this round will be returned.

- Theory: When the firstSameKeyIndex of the global winner is -1, there is no unprocessed node in the tree with the same userKey as the global winner.

- Proof: According to the definition of LoserTree, any of its subtrees is LoserTree. Assume that the current global winner comes from leaf node A, and there is a node in the tree that has the same userKey as the global winner but has not yet been popped, and it comes from leaf node B. The nearest common ancestor of A and B is node C, and A and B come from the left and right subtrees respectively.

It is known that A must participate in the comparison of node C. Since node B and node A have the same minimum userKey, then node B will either become the winner of the right subtree, or be defeated by a node with the same userKey. Finally, the winner of the right subtree of C must be the node with the same userKey as node A, so firstSameKeyIndex cannot be -1. This proves that when the firstSameKeyIndex of the global winner is -1, there will not be an unprocessed node with the same userKey of the global winner in the tree.

Compatibility, Deprecation, and Migration Plan

There are no changes to the public interface and no impact to existing users.
The old sorting algorithm is directly replaced by the new sorting algorithm, no compatibility issues are expected.

Test Plan

At present, we already have SortMergeReaderTestBase to verify that the result of multi-way merge is correct, and a test about LoserTree will be added later to verify that the result of sorting is correct.

Rejected Alternatives

None.

Page tree

PIP-2: Optimize SortMergeReader with LoserTree