Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Status

Current state: "Under Discussion"

Discussion thread

JIRA: here (<- link to

Jira
serverASF JIRA
serverId5aa69414-a9e9-3523-82ec-879b028fb15b
keyFLINK-27982
)

...

In the field of data analysis, star-schema[1] is the simplest style of and most widely used data mart patterns. It's natural in a star schema to map one or more dimensions to partition columns. Detecting and avoiding unnecessary data scans is the most important responsibility of the optimizer. Currently, Flink supports static partition pruning: the conditions in the WHERE clause are analyzed to determine in advance which partitions can be safely skipped in the optimization phase. Another common scenario: the partitions information is not available in the optimization phase but in the execution phase. That's the problem this FLIP is trying to solve: dynamic partition pruning, which could reduce the partition table source IO.

...


In this FLIP we will introduce a mechanism for detecting dynamic partition pruning patterns in optimization phase and performing partition pruning at runtime by sending the dimension table results to the SplitEnumerator of fact table via existed coordinator existing coordinator mechanism.

In above example, the result from date_dim which d_year is 2000 will be send to join build side and the Coordinator, and the SplitEnumerator of store_returns will filter the partitions based on the filtered date_dim data, and send the real required splits to the source scan of store_returns.

...

Since the split enumerator is executed in JM, the filtered partition data from dim-source operator should be send to the split enumerator via existed coordinator existing coordinator mechanism. So we introduce DynamicPartitionEvent to wrap the partition data.

...

Currently, for FileSystem connector and Hive connector, the splits are created by FileEnumerator. However, existed existing FileEnumerators do not match the requirement, therefore we introduce a new FileEnumerator named DynamicFileEnumerator to support creating splits based on partition.

...