...

We intend to introduce the following new configuration parameters.

| Key | Type | Default Value | Description |
|---|---|---|---|
| table.optimizer.runtime-filter.enabled | Boolean | false | A flag to enable or disable the runtime filter. |
| table.optimizer.runtime-filter.max-build-data-size | MemorySize | 10 MB | Data volume threshold of the runtime filter build side. Estimated data volume needs to be under this value to try to inject runtime filter. |
| table.optimizer.runtime-filter.min-probe-data-size | MemorySize | 10 GB | Data volume threshold of the runtime filter probe side. Estimated data volume needs to be over this value to try to inject runtime filter. |
| table.optimizer.runtime-filter.min-filter-ratio | Double | 0.5 | Filter ratio threshold of the runtime filter. Estimated filter ratio needs to be over this value to try to inject runtime filter. |

Proposed Changes

The Whole Workflow for Runtime Filter

...

In the runtime part, we need to add two new operators: RuntimeFilterBuilderOperator and RuntimeFilterOperator. The scheduler schedules tasks in topological order: first the build side, then the RuntimeFilterBuilder, and then the RuntimeFilter (the built filter is sent via the data edge from RuntimeFilterBuilder to RuntimeFilter).


CREATE TABLE dim (
  x INT,
  y INT,
  z BIGINT);

CREATE TABLE fact (
  a INT,
  b BIGINT,
  c INT,
  d VARCHAR,
  e BIGINT);

SELECT * FROM fact, dim WHERE x = a AND z = 2;

Next, we use the query above (dim is the small table and fact is the large table) as an example to show the whole workflow of the runtime filter (Figure 1 shows the planner changes, Figure 2 shows the whole workflow):

...

  1. The estimated data volume of the build side needs to be under the value of "table.optimizer.runtime-filter.max-build-data-size". If the data volume of the build side is too large, the cost of building the filter becomes too high, which may hurt job performance.
  2. The estimated data volume of the probe side needs to be over the value of "table.optimizer.runtime-filter.min-probe-data-size". If the data volume on the probe side is too small, the overhead of building the runtime filter is not worth it.
  3. The filter ratio needs to be over the value of "table.optimizer.runtime-filter.min-filter-ratio". As with condition 2, a low filter ratio means building the runtime filter is not worth it. The filter ratio can be estimated by the following formula:

...
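The three injection conditions above can be sketched as a single predicate. This is only an illustration, not the planner code: the class and method names are hypothetical, the estimated filter ratio is taken as an input because its formula is not reproduced here, and the strict comparisons and the reading of 10 MB and 10 GB as binary units are assumptions.

```java
/**
 * Illustrative sketch of the three planner-side checks.
 * RuntimeFilterInjectionCheck and shouldInject are hypothetical names,
 * not part of the proposal.
 */
final class RuntimeFilterInjectionCheck {
    // Defaults from the configuration table (binary units are an assumption).
    static final long DEFAULT_MAX_BUILD_DATA_SIZE = 10L * 1024 * 1024;        // 10 MB
    static final long DEFAULT_MIN_PROBE_DATA_SIZE = 10L * 1024 * 1024 * 1024; // 10 GB
    static final double DEFAULT_MIN_FILTER_RATIO = 0.5;

    /** Checks the three conditions against explicitly supplied thresholds. */
    static boolean shouldInject(
            long estimatedBuildBytes,
            long estimatedProbeBytes,
            double estimatedFilterRatio,
            long maxBuildDataSize,
            long minProbeDataSize,
            double minFilterRatio) {
        // Condition 1: the build side must be small enough to build cheaply.
        boolean buildSmallEnough = estimatedBuildBytes < maxBuildDataSize;
        // Condition 2: the probe side must be large enough to be worth filtering.
        boolean probeLargeEnough = estimatedProbeBytes > minProbeDataSize;
        // Condition 3: the filter must be expected to remove enough probe rows.
        boolean ratioHighEnough = estimatedFilterRatio > minFilterRatio;
        return buildSmallEnough && probeLargeEnough && ratioHighEnough;
    }

    /** Same check using the default thresholds. */
    static boolean shouldInject(
            long estimatedBuildBytes, long estimatedProbeBytes, double estimatedFilterRatio) {
        return shouldInject(
                estimatedBuildBytes,
                estimatedProbeBytes,
                estimatedFilterRatio,
                DEFAULT_MAX_BUILD_DATA_SIZE,
                DEFAULT_MIN_PROBE_DATA_SIZE,
                DEFAULT_MIN_FILTER_RATIO);
    }
}
```

With the defaults, a 2 MB build side joined against a 50 GB probe side with an estimated filter ratio of 0.8 would qualify, while the same plan with a 20 MB build side would not.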

RuntimeFilterBuilderOperator is responsible for building a bloom filter from the build-side data. While building the bloom filter, it should also compare the size of the received data with "table.optimizer.runtime-filter.max-build-data-size". Once this limit is exceeded, it outputs a fake filter (one that always returns true). The parallelism of this operator will be 1, and the built bloom filter will be broadcast to all related RuntimeFilterOperators.
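The fallback behaviour described above can be sketched as follows. This is a simplified stand-in, not the proposed operator: the class name, the BitSet-based bloom filter, the hash mixing, and the per-record byte count passed by the caller are all assumptions made for illustration, and the broadcast to the RuntimeFilterOperators is not modeled.

```java
import java.util.BitSet;
import java.util.function.LongPredicate;

/** Hypothetical sketch of a builder that degrades to an always-true filter. */
final class RuntimeFilterBuilderSketch {
    private final long maxBuildDataSize; // runtime counterpart of max-build-data-size
    private final int numBits;
    private final int numHashes;
    private final BitSet bits;
    private long receivedBytes = 0;
    private boolean overLimit = false;

    RuntimeFilterBuilderSketch(long maxBuildDataSize, int numBits, int numHashes) {
        this.maxBuildDataSize = maxBuildDataSize;
        this.numBits = numBits;
        this.numHashes = numHashes;
        this.bits = new BitSet(numBits);
    }

    /** Consumes one build-side record: its join key and its size in bytes. */
    void add(long joinKey, long recordBytes) {
        if (overLimit) {
            return; // already degraded, nothing left to build
        }
        receivedBytes += recordBytes;
        if (receivedBytes > maxBuildDataSize) {
            overLimit = true; // actual data exceeded the estimate-based threshold
            bits.clear();
            return;
        }
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(joinKey, i));
        }
    }

    /** Returns the filter to send downstream once the build side is finished. */
    LongPredicate finish() {
        if (overLimit) {
            return key -> true; // fake filter: lets every probe record through
        }
        final BitSet snapshot = (BitSet) bits.clone();
        return key -> {
            for (int i = 0; i < numHashes; i++) {
                if (!snapshot.get(index(key, i))) {
                    return false; // definitely not on the build side
                }
            }
            return true; // possibly on the build side
        };
    }

    // Double hashing: derive the i-th bit position from two halves of one mixed hash.
    private int index(long key, int i) {
        long h = mix(key);
        int h1 = (int) h;
        int h2 = (int) (h >>> 32);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    // 64-bit finalizer-style mixing to spread nearby keys across the bit set.
    private static long mix(long x) {
        x ^= x >>> 33;
        x *= 0xff51afd7ed558ccdL;
        x ^= x >>> 33;
        x *= 0xc4ceb9fe1a85ec53L;
        x ^= x >>> 33;
        return x;
    }
}
```

Returning an always-true predicate keeps the probe side correct when the build side turns out larger than estimated: the RuntimeFilter then passes every record and the join itself does all the filtering.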

...